
As of PySpark 1.6, I cannot find a built-in DataFrame function to convert a column from string to float/double.

Assume we have an RDD of ('house_name', 'price') pairs, with both values stored as strings, and we want to convert price from string to float. With an RDD, we can apply map with Python's float function to achieve this:

New_RDD = RawDataRDD.map(lambda row: (row[0], float(row[1])))    # this works

With a PySpark 1.6 DataFrame, the same approach does not work:

New_DF = rawdataDF.select('house name', float('price')) # did not work: float('price') raises ValueError in Python

Until a built-in PySpark function is available, how can I achieve this conversion with a UDF? I developed the following conversion UDF:

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

def string_to_float(x):
    return float(x)

# The declared return type must match what the function returns
udfstring_to_float = udf(string_to_float, DoubleType())
rawdata.withColumn("price", udfstring_to_float("price"))

Is there a better and much simpler way to achieve the same?

2 Answers


According to the documentation, you can use the cast function on a column like this:

from pyspark.sql.types import DoubleType
rawdata.withColumn("price", rawdata["price"].cast(DoubleType()))

3 Comments

This does not work for me @Jaco. The OP says he is using PySpark 1.6, and the documentation you linked to is for 1.3. When I try this on 1.6, I get AttributeError: 'DoubleType' object has no attribute 'alias'
Do you have the import from pyspark.sql.types import DoubleType? I am sure I tested this on PySpark 1.6 before posting.
FIX: Should be rawdata.withColumn("house name",rawdata["price"].cast(DoubleType()).alias("price")) instead

The answer should be as follows:

>>> rawdata.printSchema()
root
 |-- house name: string (nullable = true)
 |-- price: string (nullable = true)

>>> rawdata=rawdata.withColumn('price',rawdata['price'].cast("float").alias('price'))

>>> rawdata.printSchema()
root
 |-- house name: string (nullable = true)
 |-- price: float (nullable = true)

This is the shortest one-line solution, and it needs no user-defined function. You can verify that it worked by calling printSchema() again.
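For anyone who wants to compare the two approaches side by side, here is a minimal sketch. It assumes Spark 2.0 or later, where SparkSession is available (on 1.6 you would build the DataFrame through an SQLContext instead). The names to_float_or_none and demo and the sample rows are hypothetical, made up for illustration only. The UDF variant returns None for missing or non-numeric input rather than raising, since float(None) would otherwise fail inside the UDF.

```python
def to_float_or_none(value):
    """Plain-Python parser for the UDF: returns None instead of raising."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def demo():
    # Imported here so the helper above can be used and tested without Spark.
    # Assumption: Spark 2.0+ (SparkSession does not exist in 1.6).
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    spark = (
        SparkSession.builder.master("local[1]").appName("cast-demo").getOrCreate()
    )

    # Hypothetical sample data; both columns start out as strings.
    rawdata = spark.createDataFrame(
        [("Villa", "250000.50"), ("Cottage", "99000")],
        ["house name", "price"],
    )

    # Option 1: built-in cast, no UDF involved.
    with_cast = rawdata.withColumn("price", rawdata["price"].cast("double"))
    with_cast.printSchema()

    # Option 2: UDF whose declared return type matches the Python float.
    to_double = udf(to_float_or_none, DoubleType())
    with_udf = rawdata.withColumn("price", to_double(rawdata["price"]))
    with_udf.printSchema()

    spark.stop()
```

Calling demo() on a machine with PySpark installed should print two schemas, each listing price as double. The cast version is usually preferable because it stays inside Spark's own expression engine, while the UDF has to send every value through the Python interpreter.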
