I'm reading a CSV file into a DataFrame:

df = spark.read.csv(fileName, header=True)

but every column in df ends up as type string. I want to cast them to float. Is there an efficient way to do this?
If you want to do the casting when reading the CSV, you can use the inferSchema argument. Let's try with a small test CSV file:
$ cat ../data/test.csv
a,b,c,d
5.0, 1.0, 1.0, 3.0
2.0, 0.0, 3.0, 4.0
4.0, 0.0, 0.0, 6.0
Now, if we read it as you did, we will have string values:
>>> df_csv = spark.read.csv("../data/test.csv", header=True)
>>> print(df_csv.dtypes)
[('a', 'string'), ('b', 'string'), ('c', 'string'), ('d', 'string')]
However, if we set inferSchema to True, it will correctly identify them as doubles:
>>> df_csv2 = spark.read.csv("../data/test.csv", header=True, inferSchema=True)
>>> print(df_csv2.dtypes)
[('a', 'double'), ('b', 'double'), ('c', 'double'), ('d', 'double')]
However, this approach requires an extra pass over the data to infer the types. To avoid that, you can declare the types up front and pass them to the reader via the schema= argument. You can find more information in the DataFrameReader.csv documentation.