PySpark Create DataFrame With Float TypeError

Question

I have data sets as below:

I am using PySpark to parse the data and then create a DataFrame, using the code below:

from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql import functions as f

def parseInput(line):
    fields = line.split(',')
    stationID=fields[0]
    entryType=fields[2]
    temperature= fields[3]*0.3
    return Row(stationID,entryType,temperature)

spark = SparkSession.builder.appName("MinTemperatures").getOrCreate()
lines = spark.sparkContext.textFile("data/1800.csv")
temperatures = lines.map(parseInput)
minTemps=temperatures.filter(lambda x:x[1]=='TMIN')
df = spark.createDataFrame(minTemps)

I got the error below:

TypeError: can't multiply sequence by non-int of type 'float'

If I remove the *0.3 from temperature= fields[3]*0.3, creating the DataFrame works. How can I return the temperature as a float after applying some basic math operation?

Addy · Accepted Answer · 2020-07-11 02:54:37Z

2

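Try temperature = float(fields[3]) * 0.3. line.split(',') returns strings, and Python cannot multiply a string (a sequence) by a float, which is what the TypeError is reporting.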
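As a minimal sketch of that fix, the parsing step can be checked without Spark. The sample line is hypothetical (the question's data was not shown) and only follows the column positions the question's code implies. A plain tuple stands in for Row here so the snippet runs on its own.

```python
def parse_input(line):
    """Parse one CSV line into (station_id, entry_type, temperature)."""
    fields = line.split(',')
    station_id = fields[0]
    entry_type = fields[2]
    # split() yields strings, so convert to float before doing arithmetic
    temperature = float(fields[3]) * 0.3
    return station_id, entry_type, temperature


# Hypothetical sample line; the real file layout is an assumption
sample = "STATION_A,18000101,TMIN,-148,,,E,"
parsed = parse_input(sample)
print(parsed)
```

In the original job, the same function body would return Row(station_id, entry_type, temperature) and be passed to lines.map(...) as before.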


1 Comment

Chelseajcole Over a year ago

Thank you for the answer

Simon · Accepted Answer · 2020-07-11 05:08:43Z

1

You can read the file without the multiplication first, then cast the column to double, and do the multiplication last.

I assume your CSV file has a header.
The following code does the cast:

data = data.withColumn("COLUMN_NAME", data["COLUMN_NAME"].cast("double"))


1 Comment

Chelseajcole Over a year ago

Thank you for the answer
