
When I try to import a local CSV with Spark, every column is read in as a string by default. However, my columns only include integers and a timestamp type. To be more specific, the CSV looks like this:

"Customer","TransDate","Quantity","PurchAmount","Cost","TransID","TransKey"
149332,"15.11.2005",1,199.95,107,127998739,100000

I have found code that should work in this question, but when I execute it all the entries are returned as NULL.

I use the following to create a custom schema:

from pyspark.sql.types import LongType, StringType, StructField, StructType, BooleanType, ArrayType, IntegerType, TimestampType

customSchema = StructType([
        StructField("Customer", IntegerType(), True),
        StructField("TransDate", TimestampType(), True),
        StructField("Quantity", IntegerType(), True),
        StructField("Cost", IntegerType(), True),
        StructField("TransKey", IntegerType(), True)])

and then read in the CSV with:

myData = spark.read.load('myData.csv', format="csv", header="true", sep=',', schema=customSchema)

Which returns:

+--------+---------+--------+----+--------+
|Customer|TransDate|Quantity|Cost|TransKey|
+--------+---------+--------+----+--------+
|    null|     null|    null|null|    null|
+--------+---------+--------+----+--------+

Am I missing a crucial step? I suspect the date column is the root of the problem. Note: I am running this in Google Colab.

  • I'm surprised that the integers are being read incorrectly. Those dates definitely won't work because they're not in the YYYY-MM-DD format that's expected. I would recommend reading the CSV with inferSchema=True (for example: myData = spark.read.csv("myData.csv", header=True, inferSchema=True)) and then manually converting the timestamp fields from string to date; see the sketch after these comments. Commented Oct 26, 2018 at 17:01
  • Oh now I see the problem: you passed in header="true" instead of header=True. You need to pass it as a boolean, but you'll still get nulls for the timestamps because of the incorrect format. Commented Oct 26, 2018 at 17:04
  • What is wrong with header = "true"? Commented Oct 26, 2018 at 17:32
  • @Prazy though the documentation is unclear, I am pretty sure that header should be one of (False, True, None) (boolean/None vs. string). Commented Oct 26, 2018 at 18:48
  • @pault header = "true" always works for me. Commented Oct 26, 2018 at 18:50
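
A minimal sketch of the approach suggested in the comments, assuming Spark 2.2+ (where to_timestamp accepts a format string) and the file name from the question:

from pyspark.sql.functions import col, to_timestamp

# Let Spark infer the numeric columns, then parse the dd.MM.yyyy strings
# into a proper timestamp column.
myData = spark.read.csv("myData.csv", header=True, inferSchema=True)
myData = myData.withColumn("TransDate", to_timestamp(col("TransDate"), "dd.MM.yyyy"))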

2 Answers


Here you go!

"Customer","TransDate","Quantity","PurchAmount","Cost","TransID","TransKey"
149332,"15.11.2005",1,199.95,107,127998739,100000
PATH_TO_FILE="file:///u/vikrant/LocalTestDateFile"
Loading above file to dataframe:
df = spark.read.format("com.databricks.spark.csv") \
  .option("mode", "DROPMALFORMED") \
  .option("header", "true") \
  .option("inferschema", "true") \
  .option("delimiter", ",").load(PATH_TO_FILE)

Your date will be loaded as a string column, but the moment you cast it to date type, Spark cannot parse this format and returns NULL:

from pyspark.sql.functions import col

df = df.withColumn('TransDate', col('TransDate').cast('date'))

+--------+---------+--------+-----------+----+---------+--------+
|Customer|TransDate|Quantity|PurchAmount|Cost|  TransID|TransKey|
+--------+---------+--------+-----------+----+---------+--------+
|  149332|     null|       1|     199.95| 107|127998739|  100000|
+--------+---------+--------+-----------+----+---------+--------+

So we need to change the date format from dd.MM.yyyy to yyyy-MM-dd.

from datetime import datetime
from pyspark.sql.functions import col, udf
from pyspark.sql.types import DateType

Python function to change the date format:

change_dateformat_func = udf(lambda x: datetime.strptime(x, '%d.%m.%Y').strftime('%Y-%m-%d'))

Now call this function on your dataframe column:

newdf = df.withColumn('TransDate', change_dateformat_func(col('TransDate')).cast(DateType()))

+--------+----------+--------+-----------+----+---------+--------+
|Customer| TransDate|Quantity|PurchAmount|Cost|  TransID|TransKey|
+--------+----------+--------+-----------+----+---------+--------+
|  149332|2005-11-15|       1|     199.95| 107|127998739|  100000|
+--------+----------+--------+-----------+----+---------+--------+

and below is the Schema:

 |-- Customer: integer (nullable = true)
 |-- TransDate: date (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- PurchAmount: double (nullable = true)
 |-- Cost: integer (nullable = true)
 |-- TransID: integer (nullable = true)
 |-- TransKey: integer (nullable = true)
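
As a side note, on Spark 2.2+ the same conversion can be done without a Python UDF by giving the built-in to_date a format string. A minimal sketch, assuming the same df as above:

from pyspark.sql.functions import col, to_date

# Parse the dd.MM.yyyy strings directly; built-in functions avoid the
# serialization overhead of a Python UDF.
newdf = df.withColumn('TransDate', to_date(col('TransDate'), 'dd.MM.yyyy'))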

Let me know if it works for you.



You can specify an option ('dateFormat', 'd.M.y') on the DataFrameReader to parse the date in a particular format:

df = spark.read.format("csv").option("header","true").option("dateFormat","d.M.y").schema(my_schema).load("path_to_csv")

Reference
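
For completeness, my_schema is not defined in the answer; a minimal sketch of what it might look like for this file is below. The column types are assumptions, and TransDate is declared as DateType so that dateFormat applies (for a TimestampType column you would set timestampFormat instead):

from pyspark.sql.types import (StructType, StructField, IntegerType,
                               DoubleType, DateType)

# Hypothetical schema matching the sample CSV; adjust types as needed.
my_schema = StructType([
    StructField("Customer", IntegerType(), True),
    StructField("TransDate", DateType(), True),   # parsed using dateFormat
    StructField("Quantity", IntegerType(), True),
    StructField("PurchAmount", DoubleType(), True),
    StructField("Cost", IntegerType(), True),
    StructField("TransID", IntegerType(), True),
    StructField("TransKey", IntegerType(), True),
])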
