
When I try to import a local CSV with Spark, every column is read in as a string by default. However, my columns only include integers and a timestamp type. To be more specific, the CSV looks like this:

"Customer","TransDate","Quantity","PurchAmount","Cost","TransID","TransKey"
149332,"15.11.2005",1,199.95,107,127998739,100000

I have found code that should work in this question, but when I execute it all the entries are returned as NULL.

I use the following to create a custom schema:

from pyspark.sql.types import LongType, StringType, StructField, StructType, BooleanType, ArrayType, IntegerType, TimestampType

customSchema = StructType([
        StructField("Customer", IntegerType(), True),
        StructField("TransDate", TimestampType(), True),
        StructField("Quantity", IntegerType(), True),
        StructField("Cost", IntegerType(), True),
        StructField("TransKey", IntegerType(), True)])

and then read in the CSV with:

myData = spark.read.load('myData.csv', format="csv", header="true", sep=',', schema=customSchema)

Which returns:

+--------+---------+--------+----+--------+
|Customer|TransDate|Quantity|Cost|TransKey|
+--------+---------+--------+----+--------+
|    null|     null|    null|null|    null|
+--------+---------+--------+----+--------+

Am I missing a crucial step? I suspect the date column is the root of the problem. Note: I am running this in Google Colab.

  • I'm surprised that the integers are being read incorrectly. Those dates definitely won't work because they're not in the YYYY-MM-DD format that's expected. I would recommend reading the CSV with inferSchema=True (for example: myData = spark.read.csv("myData.csv", header=True, inferSchema=True)) and then manually converting the timestamp fields from string to date; see the sketch after these comments. Commented Oct 26, 2018 at 17:01
  • Oh now I see the problem: you passed in header="true" instead of header=True. You need to pass it as a boolean, but you'll still get nulls for the timestamps because of the incorrect format. Commented Oct 26, 2018 at 17:04
  • What is wrong with header = "true"? Commented Oct 26, 2018 at 17:32
  • @Prazy though the documentation is unclear, I am pretty sure that header should be one of (False, True, None) (boolean/None vs. string). Commented Oct 26, 2018 at 18:48
  • @pault header = "true" always works for me. Commented Oct 26, 2018 at 18:50
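
A minimal sketch of the approach suggested in the comments, assuming Spark 2.2+ (where to_timestamp accepts a format string) and the file name from the question:

from pyspark.sql.functions import col, to_timestamp

# Let Spark infer the numeric columns, then parse the dd.MM.yyyy strings
# into a proper timestamp column.
myData = spark.read.csv("myData.csv", header=True, inferSchema=True)
myData = myData.withColumn("TransDate", to_timestamp(col("TransDate"), "dd.MM.yyyy"))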

2 Answers


Here you go!

"Customer","TransDate","Quantity","PurchAmount","Cost","TransID","TransKey"
149332,"15.11.2005",1,199.95,107,127998739,100000
PATH_TO_FILE="file:///u/vikrant/LocalTestDateFile"
Loading above file to dataframe:
df = spark.read.format("com.databricks.spark.csv") \
  .option("mode", "DROPMALFORMED") \
  .option("header", "true") \
  .option("inferschema", "true") \
  .option("delimiter", ",").load(PATH_TO_FILE)

Your date will be loaded as a string column, but the moment you cast it to date type, Spark cannot parse this format and returns NULL:

from pyspark.sql.functions import col

df = df.withColumn('TransDate', col('TransDate').cast('date'))

+--------+---------+--------+-----------+----+---------+--------+
|Customer|TransDate|Quantity|PurchAmount|Cost|  TransID|TransKey|
+--------+---------+--------+-----------+----+---------+--------+
|  149332|     null|       1|     199.95| 107|127998739|  100000|
+--------+---------+--------+-----------+----+---------+--------+

So we need to change the date format from dd.MM.yyyy to yyyy-MM-dd.

from datetime import datetime
from pyspark.sql.functions import col, udf
from pyspark.sql.types import DateType

Python function to change the date format:

change_dateformat_func = udf(lambda x: datetime.strptime(x, '%d.%m.%Y').strftime('%Y-%m-%d'))

Now call this function on your dataframe column:

newdf = df.withColumn('TransDate', change_dateformat_func(col('TransDate')).cast(DateType()))

+--------+----------+--------+-----------+----+---------+--------+
|Customer| TransDate|Quantity|PurchAmount|Cost|  TransID|TransKey|
+--------+----------+--------+-----------+----+---------+--------+
|  149332|2005-11-15|       1|     199.95| 107|127998739|  100000|
+--------+----------+--------+-----------+----+---------+--------+

and below is the Schema:

 |-- Customer: integer (nullable = true)
 |-- TransDate: date (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- PurchAmount: double (nullable = true)
 |-- Cost: integer (nullable = true)
 |-- TransID: integer (nullable = true)
 |-- TransKey: integer (nullable = true)
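
As a side note, on Spark 2.2+ the same conversion can be done without a Python UDF by giving the built-in to_date a format string. A minimal sketch, assuming the same df as above:

from pyspark.sql.functions import col, to_date

# Parse the dd.MM.yyyy strings directly; built-in functions avoid the
# serialization overhead of a Python UDF.
newdf = df.withColumn('TransDate', to_date(col('TransDate'), 'dd.MM.yyyy'))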

Let me know if it works for you.



You can specify an option ('dateFormat', 'd.M.y') on the DataFrameReader to parse the date in a particular format:

df = spark.read.format("csv").option("header","true").option("dateFormat","d.M.y").schema(my_schema).load("path_to_csv")

Reference
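
For completeness, my_schema is not defined in the answer; a minimal sketch of what it might look like for this file is below. The column types are assumptions, and TransDate is declared as DateType so that dateFormat applies (for a TimestampType column you would set timestampFormat instead):

from pyspark.sql.types import (StructType, StructField, IntegerType,
                               DoubleType, DateType)

# Hypothetical schema matching the sample CSV; adjust types as needed.
my_schema = StructType([
    StructField("Customer", IntegerType(), True),
    StructField("TransDate", DateType(), True),   # parsed using dateFormat
    StructField("Quantity", IntegerType(), True),
    StructField("PurchAmount", DoubleType(), True),
    StructField("Cost", IntegerType(), True),
    StructField("TransID", IntegerType(), True),
    StructField("TransKey", IntegerType(), True),
])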
