When I try to import a local CSV with Spark, every column is read in as a string by default. However, my columns only include integers and a timestamp type. To be more specific, the CSV looks like this:
"Customer","TransDate","Quantity","PurchAmount","Cost","TransID","TransKey"
149332,"15.11.2005",1,199.95,107,127998739,100000
I have found code that should work in this question, but when I execute it all the entries are returned as NULL.
I use the following to create a custom schema:
from pyspark.sql.types import StructType, StructField, IntegerType, TimestampType

customSchema = StructType([
    StructField("Customer", IntegerType(), True),
    StructField("TransDate", TimestampType(), True),
    StructField("Quantity", IntegerType(), True),
    StructField("Cost", IntegerType(), True),
    StructField("TransKey", IntegerType(), True)])
and then read in the CSV with:
myData = spark.read.load('myData.csv', format="csv", header="true", sep=',', schema=customSchema)
Which returns:
+--------+---------+--------+----+--------+
|Customer|TransDate|Quantity|Cost|Transkey|
+--------+---------+--------+----+--------+
| null| null| null|null| null|
+--------+---------+--------+----+--------+
Am I missing a crucial step? I suspect that the Date column is the root of the problem. Note: I am running this in Google Colab.
Your dates are in dd.MM.yyyy, not the YYYY-MM-DD format that's expected, so the timestamp parse fails and the rows come back as null. I would recommend reading the csv using inferSchema=True (for example, myData = spark.read.csv("myData.csv", header=True, inferSchema=True)) and then manually converting the Timestamp fields from string to date.
Also note that you pass header="true" instead of header=True. header should be one of (False, True, None), i.e. a boolean or None rather than a string. Fixing that alone won't help, though: you'll still get nulls for the timestamps because of the incorrect format.
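A minimal sketch of that approach, assuming the file name from the question and that TransDate always follows the dd.MM.yyyy pattern of the sample row:

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date

spark = SparkSession.builder.getOrCreate()

# inferSchema picks up the numeric columns; TransDate stays a string.
myData = spark.read.csv("myData.csv", header=True, inferSchema=True)

# Convert the "15.11.2005"-style strings into a proper date column.
# The "dd.MM.yyyy" pattern is assumed from the sample row in the question.
myData = myData.withColumn("TransDate", to_date("TransDate", "dd.MM.yyyy"))

myData.printSchema()

If you need a TimestampType rather than a DateType, to_timestamp from pyspark.sql.functions takes the same arguments.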