
I have a CSV file with 300 columns. Out of these 300 columns, I need only 3, so I defined a schema for those 3. But when I apply the schema to the DataFrame, it shows only 3 columns and incorrectly maps them to the first 3 columns of the file. It does not match the CSV column names with the StructFields of my schema. Please advise.


from pyspark.sql.types import *

dfschema = StructType([
    StructField("Call Number",IntegerType(),True),
    StructField("Incident Number",IntegerType(),True),
    StructField("Entry DtTm",DateType() ,True)
]) 

df = spark.read.format("csv")\
             .option("header","true")\
             .schema(dfschema)\
             .load("/FileStore/*/*")
df.show(5)
  • Could you please include the output of df.printSchema() after your last line? Commented Sep 4, 2022 at 4:29

1 Answer


This is actually the expected behaviour of Spark's CSV reader.

When you supply a schema, Spark maps it to the CSV columns by position, not by name, which is why your three fields end up on the first three columns of the file. If the columns in the CSV file do not match the supplied schema, Spark treats the rows as corrupt records. The easiest way to see this is to add another column, _corrupt_record, of type string to the schema; you will see that all rows are stored in this column.
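To see this in action, here is a minimal sketch, assuming the column names and path from the question: the original three-field schema plus a string field named _corrupt_record (the CSV reader's default corrupt-record column in PERMISSIVE mode).

from pyspark.sql.types import StructType, StructField, IntegerType, DateType, StringType

# same schema as in the question, plus a column to capture malformed rows
debug_schema = StructType([
    StructField("Call Number", IntegerType(), True),
    StructField("Incident Number", IntegerType(), True),
    StructField("Entry DtTm", DateType(), True),
    StructField("_corrupt_record", StringType(), True)
])

df = spark.read.format("csv") \
         .option("header", "true") \
         .schema(debug_schema) \
         .load("/FileStore/*/*")

# rows that could not be mapped to the schema show up here as raw strings
df.show(5, truncate=False)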

The simplest way to get the correct columns is to read the CSV file without a schema (or, if feasible, with the complete schema) and then select the required columns. There is no performance penalty for reading the whole CSV file: unlike columnar formats such as Parquet, CSV does not let Spark read only selected columns, so the file is always read completely anyway.

#read the csv file without inferring the schema
df = spark.read.option("header", "true").option("inferSchema", False).csv(<...>)

#all columns will now be of type string
df.printSchema()

#select the required columns and cast them to the appropriate type
df2 = df.selectExpr("cast(`Call Number` as int)", "cast(`Incident Number` as int)", ....)

#only the required columns with the correct type are contained in df2
df2.printSchema()
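As a variant of the same select-and-cast idea, here is a minimal sketch using col(...).cast(...) and to_timestamp instead of selectExpr. The column names and path are taken from the question; the timestamp format string is an assumption and has to be adjusted to the actual contents of Entry DtTm.

from pyspark.sql.functions import col, to_timestamp

df = spark.read.option("header", "true").csv("/FileStore/*/*")

df2 = df.select(
    col("Call Number").cast("int"),
    col("Incident Number").cast("int"),
    # format string is a guess; change it to match the real data
    to_timestamp(col("Entry DtTm"), "MM/dd/yyyy hh:mm:ss a").alias("Entry DtTm")
)

df2.printSchema()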

2 Comments

Thanks. Making a schema for 300 columns will not be a feasible option. If I go with your second approach, reading the CSV without a schema, then how can I assign a schema to the DataFrame?
I have added a small code example
