
I have a CSV file with 300 columns. Out of these 300 columns, I need only 3, so I defined a schema for those 3. But when I apply the schema to the DataFrame, it shows only 3 columns and incorrectly maps them to the first 3 columns of the file. It does not match the CSV column names with the StructFields of my schema. Please advise.


from pyspark.sql.types import *

dfschema = StructType([
    StructField("Call Number",IntegerType(),True),
    StructField("Incident Number",IntegerType(),True),
    StructField("Entry DtTm",DateType() ,True)
]) 

df = spark.read.format("csv")\
             .option("header","true")\
             .schema(dfschema)\
             .load("/FileStore/*/*")
df.show(5)
  • Could you please include the output of df.printSchema() after your last line? Commented Sep 4, 2022 at 4:29

1 Answer


This is actually the expected behaviour of Spark's CSV reader.

When you supply a schema, Spark maps it to the CSV columns by position, not by name, which is why your three fields end up on the first three columns of the file. If the columns in the CSV file do not match the supplied schema, Spark treats the rows as corrupt records. The easiest way to see this is to add another column, _corrupt_record, of type string to the schema; you will see that all rows are stored in this column.
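To see this in action, here is a minimal sketch, assuming the column names and path from the question: the original three-field schema plus a string field named _corrupt_record (the CSV reader's default corrupt-record column in PERMISSIVE mode).

from pyspark.sql.types import StructType, StructField, IntegerType, DateType, StringType

# same schema as in the question, plus a column to capture malformed rows
debug_schema = StructType([
    StructField("Call Number", IntegerType(), True),
    StructField("Incident Number", IntegerType(), True),
    StructField("Entry DtTm", DateType(), True),
    StructField("_corrupt_record", StringType(), True)
])

df = spark.read.format("csv") \
         .option("header", "true") \
         .schema(debug_schema) \
         .load("/FileStore/*/*")

# rows that could not be mapped to the schema show up here as raw strings
df.show(5, truncate=False)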

The simplest way to get the correct columns is to read the CSV file without a schema (or, if feasible, with the complete schema) and then select the required columns. There is no performance penalty for reading the whole CSV file: unlike columnar formats such as Parquet, CSV does not let Spark read only selected columns, so the file is always read completely anyway.

#read the csv file without inferring the schema
df = spark.read.option("header", "true").option("inferSchema", False).csv(<...>)

#all columns will now be of type string
df.printSchema()

#select the required columns and cast them to the appropriate type
df2 = df.selectExpr("cast(`Call Number` as int)", "cast(`Incident Number` as int)", ....)

#only the required columns with the correct type are contained in df2
df2.printSchema()
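As a variant of the same select-and-cast idea, here is a minimal sketch using col(...).cast(...) and to_timestamp instead of selectExpr. The column names and path are taken from the question; the timestamp format string is an assumption and has to be adjusted to the actual contents of Entry DtTm.

from pyspark.sql.functions import col, to_timestamp

df = spark.read.option("header", "true").csv("/FileStore/*/*")

df2 = df.select(
    col("Call Number").cast("int"),
    col("Incident Number").cast("int"),
    # format string is a guess; change it to match the real data
    to_timestamp(col("Entry DtTm"), "MM/dd/yyyy hh:mm:ss a").alias("Entry DtTm")
)

df2.printSchema()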

2 Comments

Thanks. Making a schema for 300 columns will not be a feasible option. If I go with your second approach, reading the CSV without a schema, then how can I assign a schema to the DataFrame?
I have added a small code example
