
I am new to PySpark and am working with Spark version 2.2.0 and Python version 2.7.12.

I am trying to read 2 .csv files (each has more than 1 header row) into 2 different dataframes with a known schema, and then perform comparison operations on them.

I am unsure whether there is a better method to create a schema file (containing column name, datatype, and nullability) and reference it from the PySpark program when loading the data into a dataframe.

I have coded the following for the first file:

  1. Create a yaml file to store the file paths and the schema.

  2. Read the schema file and construct StructField(column name, datatype, nullability) objects dynamically in a loop. Example: [StructField(column1, Integer, true), StructField(column2, string, true), StructField(column3, decimal(10,2), true), ...]

  3. Read the data file into an RDD and remove the 2 header rows (using the subtract function).

  4. Create the dataframe using sqlContext.createDataFrame, passing the RDD and the schema structure.

I am able to perform these steps on sample data from the first file.

Please suggest if there is a better way (I have yet to explore the fromDDL option for StructType). After a similar dataframe is created for the second file, there is functional logic to be applied.
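The DDL route mentioned here boils the whole schema down to a single string; note that DDL-string schemas in PySpark (e.g. passing a string to `spark.read.schema(...)`, or `StructType.fromDDL`) arrived in Spark releases newer than the 2.2.0 in use here, so this is only a sketch of the direction, with hypothetical column names:

```python
def to_ddl(fields):
    """Turn (name, type, nullable) triples into a comma-separated DDL
    schema string; nullability is left to Spark's default (nullable)."""
    return ", ".join("{0} {1}".format(name, dtype.upper())
                     for name, dtype, _ in fields)

ddl = to_ddl([
    ("column1", "integer", True),
    ("column2", "string", True),
    ("column3", "decimal(10,2)", True),
])
# ddl == 'column1 INTEGER, column2 STRING, column3 DECIMAL(10,2)'

# In newer Spark versions the string can then be used directly, e.g.:
# df = spark.read.schema(ddl).csv("/path/to/file1.csv")  # hypothetical path
# or: schema = StructType.fromDDL(ddl)
```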

Thank you

  • Just trying to understand: does each of your csv files have '1' header, so with '2' csv files you want to "remove 2 header rows" with the approach you described? Commented Sep 8, 2018 at 12:44
  • The csv files have 2 headers each. Commented Sep 8, 2018 at 13:13
  • I am able to do those 4 steps and create the dataframe. I am keeping this open to learn whether there is a better way. Commented Sep 9, 2018 at 10:53

2 Answers


How about reading in the file using PySpark's spark.read.csv with a StructType for the schema and the options header=true and mode=DROPMALFORMED, which ignores any records that don't match the schema?


1 Comment

Hi Ron D, it did not work with header=true. Instead, I just enforced the schema and did not specify a header. After that I filtered out the 2 header records using the dropna option. I am keeping the question open to understand which is the better method.
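This workaround works because schema enforcement turns header text in a typed column into null, after which dropna discards exactly those rows. A pure-Python illustration of the effect, with made-up values:

```python
rows = [
    ["column1", "column2"],   # header row 1
    ["id", "name"],           # header row 2
    ["1", "alice"],
    ["2", "bob"],
]

def cast_int_or_none(value):
    """Mimic schema enforcement: a failed cast becomes null (None)."""
    try:
        return int(value)
    except ValueError:
        return None

casted = [[cast_int_or_none(r[0]), r[1]] for r in rows]
data = [r for r in casted if r[0] is not None]   # the dropna step
# data -> [[1, 'alice'], [2, 'bob']] : both header rows are gone
```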

I am able to do this using a yaml configuration file (to store the schema), which is read from PySpark to dynamically construct the StructType.

It is working and meets the requirements. If there are better methods, I am happy to hear them.
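The dynamic construction this answer describes hinges on turning the yaml type strings into typed constructor calls. A small, general parser for that mapping (it handles parameterised types such as decimal(10,2) uniformly; the pyspark constructor table is sketched as comments, and all type names shown are illustrative):

```python
import re

def parse_type(type_str):
    """Split a yaml type string into base name and integer args,
    e.g. 'decimal(10,2)' -> ('decimal', [10, 2]), 'string' -> ('string', [])."""
    m = re.match(r"^(\w+)(?:\((\d+(?:,\s*\d+)*)\))?$", type_str.strip().lower())
    if not m:
        raise ValueError("unrecognised type string: %r" % type_str)
    args = [int(a) for a in m.group(2).split(",")] if m.group(2) else []
    return m.group(1), args

# The (name, args) pairs then feed a constructor table, e.g.:
# from pyspark.sql.types import DecimalType, IntegerType, StringType
# constructors = {"decimal": DecimalType, "integer": IntegerType,
#                 "string": StringType}
# name, args = parse_type("decimal(10,2)")
# dtype = constructors[name](*args)    # -> DecimalType(10, 2)
```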

