
I am new to PySpark and am working with Spark version 2.2.0 and Python version 2.7.12.

I am trying to read 2 .csv files (each has more than 1 header row) into 2 different dataframes with a known schema, and then perform comparison operations on them.

I am unsure whether there is a better method to create a schema file (containing column name, datatype, and nullability) and reference it from the PySpark program when loading the data into a dataframe.

I have coded the following for the first file:

  1. Create a yaml file to store the file paths and the schema.

  2. Read the schema file and construct StructField(column name, datatype, nullability) objects dynamically in a loop. Example: [StructField(column1, Integer, true), StructField(column2, string, true), StructField(column3, decimal(10,2), true), ...]

  3. Read the data file into an RDD and remove the 2 header rows (using the subtract function).

  4. Create the dataframe using sqlContext.createDataFrame, passing the RDD and the schema structure.

I am able to perform these steps on sample data from the first file.

Please suggest if there is a better way (I have yet to explore the fromDDL option for StructType). After a similar dataframe is created for the second file, there is functional logic to be applied.
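The DDL route mentioned here boils the whole schema down to a single string; note that DDL-string schemas in PySpark (e.g. passing a string to `spark.read.schema(...)`, or `StructType.fromDDL`) arrived in Spark releases newer than the 2.2.0 in use here, so this is only a sketch of the direction, with hypothetical column names:

```python
def to_ddl(fields):
    """Turn (name, type, nullable) triples into a comma-separated DDL
    schema string; nullability is left to Spark's default (nullable)."""
    return ", ".join("{0} {1}".format(name, dtype.upper())
                     for name, dtype, _ in fields)

ddl = to_ddl([
    ("column1", "integer", True),
    ("column2", "string", True),
    ("column3", "decimal(10,2)", True),
])
# ddl == 'column1 INTEGER, column2 STRING, column3 DECIMAL(10,2)'

# In newer Spark versions the string can then be used directly, e.g.:
# df = spark.read.schema(ddl).csv("/path/to/file1.csv")  # hypothetical path
# or: schema = StructType.fromDDL(ddl)
```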

Thank you

  • Just trying to understand: does each of your csv files have '1' header, so with '2' csv files you want to "remove 2 header rows" with the approach you described? Commented Sep 8, 2018 at 12:44
  • The csv files have 2 headers each. Commented Sep 8, 2018 at 13:13
  • I am able to do those 4 steps and create the dataframe. I am keeping this open to learn whether there is a better way. Commented Sep 9, 2018 at 10:53

2 Answers


How about reading in the file using PySpark's spark.read.csv with a StructType for the schema and the options header=true and mode=DROPMALFORMED, which ignores any records that don't match the schema?


1 Comment

Hi Ron D, it did not work with header=true. Instead, I just enforced the schema and did not specify a header. After that I filtered out the 2 header records using the dropna option. I am keeping the question open to understand which is the better method.
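This workaround works because schema enforcement turns header text in a typed column into null, after which dropna discards exactly those rows. A pure-Python illustration of the effect, with made-up values:

```python
rows = [
    ["column1", "column2"],   # header row 1
    ["id", "name"],           # header row 2
    ["1", "alice"],
    ["2", "bob"],
]

def cast_int_or_none(value):
    """Mimic schema enforcement: a failed cast becomes null (None)."""
    try:
        return int(value)
    except ValueError:
        return None

casted = [[cast_int_or_none(r[0]), r[1]] for r in rows]
data = [r for r in casted if r[0] is not None]   # the dropna step
# data -> [[1, 'alice'], [2, 'bob']] : both header rows are gone
```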

I am able to do this using a yaml configuration file (to store the schema), which is read from PySpark to dynamically construct the StructType.

It is working and meets the requirements. If there are better methods, I am happy to hear them.
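The dynamic construction this answer describes hinges on turning the yaml type strings into typed constructor calls. A small, general parser for that mapping (it handles parameterised types such as decimal(10,2) uniformly; the pyspark constructor table is sketched as comments, and all type names shown are illustrative):

```python
import re

def parse_type(type_str):
    """Split a yaml type string into base name and integer args,
    e.g. 'decimal(10,2)' -> ('decimal', [10, 2]), 'string' -> ('string', [])."""
    m = re.match(r"^(\w+)(?:\((\d+(?:,\s*\d+)*)\))?$", type_str.strip().lower())
    if not m:
        raise ValueError("unrecognised type string: %r" % type_str)
    args = [int(a) for a in m.group(2).split(",")] if m.group(2) else []
    return m.group(1), args

# The (name, args) pairs then feed a constructor table, e.g.:
# from pyspark.sql.types import DecimalType, IntegerType, StringType
# constructors = {"decimal": DecimalType, "integer": IntegerType,
#                 "string": StringType}
# name, args = parse_type("decimal(10,2)")
# dtype = constructors[name](*args)    # -> DecimalType(10, 2)
```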

