I am new to PySpark and am working with Spark 2.2.0 and Python 2.7.12.
I am trying to read two .csv files (each has more than one header row) into two different DataFrames with known schemas, and then perform comparison operations on them.
I am unsure whether there is a better way to create a schema file (containing column name, data type, and nullability) and reference it in the PySpark program when loading the data into a DataFrame.
I have coded the following for the first file:
Create a YAML file to store the file paths and the schema.
Read the schema file and construct StructField(column name, data type, nullability) entries dynamically in a loop. Example: [StructField(column1, Integer, true), StructField(column2, String, true), StructField(column3, Decimal(10,2), true), ...]
Read the data file into an RDD and remove the 2 header rows (I will use the subtract function).
Create the DataFrame using sqlContext.createDataFrame, passing the RDD and the schema structure.
I am able to perform these steps on sample data from the first file.
Please suggest if there is a better way (I have yet to explore the fromDDL option for StructType). After a similar DataFrame is created for the second file, functional logic will be applied.
Thank you