
I need to read `.parquet` files that live in multiple folders, one per year. This is not a problem for one or two years, but it becomes tedious beyond that, since I must read the 12 subdirectories corresponding to each month. Here is an example of my inefficient approach.

Step 1: Read files

YEAR 2019

df_2019_01=spark.read.parquet('/2019/01/name.parquet/')
df_2019_02=spark.read.parquet('/2019/02/name.parquet/')
df_2019_03=spark.read.parquet('/2019/03/name.parquet/')
df_2019_04=spark.read.parquet('/2019/04/name.parquet/')

#...

df_2019_12=spark.read.parquet('/2019/12/name.parquet/')

YEAR 2020

df_2020_01=spark.read.parquet('/2020/01/name.parquet/')
df_2020_02=spark.read.parquet('/2020/02/name.parquet/')
df_2020_03=spark.read.parquet('/2020/03/name.parquet/')
df_2020_04=spark.read.parquet('/2020/04/name.parquet/')

#...

df_2020_12=spark.read.parquet('/2020/12/name.parquet/')

Step 2: Union files (every month of every year). NOTE: 1) all files have the same structure; 2) the file name is the same in all folders.

df = df_2019_01.union(df_2019_02).union(df_2019_03).union(df_2019_04).union(df_2020_12)
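Rather than hard-coding one variable per month, the list of monthly paths can be built programmatically and passed to a single read call (`spark.read.parquet` accepts multiple path arguments). A minimal sketch, where `monthly_paths` is a hypothetical helper and the path layout matches the one above:

```python
# Hypothetical helper: build the monthly parquet paths for a range of years,
# following the /<year>/<month>/name.parquet/ layout shown above.
def monthly_paths(start_year, end_year, name="name.parquet"):
    return [
        f"/{year}/{month:02d}/{name}/"
        for year in range(start_year, end_year + 1)
        for month in range(1, 13)
    ]

paths = monthly_paths(2019, 2020)
# One read instead of 24 separate reads + unions:
# df = spark.read.parquet(*paths)
```

This avoids the chain of `.union()` calls entirely, since Spark unions the files for you at read time.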

1 Answer


Replace the year and month with `*`:

df = spark.read.parquet('/*/*/*.parquet')

All parquet files must have the same schema; otherwise the final DataFrame will be missing columns. If the schemas differ, you can try this option:

mergedDF = spark.read.option("mergeSchema", "true").parquet('/*/*/*.parquet')

Your problem is similar to this question; if you want to retrieve the year and month, just follow my answer there.
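Since the wildcard read collapses all folders into one DataFrame, the year and month can be recovered from each row's file path (in Spark, typically via `input_file_name()` plus `regexp_extract`). The extraction logic itself is just a pattern match on the `/<year>/<month>/` segments; a minimal pure-Python sketch of that pattern, where `year_month` is a hypothetical helper:

```python
import re

# Hypothetical helper: pull (year, month) out of a path shaped like
# /<yyyy>/<mm>/name.parquet/... as produced by the folder layout above.
def year_month(path):
    match = re.search(r"/(\d{4})/(\d{2})/", path)
    return (match.group(1), match.group(2)) if match else None
```

The same regex can be used as the pattern argument to Spark's `regexp_extract` on the column returned by `input_file_name()`.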
