
I need to read `.parquet` files that live in multiple folders, one per year. This is not a problem for one or two years, but it becomes tedious beyond that, since I must read the 12 subdirectories corresponding to each month. Here is an example of my inefficient approach.

Step 1: Read files

YEAR 2019

df_2019_01=spark.read.parquet('/2019/01/name.parquet/')
df_2019_02=spark.read.parquet('/2019/02/name.parquet/')
df_2019_03=spark.read.parquet('/2019/03/name.parquet/')
df_2019_04=spark.read.parquet('/2019/04/name.parquet/')

#...

df_2019_12=spark.read.parquet('/2019/12/name.parquet/')

YEAR 2020

df_2020_01=spark.read.parquet('/2020/01/name.parquet/')
df_2020_02=spark.read.parquet('/2020/02/name.parquet/')
df_2020_03=spark.read.parquet('/2020/03/name.parquet/')
df_2020_04=spark.read.parquet('/2020/04/name.parquet/')

#...

df_2020_12=spark.read.parquet('/2020/12/name.parquet/')

Step 2: Union files (every month of every year). NOTE: 1) all files have the same structure; 2) the file name is the same in all folders.

df = df_2019_01.union(df_2019_02).union(df_2019_03).union(df_2019_04).union(df_2020_12)
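Rather than hard-coding one variable per month, the list of monthly paths can be built programmatically and passed to a single read call (`spark.read.parquet` accepts multiple path arguments). A minimal sketch, where `monthly_paths` is a hypothetical helper and the path layout matches the one above:

```python
# Hypothetical helper: build the monthly parquet paths for a range of years,
# following the /<year>/<month>/name.parquet/ layout shown above.
def monthly_paths(start_year, end_year, name="name.parquet"):
    return [
        f"/{year}/{month:02d}/{name}/"
        for year in range(start_year, end_year + 1)
        for month in range(1, 13)
    ]

paths = monthly_paths(2019, 2020)
# One read instead of 24 separate reads + unions:
# df = spark.read.parquet(*paths)
```

This avoids the chain of `.union()` calls entirely, since Spark unions the files for you at read time.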

1 Answer


Replace the year and month with `*`:

df = spark.read.parquet('/*/*/*.parquet')

All parquet files must have the same schema; otherwise the final DataFrame will be missing columns. If the schemas differ, you can try this option:

mergedDF = spark.read.option("mergeSchema", "true").parquet('/*/*/*.parquet')

Your problem is similar to this question; if you want to retrieve the year and month, just follow my answer there.
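Since the wildcard read collapses all folders into one DataFrame, the year and month can be recovered from each row's file path (in Spark, typically via `input_file_name()` plus `regexp_extract`). The extraction logic itself is just a pattern match on the `/<year>/<month>/` segments; a minimal pure-Python sketch of that pattern, where `year_month` is a hypothetical helper:

```python
import re

# Hypothetical helper: pull (year, month) out of a path shaped like
# /<yyyy>/<mm>/name.parquet/... as produced by the folder layout above.
def year_month(path):
    match = re.search(r"/(\d{4})/(\d{2})/", path)
    return (match.group(1), match.group(2)) if match else None
```

The same regex can be used as the pattern argument to Spark's `regexp_extract` on the column returned by `input_file_name()`.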
