
I can group large datasets and write multiple CSV or Excel files with a pandas DataFrame. But how do I do the same with a PySpark DataFrame: group roughly 700K records into about 230 country-wise groups and write 230 CSV files?

Using pandas

grouped = df.groupby("country_code")

# run this to generate separate Excel files
for country_code, group in grouped:
    group.to_excel(excel_writer=f"{country_code}.xlsx", sheet_name=country_code, index=False)

With a PySpark DataFrame, when I try something similar:

for country_code, df_country in df.groupBy('country_code'):
    print(country_code, df_country.show(1))

it raises:

TypeError: 'GroupedData' object is not iterable

2 Answers


If your requirement is to save each country's data in a different file, you can achieve it by partitioning the data. But instead of a file you will get a folder per country, because Spark can't write directly to a single named file.

Spark creates a folder whenever the DataFrame writer is called.

df.write.partitionBy('country_code').csv(path)

The output will be multiple folders, each containing the corresponding country's data:

path/country_code=india/part-0000.csv
path/country_code=australia/part-0000.csv

If you want a single file inside each folder, you can repartition your data first:

df.repartition('country_code').write.partitionBy('country_code').csv(path)

2 Comments

Some of my columns contain an array data structure. After using df.repartition('country_code').write.partitionBy('country_code').csv('grouped_data/') I get AnalysisException: CSV data source does not support array<struct<id:string,type:string>> data type. I think I need to cast the array to a string before the partitionBy-to-CSV write.
You can explode/flatten the array, or save the data as JSON or Parquet files instead.

Use partitionBy at write time so that every partition is based on the column you specify (country_code in your case).

Here's more on this.

