
I can group large datasets and write multiple CSV or Excel files with a pandas DataFrame. But how do I do the same with a PySpark DataFrame: group roughly 700K records into about 230 country-wise groups and write 230 CSV files?

Using pandas

grouped = df.groupby("country_code")

# run this to generate separate Excel files
for country_code, group in grouped:
    group.to_excel(excel_writer=f"{country_code}.xlsx", sheet_name=country_code, index=False)

With a PySpark DataFrame, when I try something similar:

for country_code, df_country in df.groupBy('country_code'):
    print(country_code, df_country.show(1))

it raises:

TypeError: 'GroupedData' object is not iterable

2 Answers


If your requirement is to save each country's data in a different file, you can achieve it by partitioning the data. But instead of a file you will get a folder per country, because Spark can't write directly to a single named file.

Spark creates a folder whenever the DataFrame writer is called.

df.write.partitionBy('country_code').csv(path)

The output will be multiple folders, each containing the corresponding country's data:

path/country_code=india/part-0000.csv
path/country_code=australia/part-0000.csv

If you want a single file inside each folder, you can repartition your data first:

df.repartition('country_code').write.partitionBy('country_code').csv(path)

2 Comments

Some of my columns contain an array data structure. After using df.repartition('country_code').write.partitionBy('country_code').csv('grouped_data/') I get AnalysisException: CSV data source does not support array<struct<id:string,type:string>> data type. I think I need to cast the array to a string before the partitionBy-to-CSV write.
You can explode/flatten the array, or save the data as JSON or Parquet files instead.

Use partitionBy at write time so that every partition is based on the column you specify (country_code in your case).

Here's more on this.

