I can group a large dataset and write multiple CSV/Excel files with a pandas DataFrame. How can I do the same with a PySpark DataFrame, i.e. group 700K records into around 230 groups and write 230 CSV files, one per country?
Using pandas:

grouped = df.groupby("country_code")

# run this to generate separate Excel files
for country_code, group in grouped:
    group.to_excel(excel_writer=f"{country_code}.xlsx", sheet_name=country_code, index=False)
With a PySpark DataFrame, when I try the same approach:

for country_code, df_country in df.groupBy('country_code'):
    print(country_code, df_country.show(1))
it raises:
TypeError: 'GroupedData' object is not iterable