
I'm using PySpark SQL functions (version 2.5.4). I have the following data in a pyspark.sql.DataFrame:

 df = spark.createDataFrame(
    [
        (302, 'foo'), # values
        (203, 'bar'),
        (202, 'foo'),
        (202, 'bar'),
        (172, 'xxx'),
        (172, 'yyy'),
    ],
    ['LU', 'input'] # column labels
)

display(df)

What I would like to do is create a separate CSV file for each 'LU'. So the CSVs would look like this:

LU_302.csv

 LU_302 = spark.createDataFrame(
    [
        (302, 'foo'), # values
    ],
    ['LU', 'input'] # column labels
)

LU_203.csv

 LU_203 = spark.createDataFrame(
    [
        (203, 'bar'), # values
    ],
    ['LU', 'input'] # column labels
)

LU_202.csv

 LU_202 = spark.createDataFrame(
    [
        (202, 'foo'), # values
        (202, 'bar'), # values
    ],
    ['LU', 'input'] # column labels
)

LU_172.csv

 LU_172 = spark.createDataFrame(
    [
        (172, 'xxx'), # values
        (172, 'yyy'), # values
    ],
    ['LU', 'input'] # column labels
)

The separated dataframes above are Spark dataframes, shown just for illustration; what I actually want is each one written out as a CSV file.

So you can see the dataframe has been split into separate dataframes using the 'LU' variable. I've been looking into how to do this with a loop that runs over the distinct 'LU' values and writes a new CSV to a file path for each one, but I can't find a solution.

Thanks


1 Answer


You can save the dataframe using a partitioned write, like:

 df.coalesce(1).write.partitionBy('LU').format('csv').option('header', 'true').save(file_path)

where `file_path` is the output directory. This creates one subdirectory per distinct 'LU' value (e.g. `LU=302/`) under that path.