
I'm using PySpark SQL functions (version 2.5.4). I have the following data in a pyspark.sql.DataFrame:

 df = spark.createDataFrame(
    [
        (302, 'foo'), # values
        (203, 'bar'),
        (202, 'foo'),
        (202, 'bar'),
        (172, 'xxx'),
        (172, 'yyy'),
    ],
    ['LU', 'input'] # column labels
)

display(df)

What I would like to do is create a separate CSV file for each 'LU'. So the CSVs would look like this:

LU_302.csv

 LU_302 = spark.createDataFrame(
    [
        (302, 'foo'), # values
    ],
    ['LU', 'input'] # column labels
)

LU_203.csv

 LU_203 = spark.createDataFrame(
    [
        (203, 'bar'), # values
    ],
    ['LU', 'input'] # column labels
)

LU_202.csv

 LU_202 = spark.createDataFrame(
    [
        (202, 'foo'), # values
        (202, 'bar'), # values
    ],
    ['LU', 'input'] # column labels
)

LU_172.csv

 LU_172 = spark.createDataFrame(
    [
        (172, 'xxx'), # values
        (172, 'yyy'), # values
    ],
    ['LU', 'input'] # column labels
)

The separated dataframes above are Spark dataframes, shown just for illustration; what I actually want is each one written out as a CSV file.

So you can see the dataframe has been split into separate dataframes using the 'LU' variable. I've been looking into how to do this with a loop that runs over the distinct 'LU' values and writes a new CSV to a file path for each one, but I can't find a solution.

Thanks


1 Answer


You can save the dataframe using a partitioned write, like:

 df.coalesce(1).write.partitionBy('LU').format('csv').option('header', 'true').save(file_path)

where `file_path` is the output directory. This creates one subdirectory per distinct 'LU' value (e.g. `LU=302/`) under that path.