
I have a DataFrame, say df1, with 10M rows. I want to split it into multiple CSV files with 1M rows each. Any suggestions on how to do this in Scala?

1 Answer

You can use the randomSplit method on DataFrames:

import scala.util.Random
import spark.implicits._  // needed for toDF; assumes a SparkSession named `spark`

val df = List(0, 1, 2, 3, 4, 5, 6, 7, 8, 9).toDF

// randomSplit takes an Array[Double] of weights; equal weights give
// roughly equal-sized splits (the assignment is random, not exact)
val splits = df.randomSplit(Array(1.0, 1.0, 1.0, 1.0, 1.0))
splits.foreach { a => a.write.format("csv").save("path" + Random.nextInt) }

I used Random.nextInt to get a unique name for each output path. You can substitute other naming logic there if necessary.

Source:

http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset

How to save a spark DataFrame as csv on disk?

https://forums.databricks.com/questions/8723/how-can-i-split-a-spark-dataframe-into-n-equal-dat.html

Edit: An alternative approach is to use limit and except:

import org.apache.spark.sql.DataFrame
import spark.implicits._  // needed for toDF; assumes a SparkSession named `spark`

var input: DataFrame = List(1, 2, 3, 4, 5, 6, 7, 8, 9).toDF
val limit = 2

var newFrames = List[DataFrame]()
var size = input.count

while (size > 0) {
    // Take the next `limit` rows, then remove them from the remaining input.
    // Note that except is a set difference, so it also drops duplicate rows.
    newFrames = input.limit(limit) :: newFrames
    input = input.except(newFrames.head)
    size = size - limit
}

newFrames.foreach(_.show)

The first element in the resulting list may contain fewer rows than the rest, when the total row count is not a multiple of the limit.


3 Comments

@Steffen: My requirement is to have a fixed number of rows per CSV, and the total number of CSVs is not fixed either. If the master file has 10M rows, 10 CSVs of 1M records each should be created; similarly, for 20M records, 20 CSVs of 1M records each should be created. This example does not solve that problem.
stackoverflow.com/questions/41223125/… provides a Scala example of how to do this. The number of partitions should be the length of your dataset divided by the number of rows per partition.
@Nitish I added an approach that may solve your problem based on an answer to this question: stackoverflow.com/questions/44135610/…
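As a sketch of the repartition-based idea from the comment above: compute the number of partitions as the row count divided by the desired rows per file, then write each partition out as one part file. The names `df1`, `rowsPerFile`, and the output path are placeholders, and an existing SparkSession is assumed; note that repartition distributes rows roughly evenly, so each file holds approximately (not exactly) `rowsPerFile` rows.

```scala
import org.apache.spark.sql.DataFrame

// df1: the 10M-row DataFrame from the question (assumed to exist)
val rowsPerFile = 1000000L

// Number of output files: ceil(count / rowsPerFile), so 10M rows -> 10 files
val numFiles = math.ceil(df1.count.toDouble / rowsPerFile).toInt

// Each of the `numFiles` partitions becomes one part-*.csv file under the path
df1.repartition(numFiles)
   .write
   .format("csv")
   .save("output/path")
```

This scales with the input size automatically (20M rows yields 20 files), at the cost of a full shuffle and only approximate file sizes.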
