
I have a very big Spark 2.3 dataframe like this:

-------------------------
| col_key | col1 | col2 |
-------------------------
|      AA |    1 |    2 |
|      AB |    2 |    1 |
|      AA |    2 |    3 |
|      AC |    1 |    2 |
|      AA |    3 |    2 |
|      AC |    5 |    3 |
-------------------------

I need to "split" this dataframe by the values in the col_key column and save each split part in a separate CSV file, so I have to get smaller dataframes like

-------------------------
| col_key | col1 | col2 |
-------------------------
|      AA |    1 |    2 |
|      AA |    2 |    3 |
|      AA |    3 |    2 |
-------------------------

and

-------------------------
| col_key | col1 | col2 |
-------------------------
|      AC |    1 |    2 |
|      AC |    5 |    3 |
-------------------------

and so on. Every resulting dataframe needs to be saved as a different CSV file.

The number of keys is not big (20-30), but the total amount of data is (~200 million records).

I have a solution where, in a loop, each part of the data is selected and then saved to a file:

import org.apache.spark.sql.functions.lit
import spark.implicits._ // for the $"..." syntax and the String encoder

// Collect the distinct keys to the driver (cheap: only 20-30 keys)
val keysList = df.select("col_key").distinct().map(r => r.getString(0)).collect.toList

// Filter and save the rows for each key, one key at a time
keysList.foreach { k =>
  val dfi = df.where($"col_key" === lit(k))
  SaveDataByKey(dfi, path_to_save)
}

It works correctly, but the bad thing about this solution is that selecting the data for every key causes a full pass through the whole dataframe, and it takes too much time. I think there must be a faster solution where we pass through the dataframe only once and, during this pass, route every record to the "right" result dataframe (or directly to a separate file). But I don't know how to do it :) Maybe someone has ideas about it?
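One partial mitigation (a rough sketch, using the same df, keysList and SaveDataByKey as above) would be to cache the dataframe before the loop, so every per-key filter reads from the cache instead of rescanning the source:

import org.apache.spark.storage.StorageLevel

// Materialize df once; the per-key filters then read the cached data
// instead of rescanning the source. ~200M records may not fit in
// memory, hence MEMORY_AND_DISK.
df.persist(StorageLevel.MEMORY_AND_DISK)

keysList.foreach { k =>
  SaveDataByKey(df.where($"col_key" === lit(k)), path_to_save)
}

df.unpersist()

But this still runs one Spark job per key, so a true single-pass approach would still be better.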

Also, I prefer to use Spark's DataFrame API because it provides the fastest way of processing the data (so using RDDs is not desirable, if possible).

4 Comments
  • What does SaveDataByKey do? Is what you want to do simply to save the dataframe into different folders partitioned on the col_key column? Commented May 9, 2019 at 8:58
  • I need to save the data (extracted by different keys) to different CSV files. SaveDataByKey does exactly this. Commented May 9, 2019 at 9:02
  • Possible duplicate of How do I split an RDD into two or more RDDs? Commented May 9, 2019 at 9:06
  • I don't think it is a duplicate, because I want to use Spark's DataFrame API only, if that is possible. Commented May 9, 2019 at 9:08

1 Answer


You need to partition by the column and save as CSV files. Each partition is saved as its own file.

yourDF
  .write
  .partitionBy("col_key")
  .csv("/path/to/save")

Why don't you try this?
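For reference, partitionBy writes each distinct col_key value into its own sub-directory and drops the col_key column from the file contents, because the value is already encoded in the directory name; a key's directory can also contain several part files if its rows are spread over multiple partitions. The output layout is roughly:

/path/to/save/
  col_key=AA/part-...csv
  col_key=AB/part-...csv
  ...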


2 Comments

Thanks, this could be acceptable for my task. But there are two notes: 1) the column "col_key" also needs to be present in the result data; 2) the result CSV needs to be a single file for each key.
Regarding note 1): we can create a duplicate column for col_key so that the duplicate also ends up in the result data. But what about automatically combining all the CSV files (created for a given key) into one CSV?
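A sketch covering both notes (assuming Spark 2.3's built-in CSV writer; "col_key_dup" is just an illustrative name): duplicating the key column keeps it inside the CSV contents, and repartitioning by the key before the write puts all of a key's rows into a single partition, so each col_key=... directory gets exactly one part file:

import org.apache.spark.sql.functions.col

df
  .withColumn("col_key_dup", col("col_key")) // keep the key visible inside the CSV
  .repartition(col("col_key"))               // all rows of a key land in one partition => one part file per key
  .write
  .partitionBy("col_key")
  .option("header", "true")
  .csv("/path/to/save")

The result is still one file per key directory rather than one file with a name of your choosing, so renaming or moving the part files (e.g. via Hadoop's FileSystem API) would remain a separate step.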
