
I have a very big Spark 2.3 dataframe like this:

-------------------------
| col_key | col1 | col2 |
-------------------------
|      AA |    1 |    2 |
|      AB |    2 |    1 |
|      AA |    2 |    3 |
|      AC |    1 |    2 |
|      AA |    3 |    2 |
|      AC |    5 |    3 |
-------------------------

I need to "split" this dataframe by the values in the col_key column and save each split part in a separate CSV file, so I have to get smaller dataframes like

-------------------------
| col_key | col1 | col2 |
-------------------------
|      AA |    1 |    2 |
|      AA |    2 |    3 |
|      AA |    3 |    2 |
-------------------------

and

-------------------------
| col_key | col1 | col2 |
-------------------------
|      AC |    1 |    2 |
|      AC |    5 |    3 |
-------------------------

and so on. Every resulting dataframe needs to be saved as a different CSV file.

The number of keys is not big (20-30), but the total amount of data is (~200 million records).

I have a solution where, in a loop, each part of the data is selected and then saved to a file:

import org.apache.spark.sql.functions.lit
import spark.implicits._ // for the $"..." syntax and the String encoder

// Collect the distinct keys to the driver (cheap: only 20-30 keys)
val keysList = df.select("col_key").distinct().map(r => r.getString(0)).collect.toList

// Filter and save the rows for each key, one key at a time
keysList.foreach { k =>
  val dfi = df.where($"col_key" === lit(k))
  SaveDataByKey(dfi, path_to_save)
}

It works correctly, but the bad thing about this solution is that selecting the data for every key causes a full pass through the whole dataframe, and it takes too much time. I think there must be a faster solution where we pass through the dataframe only once and, during this pass, route every record to the "right" result dataframe (or directly to a separate file). But I don't know how to do it :) Maybe someone has ideas about it?
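One partial mitigation (a rough sketch, using the same df, keysList and SaveDataByKey as above) would be to cache the dataframe before the loop, so every per-key filter reads from the cache instead of rescanning the source:

import org.apache.spark.storage.StorageLevel

// Materialize df once; the per-key filters then read the cached data
// instead of rescanning the source. ~200M records may not fit in
// memory, hence MEMORY_AND_DISK.
df.persist(StorageLevel.MEMORY_AND_DISK)

keysList.foreach { k =>
  SaveDataByKey(df.where($"col_key" === lit(k)), path_to_save)
}

df.unpersist()

But this still runs one Spark job per key, so a true single-pass approach would still be better.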

Also, I prefer to use Spark's DataFrame API because it provides the fastest way of processing the data (so using RDDs is not desirable, if possible).

4 Comments
  • What does SaveDataByKey do? Is what you want to do simply to save the dataframe into different folders partitioned on the col_key column? Commented May 9, 2019 at 8:58
  • I need to save the data (extracted by different keys) to different CSV files. SaveDataByKey does exactly this. Commented May 9, 2019 at 9:02
  • Possible duplicate of How do I split an RDD into two or more RDDs? Commented May 9, 2019 at 9:06
  • I don't think it is a duplicate, because I want to use Spark's DataFrame API only, if that is possible. Commented May 9, 2019 at 9:08

1 Answer


You need to partition by the column and save as CSV files. Each partition is saved as its own file.

yourDF
  .write
  .partitionBy("col_key")
  .csv("/path/to/save")

Why don't you try this?
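For reference, partitionBy writes each distinct col_key value into its own sub-directory and drops the col_key column from the file contents, because the value is already encoded in the directory name; a key's directory can also contain several part files if its rows are spread over multiple partitions. The output layout is roughly:

/path/to/save/
  col_key=AA/part-...csv
  col_key=AB/part-...csv
  ...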


2 Comments

Thanks, this could be acceptable for my task. But there are two notes: 1) the column "col_key" also needs to be present in the result data; 2) the result CSV needs to be a single file for each key.
Regarding note 1): we can create a duplicate column for col_key so that the duplicate also ends up in the result data. But what about automatically combining all the CSV files (created for a given key) into one CSV?
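A sketch covering both notes (assuming Spark 2.3's built-in CSV writer; "col_key_dup" is just an illustrative name): duplicating the key column keeps it inside the CSV contents, and repartitioning by the key before the write puts all of a key's rows into a single partition, so each col_key=... directory gets exactly one part file:

import org.apache.spark.sql.functions.col

df
  .withColumn("col_key_dup", col("col_key")) // keep the key visible inside the CSV
  .repartition(col("col_key"))               // all rows of a key land in one partition => one part file per key
  .write
  .partitionBy("col_key")
  .option("header", "true")
  .csv("/path/to/save")

The result is still one file per key directory rather than one file with a name of your choosing, so renaming or moving the part files (e.g. via Hadoop's FileSystem API) would remain a separate step.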
