
Hi, I have a DataFrame as shown:

ID       X        Y
1      1234      284
1      1396      179
2      8620      178
3      1620      191
3      8820      828

I want to split this DataFrame into multiple DataFrames based on ID. For this example there would be three DataFrames. One way to achieve this is to run a filter operation in a loop. However, I would like to know if it can be done in a much more efficient way.

  • Possible duplicate of How can I split a dataframe into dataframes with same column values in SCALA and SPARK. Commented May 9, 2017 at 18:57
  • Yes, but I am looking for a PySpark version. Commented May 9, 2017 at 19:06
  • A more optimal solution is possible if the column is stored by partition; then we can perform the calculation in parallel across the cluster. Commented May 10, 2017 at 8:27
  • This is exactly what I am trying to do. So far I am using partitionBy to store and load the data. I would like to know whether, after doing partitionBy, I can split the DataFrame into multiple DataFrames based on the partitions (see the sketch below). Commented May 10, 2017 at 23:54
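
For context, a minimal sketch of the partitionBy idea discussed in these comments, assuming the DataFrame from the question is called df, an existing SparkSession named spark, and a hypothetical output path /tmp/by_id (none of these are from the post): writing with partitionBy("ID") lays the data out in one directory per ID value, and a subsequent read filtered on ID only scans the matching directory thanks to partition pruning.

# Hypothetical sketch: persist the data partitioned by ID, then
# read a single ID back cheaply. `spark` and the path are assumptions.
df.write.mode("overwrite").partitionBy("ID").parquet("/tmp/by_id")

# The filter on the partition column prunes to one directory,
# so building the per-ID DataFrame avoids a full scan.
df_id1 = spark.read.parquet("/tmp/by_id").where("ID = 1")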

2 Answers

# initialize the Spark DataFrame
df = sc.parallelize([(1, 1234, 284), (1, 1396, 179), (2, 8620, 178),
                     (3, 1620, 191), (3, 8820, 828)]).toDF(["ID", "X", "Y"])

# get the list of unique ID values; there's probably a better way
# to do this, but this was quick and easy
# (Python 2 only -- see the other answer for a Python 3 fix)
listids = [x.asDict().values()[0] for x in df.select("ID").distinct().collect()]

# create a list of DataFrames, one per ID
dfArray = [df.where(df.ID == x) for x in listids]

dfArray[0].show()
+---+----+---+
| ID|   X|  Y|
+---+----+---+
|  1|1234|284|
|  1|1396|179|
+---+----+---+

dfArray[1].show()
+---+----+---+
| ID|   X|  Y|
+---+----+---+
|  2|8620|178|
+---+----+---+

dfArray[2].show()
+---+----+---+
| ID|   X|  Y|
+---+----+---+
|  3|1620|191|
|  3|8820|828|
+---+----+---+
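
A side note, not from the original answer: each element of dfArray is a lazy transformation over the same source, so every action re-evaluates the scan; caching the parent DataFrame first is a common way to make the repeated filters cheaper. A minimal sketch, using the df and listids defined above:

# Optional: cache the parent DataFrame so the repeated per-ID
# filters don't each rescan the source data.
df.cache()
dfArray = [df.where(df.ID == x) for x in listids]
for sub in dfArray:
    sub.show()   # each action now reads from the cached data
df.unpersist()   # release the cache when done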

3 Comments

You are looping. I think this is the closest to what I am seeking. stackoverflow.com/questions/41663985/… But there was I/O time associated with it.
If you want to "get something for each something", there is going to be an inherent loop somewhere.
True. But you can map the task to different partitions and get a list of DFs. That is what I am trying to do.

The answer of @James Tobin needs to be altered a tiny bit if you are working with Python 3.x, as dict.values returns a dict_values view object instead of a list. A quick workaround is just adding the list function:

listids = [list(x.asDict().values())[0] 
           for x in df.select("ID").distinct().collect()]

Posting as a separate answer as I do not have the reputation required to comment on his answer.
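
As a further aside, not part of either answer: the asDict/values dance can be skipped entirely, since PySpark Row objects expose their fields by attribute, which works the same under Python 2 and 3:

# Row fields are accessible by attribute, so the distinct IDs can
# be collected directly; portable across Python 2 and 3.
listids = [row.ID for row in df.select("ID").distinct().collect()]
dfArray = [df.where(df.ID == x) for x in listids]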

