
Hi, I have a DataFrame as shown:

ID       X        Y
1      1234      284
1      1396      179
2      8620      178
3      1620      191
3      8820      828

I want to split this DataFrame into multiple DataFrames based on ID. For this example there would be three DataFrames. One way to achieve this is to run a filter operation in a loop. However, I would like to know if it can be done in a much more efficient way.

  • Possible duplicate of How can I split a dataframe into dataframes with same column values in SCALA and SPARK. Commented May 9, 2017 at 18:57
  • Yes, but I am looking for a PySpark version. Commented May 9, 2017 at 19:06
  • A more optimal solution is possible if the column is stored by partition; then we can perform the calculation in parallel across the cluster. Commented May 10, 2017 at 8:27
  • This is exactly what I am trying to do. So far I am using partitionBy to store and load the data. I would like to know whether, after doing partitionBy, I can split the DataFrame into multiple DataFrames based on the partitions (see the sketch below). Commented May 10, 2017 at 23:54
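
For context, a minimal sketch of the partitionBy idea discussed in these comments, assuming the DataFrame from the question is called df, an existing SparkSession named spark, and a hypothetical output path /tmp/by_id (none of these are from the post): writing with partitionBy("ID") lays the data out in one directory per ID value, and a subsequent read filtered on ID only scans the matching directory thanks to partition pruning.

# Hypothetical sketch: persist the data partitioned by ID, then
# read a single ID back cheaply. `spark` and the path are assumptions.
df.write.mode("overwrite").partitionBy("ID").parquet("/tmp/by_id")

# The filter on the partition column prunes to one directory,
# so building the per-ID DataFrame avoids a full scan.
df_id1 = spark.read.parquet("/tmp/by_id").where("ID = 1")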

2 Answers

# initialize the Spark DataFrame
df = sc.parallelize([(1, 1234, 284), (1, 1396, 179), (2, 8620, 178),
                     (3, 1620, 191), (3, 8820, 828)]).toDF(["ID", "X", "Y"])

# get the list of unique ID values; there's probably a better way
# to do this, but this was quick and easy
# (Python 2 only -- see the other answer for a Python 3 fix)
listids = [x.asDict().values()[0] for x in df.select("ID").distinct().collect()]

# create a list of DataFrames, one per ID
dfArray = [df.where(df.ID == x) for x in listids]

dfArray[0].show()
+---+----+---+
| ID|   X|  Y|
+---+----+---+
|  1|1234|284|
|  1|1396|179|
+---+----+---+

dfArray[1].show()
+---+----+---+
| ID|   X|  Y|
+---+----+---+
|  2|8620|178|
+---+----+---+

dfArray[2].show()
+---+----+---+
| ID|   X|  Y|
+---+----+---+
|  3|1620|191|
|  3|8820|828|
+---+----+---+
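
A side note, not from the original answer: each element of dfArray is a lazy transformation over the same source, so every action re-evaluates the scan; caching the parent DataFrame first is a common way to make the repeated filters cheaper. A minimal sketch, using the df and listids defined above:

# Optional: cache the parent DataFrame so the repeated per-ID
# filters don't each rescan the source data.
df.cache()
dfArray = [df.where(df.ID == x) for x in listids]
for sub in dfArray:
    sub.show()   # each action now reads from the cached data
df.unpersist()   # release the cache when done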

3 Comments

You are looping. I think this is the closest to what I am seeking. stackoverflow.com/questions/41663985/… But there was I/O time associated with it.
If you want to "get something for each something", there is going to be an inherent loop somewhere.
True. But you can map the task to different partitions and get a list of DFs. That is what I am trying to do.

The answer of @James Tobin needs to be altered a tiny bit if you are working with Python 3.x, as dict.values returns a dict_values view object instead of a list. A quick workaround is just adding the list function:

listids = [list(x.asDict().values())[0] 
           for x in df.select("ID").distinct().collect()]

Posting as a separate answer as I do not have the reputation required to comment on his answer.
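
As a further aside, not part of either answer: the asDict/values dance can be skipped entirely, since PySpark Row objects expose their fields by attribute, which works the same under Python 2 and 3:

# Row fields are accessible by attribute, so the distinct IDs can
# be collected directly; portable across Python 2 and 3.
listids = [row.ID for row in df.select("ID").distinct().collect()]
dfArray = [df.where(df.ID == x) for x in listids]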

