
I have a DataFrame similar to this example:

Timestamp | Word | Count
30/12/2015 | example_1 | 3
29/12/2015 | example_2 | 1
28/12/2015 | example_2 | 9
27/12/2015 | example_3 | 7
... | ... | ...

and I want to split this DataFrame by the 'Word' column's values to obtain a "list" of DataFrames (to plot some figures in a later step). For example:

DF1

Timestamp | Word | Count
30/12/2015 | example_1 | 3

DF2

Timestamp | Word | Count
29/12/2015 | example_2 | 1
28/12/2015 | example_2 | 9

DF3

Timestamp | Word | Count
27/12/2015 | example_3 | 7

Is there a way to do this with PySpark (1.6)?

2 Answers


It won't be efficient, but you can map a filter over the list of unique values:

# Collect the distinct values of "Word" to the driver
words = df.select("Word").distinct().flatMap(lambda x: x).collect()
# Build one filtered DataFrame per distinct word
dfs = [df.where(df["Word"] == word) for word in words]

In Spark 2.0 and later, DataFrame.flatMap is gone, so go through the underlying RDD:

words = df.select("Word").distinct().rdd.flatMap(lambda x: x).collect()
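
For the plotting step mentioned in the question, a minimal follow-up sketch might look like this (assuming matplotlib is installed and each per-word slice is small enough to collect to the driver; the variable names match the snippets above):

import matplotlib.pyplot as plt

for word, word_df in zip(words, dfs):
    pdf = word_df.toPandas()  # pull this word's rows to the driver as pandas
    pdf.plot(x="Timestamp", y="Count", title=word)

plt.show()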


In addition to what zero323 said, I would add

df.persist()

before the creation of the dfs, so the source DataFrame won't be recomputed each time you run an action on one of your dfs.
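
Putting both answers together, a minimal sketch might look like this (the count action and the unpersist call are illustrative additions, not part of either answer):

df.persist()  # cache the source DataFrame

words = df.select("Word").distinct().rdd.flatMap(lambda x: x).collect()
dfs = [df.where(df["Word"] == word) for word in words]

for word_df in dfs:
    word_df.count()  # each action now reads from the cache instead of recomputing df

df.unpersist()  # release the cached data once you are done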

