
I have a DataFrame similar to this example:

Timestamp | Word | Count
30/12/2015 | example_1 | 3
29/12/2015 | example_2 | 1
28/12/2015 | example_2 | 9
27/12/2015 | example_3 | 7
... | ... | ...

and I want to split this DataFrame by the 'Word' column's values to obtain a "list" of DataFrames (to plot some figures in a later step). For example:

DF1

Timestamp | Word | Count
30/12/2015 | example_1 | 3

DF2

Timestamp | Word | Count
29/12/2015 | example_2 | 1
28/12/2015 | example_2 | 9

DF3

Timestamp | Word | Count
27/12/2015 | example_3 | 7

Is there a way to do this with PySpark (1.6)?

2 Answers


It won't be efficient, but you can map a filter over the list of unique values:

# Collect the distinct values of "Word" to the driver
words = df.select("Word").distinct().flatMap(lambda x: x).collect()
# Build one filtered DataFrame per distinct word
dfs = [df.where(df["Word"] == word) for word in words]

In Spark 2.0 and later, DataFrame.flatMap is gone, so go through the underlying RDD:

words = df.select("Word").distinct().rdd.flatMap(lambda x: x).collect()
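
For the plotting step mentioned in the question, a minimal follow-up sketch might look like this (assuming matplotlib is installed and each per-word slice is small enough to collect to the driver; the variable names match the snippets above):

import matplotlib.pyplot as plt

for word, word_df in zip(words, dfs):
    pdf = word_df.toPandas()  # pull this word's rows to the driver as pandas
    pdf.plot(x="Timestamp", y="Count", title=word)

plt.show()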


In addition to what zero323 said, I would add

df.persist()

before the creation of the dfs, so the source DataFrame won't be recomputed each time you run an action on one of your dfs.
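
Putting both answers together, a minimal sketch might look like this (the count action and the unpersist call are illustrative additions, not part of either answer):

df.persist()  # cache the source DataFrame

words = df.select("Word").distinct().rdd.flatMap(lambda x: x).collect()
dfs = [df.where(df["Word"] == word) for word in words]

for word_df in dfs:
    word_df.count()  # each action now reads from the cache instead of recomputing df

df.unpersist()  # release the cached data once you are done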

