0

I have a pyspark dataframe. It is a movie dataset. One column is the genres split by |. Each movie has multiple genres.

genres = spark.sql("SELECT DISTINCT genres FROM movies ORDER BY genres ASC")
genres.show(5)

enter image description hereI would like to count each genre has how many movies. And I also want to show what are those movies. Just like the following: enter image description hereenter image description here How should I do this?

3 Answers 3

3

Here's a way to do:

# sample data
d = [('Action',), ('Action|Adventure',), ('Action|Adventure|Drama',)]
df = spark.createDataFrame(d, ['genres',])

# create count
agg_df = (df
          .rdd
          .map(lambda x: x.genres.split('|')) # gives nested list
          .flatMap(lambda x: x) # flatten the list
          .map(lambda x: (x,)) # convert to tuples
          .toDF(['genres'])
          .groupby('genres')
          .count())

agg_df.show()

+---------+-----+
|   genres|count|
+---------+-----+
|Adventure|    2|
|    Drama|    1|
|   Action|    3|
+---------+-----+
Sign up to request clarification or add additional context in comments.

3 Comments

So is it possible not to convert to rdd and directly work on the dataframe?
yes, it is possible using udf functions but native spark functions have speed advantages.
Using DataFrame API does not imply using UDF, there are a lot of Spark built-in functions to do this. I added an answer to show one simple way.
2

Here is a way using only DataFrame API. First, use split function to split the genres strings then explode the result array and groupBy genres to count:

data = [["Action"], ["Action|Adventure|Thriller"], ["Action|Adventure|Drama"]]
df = spark.createDataFrame(data, ["genres"])

df = df.withColumn("genres", explode(split(col("genres"), "[|]"))) \
    .groupBy("genres").count()

df.show()

Gives:

+---------+-----+
|   genres|count|
+---------+-----+
| Thriller|    1|
|Adventure|    2|
|    Drama|    1|
|   Action|    3|
+---------+-----+

Comments

0

Use:

import pyspark.sql.functions as f
df.groupby("generes").agg(f.collect_set("Category"),f.count("Category")).show()

this will get the desired output.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.