How to count the elements in a PySpark dataframe

Question

I have a PySpark dataframe holding a movie dataset. One column contains the genres, separated by |. Each movie can have multiple genres.

genres = spark.sql("SELECT DISTINCT genres FROM movies ORDER BY genres ASC")
genres.show(5)

I would like to count how many movies each genre has, and I also want to show which movies those are. Just like the following: How should I do this?

YOLO · Accepted Answer · 2020-01-07 10:02:21Z

3

Here's one way to do it:

# sample data
d = [('Action',), ('Action|Adventure',), ('Action|Adventure|Drama',)]
df = spark.createDataFrame(d, ['genres',])

# create count
agg_df = (df
          .rdd
          .map(lambda x: x.genres.split('|')) # gives nested list
          .flatMap(lambda x: x) # flatten the list
          .map(lambda x: (x,)) # convert to tuples
          .toDF(['genres'])
          .groupby('genres')
          .count())

agg_df.show()

+---------+-----+
|   genres|count|
+---------+-----+
|Adventure|    2|
|    Drama|    1|
|   Action|    3|
+---------+-----+

edited Jan 7, 2020 at 10:02

answered Jan 7, 2020 at 7:48

YOLO


3 Comments

user10262232 Over a year ago

So is it possible not to convert to rdd and directly work on the dataframe?

YOLO Over a year ago

Yes, it is possible using UDFs, but native Spark functions have speed advantages.

blackbishop Over a year ago

Using the DataFrame API does not imply using a UDF; there are a lot of Spark built-in functions to do this. I added an answer to show one simple way.

blackbishop · Accepted Answer · 2020-01-07 12:44:41Z

2

Here is a way using only the DataFrame API. First, use the split function to split the genres strings, then explode the resulting array and groupBy genres to count:

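from pyspark.sql.functions import col, explode, split

data = [["Action"], ["Action|Adventure|Thriller"], ["Action|Adventure|Drama"]]
df = spark.createDataFrame(data, ["genres"])

# split takes a regex, so the literal "|" is written as "[|]";
# explode then yields one row per genre, and groupBy counts them
df = df.withColumn("genres", explode(split(col("genres"), "[|]"))) \
    .groupBy("genres").count()

df.show()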

Gives:

+---------+-----+
|   genres|count|
+---------+-----+
| Thriller|    1|
|Adventure|    2|
|    Drama|    1|
|   Action|    3|
+---------+-----+

answered Jan 7, 2020 at 12:44

blackbishop


Andy_101 · Accepted Answer · 2020-01-07 07:29:37Z

0

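Use:

import pyspark.sql.functions as f
df.groupby("genres").agg(f.collect_set("Category"), f.count("Category")).show()

This groups by genres and returns, per group, the distinct values and the count of the Category column (substitute your own column names). It gives the desired output only if each row already holds a single genre.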

answered Jan 7, 2020 at 7:29

Andy_101
