Split count results of different events into different columns in pyspark

Question

I have an RDD from which I need to extract counts of multiple events. The initial RDD looks like this:

+----------+--------------------+-------------------+
|     event|                user|                day|
+----------+--------------------+-------------------+
|event_x   |user_A              |                  0|
|event_y   |user_A              |                  2|
|event_x   |user_B              |                  2|
|event_y   |user_B              |                  1|
|event_x   |user_A              |                  0|
|event_x   |user_B              |                  1|
|event_y   |user_B              |                  2|
|event_y   |user_A              |                  1|
+----------+--------------------+-------------------+

I need a count column for each type of event (in this case 2 types of events: event_x and event_y), grouped by user and day. So far, I have managed to do it for only one event, resulting in the following:

+--------------------+-------------------+------------+
|                user|                day|count(event)|
+--------------------+-------------------+------------+
|user_A              |                  0|          11|
|user_A              |                  1|           8|
|user_A              |                  2|           4|
|user_B              |                  0|           2|
|user_B              |                  1|           1|
|user_B              |                  2|          25|
+--------------------+-------------------+------------+

But I need arbitrarily many count columns, one for each distinct event that appears in the leftmost column of the first RDD shown above. So, if I only had 2 events (x and y), the result should look something like this:

+--------------------+-------------------+--------------+--------------+
|                user|                day|count(event_x)|count(event_y)|
+--------------------+-------------------+--------------+--------------+
|user_A              |                  0|            11|             3|
|user_A              |                  1|             8|            23|
|user_A              |                  2|             4|             2|
|user_B              |                  0|             2|             0|
|user_B              |                  1|             1|             1|
|user_B              |                  2|            25|            11|
+--------------------+-------------------+--------------+--------------+

The code I have currently is:

rdd = rdd.groupby('user', 'day').agg({'event': 'count'}).orderBy('user', 'day')

What should I do to achieve the desired result?

Thanks in advance ;)

Would doing rdd.groupby('user', 'day', 'event').count().orderBy('user', 'day') work for you? You could probably start with that and then pivot...
– pault, Commented Nov 12, 2019 at 20:54
Thanks for the answer, yours was similar to what Mahesh said. It works =)
– RafaJM, Commented Nov 13, 2019 at 13:28

Mahesh Gupta · Accepted Answer · 2019-11-13 10:25:00Z

You can try groupBy with the pivot option:

df = spark.createDataFrame(
    [["event_x", "user_A", 0], ["event_y", "user_A", 2],
     ["event_x", "user_B", 2], ["event_y", "user_B", 1],
     ["event_x", "user_A", 0], ["event_x", "user_B", 1],
     ["event_y", "user_B", 2], ["event_y", "user_A", 1]],
    ["event", "user", "day"],
)

>>> df.show()
+-------+------+---+
|  event|  user|day|
+-------+------+---+
|event_x|user_A|  0|
|event_y|user_A|  2|
|event_x|user_B|  2|
|event_y|user_B|  1|
|event_x|user_A|  0|
|event_x|user_B|  1|
|event_y|user_B|  2|
|event_y|user_A|  1|
+-------+------+---+

>>> df.groupBy(["user","day"]).pivot("event").agg({"event":"count"}).show()
+------+---+-------+-------+
|  user|day|event_x|event_y|
+------+---+-------+-------+
|user_A|  0|      2|   null|
|user_B|  1|      1|      1|
|user_A|  2|   null|      1|
|user_A|  1|   null|      1|
|user_B|  2|      1|      1|
+------+---+-------+-------+

Please take a look and let me know if you have any questions about it.

Jha Ayush · Answer · answered Oct 13, 2022 at 6:54 · edited by Nikita Shabankin, Oct 17, 2022 at 17:22

temp = temp.groupBy("columnname").pivot("bucket_columnname").agg({"bucket_columnname": "count"})
