I have an rdd from which I need to extract the counts of multiple events. Initially, the rdd looks like this:

+----------+--------------------+-------------------+
|     event|                user|                day|
+----------+--------------------+-------------------+
|event_x   |user_A              |                  0|
|event_y   |user_A              |                  2|
|event_x   |user_B              |                  2|
|event_y   |user_B              |                  1|
|event_x   |user_A              |                  0|
|event_x   |user_B              |                  1|
|event_y   |user_B              |                  2|
|event_y   |user_A              |                  1|
+----------+--------------------+-------------------+

I need a count column for each type of event (in this case two types: event_x and event_y), grouped by user and day. So far, I have managed to do this for only one event, which gives the following:

+--------------------+-------------------+------------+
|                user|                day|count(event)|
+--------------------+-------------------+------------+
|user_A              |                  0|          11|
|user_A              |                  1|           8|
|user_A              |                  2|           4|
|user_B              |                  0|           2|
|user_B              |                  1|           1|
|user_B              |                  2|          25|
+--------------------+-------------------+------------+

But I need arbitrarily many columns: one count column for each distinct event that appears in the leftmost column of the first rdd shown above. So, if I only had 2 events (x and y), the result should look something like this:

+--------------------+-------------------+--------------+--------------+
|                user|                day|count(event_x)|count(event_y)|
+--------------------+-------------------+--------------+--------------+
|user_A              |                  0|            11|             3|
|user_A              |                  1|             8|            23| 
|user_A              |                  2|             4|             2|
|user_B              |                  0|             2|             0|
|user_B              |                  1|             1|             1|
|user_B              |                  2|            25|            11|
+--------------------+-------------------+--------------+--------------+

The code I have currently is:

rdd = rdd.groupby('user', 'day').agg({'event': 'count'}).orderBy('user', 'day')

What should I do to achieve the desired result?

Thanks in advance ;)

  • Would doing rdd.groupby('user', 'day', 'event').count().orderBy('user', 'day') work for you? You could probably start with that and then pivot... Commented Nov 12, 2019 at 20:54
  • Thanks for the answer, yours was similar to what Mahesh said. It works =) Commented Nov 13, 2019 at 13:28

2 Answers

You can try groupBy with the pivot option:

df = spark.createDataFrame(
    [["event_x", "user_A", 0], ["event_y", "user_A", 2], ["event_x", "user_B", 2], ["event_y", "user_B", 1],
     ["event_x", "user_A", 0], ["event_x", "user_B", 1], ["event_y", "user_B", 2], ["event_y", "user_A", 1]],
    ["event", "user", "day"],
)

>>> df.show()
+-------+------+---+                                                            
|  event|  user|day|
+-------+------+---+
|event_x|user_A|  0|
|event_y|user_A|  2|
|event_x|user_B|  2|
|event_y|user_B|  1|
|event_x|user_A|  0|
|event_x|user_B|  1|
|event_y|user_B|  2|
|event_y|user_A|  1|
+-------+------+---+

>>> df.groupBy(["user","day"]).pivot("event").agg({"event":"count"}).show()
+------+---+-------+-------+
|  user|day|event_x|event_y|
+------+---+-------+-------+
|user_A|  0|      2|   null|
|user_B|  1|      1|      1|
|user_A|  2|   null|      1|
|user_A|  1|   null|      1|
|user_B|  2|      1|      1|
+------+---+-------+-------+

Please have a look and let me know if you have any questions about it.

temp = temp.groupBy("columnname").pivot("bucket_columnname").agg({"bucket_columnname": "count"})
