
I have the following data in two tables:

  1. songs
  2. play_event

In songs the data is as below:

song_id  total_plays
1        2000
2        4532
3        9999
4        2343

And in play_event the data is as below:

user_id song_id
102         1
103         4
102         1
102         3
104         2
102         1

Each time a song is played there is a new entry, even if the same song has been played before.

With this data I want to:

  1. Get the total number of times each user played each song. For example, user_id 102 played song_id 1 three times in the data above. I want the counts grouped by user_id, something like below:

    user_id  song_id  count
    102      1        3
    102      3        1
    103      4        1
    104      2        1
    

I am thinking of using pandas to do this, but I want to know whether pandas is the right choice.

If it is not pandas, then what should be my way forward?

If Pandas is the right choice, then:

The code below lets me get the count grouped by user_id or grouped by song_id, but how do I get the count grouped by user_id and song_id together? Here is the sample code I tried:

import pandas as pd

# Load data from csv file
# (pd.DataFrame.from_csv is deprecated/removed; use pd.read_csv instead)
data = pd.read_csv('play_events.csv')

# Gives how many entries per user
data['user_id'].value_counts()

# Gives how many entries per song
data['song_id'].value_counts()
  • It seems that your first dataframe is not relevant to the issue. Am I missing something? Commented Oct 11, 2018 at 19:45
  • @sacul True, I added it to give context. Commented Oct 11, 2018 at 19:46
  • Since your data gets updated frequently and ends up in a data store, you might as well consider a database/RDBMS and do your data operations with SQL. Commented Oct 11, 2018 at 19:56
  • @cryptonome Can I achieve this with Elasticsearch, so that I can scale better? The strategy was to do this once a day during low-traffic hours. Commented Oct 11, 2018 at 20:09
  • I'm not familiar with Elasticsearch, but cursory research (like, 5 minutes) says no. Any decent RDBMS can do what you want, but since I don't know what you're expecting in terms of scaling, some choices like SQLite can probably be left out of the pool. I'd say PostgreSQL would be my automatic choice, but there's more to database planning than just which product to choose, so research your requirements and consult your friendly DBA. Commented Oct 11, 2018 at 20:24

1 Answer


For your first problem, a simple groupby and value_counts does the trick. Note that everything after value_counts() in the code below is just to get the result into an actual dataframe in the same format as your desired output.

counts = play_events.groupby('user_id')['song_id'].value_counts().to_frame('count').reset_index()

>>> counts
   user_id  song_id  count
0      102        1      3
1      102        3      1
2      103        4      1
3      104        2      1
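An equivalent way to get the same counts, if you prefer to be explicit about grouping on both columns, is groupby with both keys plus size(). This is a sketch using a hand-built play_events frame matching the question's data, rather than the poster's CSV:

```python
import pandas as pd

# Hypothetical play_events frame, mirroring the question's sample rows
play_events = pd.DataFrame({
    'user_id': [102, 103, 102, 102, 104, 102],
    'song_id': [1, 4, 1, 3, 2, 1],
})

# size() counts the rows in each (user_id, song_id) group;
# reset_index(name='count') turns the result back into a flat dataframe
counts = (play_events
          .groupby(['user_id', 'song_id'])
          .size()
          .reset_index(name='count'))
```

This produces the same four rows as the value_counts() version above.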

Then for your second problem (which you have deleted in your edited post, but I will leave here in case it is useful to you), you can loop through counts, grouping by user_id, and save each group as a csv:

for user, data in counts.groupby('user_id', as_index=False):
    data.to_csv(str(user)+'_events.csv')

For your example dataframes, this gives you three csvs: 102_events.csv, 103_events.csv, and 104_events.csv. The first looks like:

   user_id  song_id  count
0      102        1      3
1      102        3      1
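If the leading row-index column in those files is unwanted, passing index=False to to_csv drops it. A self-contained sketch of the loop, assuming the counts frame built above and writing to a temporary directory rather than the working directory:

```python
import os
import tempfile
import pandas as pd

# Hypothetical counts frame, as produced by the groupby above
counts = pd.DataFrame({
    'user_id': [102, 102, 103, 104],
    'song_id': [1, 3, 4, 2],
    'count':   [3, 1, 1, 1],
})

out_dir = tempfile.mkdtemp()
for user, data in counts.groupby('user_id'):
    # index=False keeps the pandas row index out of the csv
    path = os.path.join(out_dir, f'{user}_events.csv')
    data.to_csv(path, index=False)
```

Each file then contains only the user_id, song_id, and count columns.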

1 Comment

I deleted the second part, thinking that would narrow my question to the key problem.
