
I have the following data in two tables:

  1. songs
  2. play_event

In songs the data is as below:

song_id  total_plays
1        2000
2        4532
3        9999
4        2343

And in play_event the data is as below:

user_id song_id
102         1
103         4
102         1
102         3
104         2
102         1

Each time a song is played there is a new entry, even if the same song has been played before.

With this data I want to:

  1. Get the total number of times each user played each song. For example, user_id 102 played song_id 1 three times in the data above. I want the counts grouped by user_id, something like below:

    user_id  song_id  count
    102      1        3
    102      3        1
    103      4        1
    104      2        1
    

I am thinking of using pandas to do this, but I want to know whether pandas is the right choice.

If it is not pandas, then what should be my way forward?

If Pandas is the right choice, then:

The code below lets me get the count grouped by user_id or grouped by song_id, but how do I get the count grouped by user_id and song_id together? Here is the sample code I tried:

import pandas as pd

# Load data from csv file
# (pd.DataFrame.from_csv is deprecated/removed; use pd.read_csv instead)
data = pd.read_csv('play_events.csv')

# Gives how many entries per user
data['user_id'].value_counts()

# Gives how many entries per song
data['song_id'].value_counts()
  • It seems that your first dataframe is not relevant to the issue. Am I missing something? Commented Oct 11, 2018 at 19:45
  • @sacul True, I added it to give context. Commented Oct 11, 2018 at 19:46
  • Since your data gets updated frequently and ends up in a data store, you might as well consider a database/RDBMS and do your data operations with SQL. Commented Oct 11, 2018 at 19:56
  • @cryptonome Can I achieve this with Elasticsearch, so that I can scale better? The strategy was to do this once a day during low-traffic hours. Commented Oct 11, 2018 at 20:09
  • I'm not familiar with Elasticsearch, but cursory research (like, 5 minutes) says no. Any decent RDBMS can do what you want, but since I don't know what you're expecting in terms of scaling, some choices like SQLite can probably be left out of the pool. I'd say PostgreSQL would be my automatic choice, but there's more to database planning than just which product to choose, so research your requirements and consult your friendly DBA. Commented Oct 11, 2018 at 20:24

1 Answer


For your first problem, a simple groupby and value_counts does the trick. Note that everything after value_counts() in the code below is just to get the result into an actual dataframe in the same format as your desired output.

counts = play_events.groupby('user_id')['song_id'].value_counts().to_frame('count').reset_index()

>>> counts
   user_id  song_id  count
0      102        1      3
1      102        3      1
2      103        4      1
3      104        2      1
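An equivalent way to get the same counts, if you prefer to be explicit about grouping on both columns, is groupby with both keys plus size(). This is a sketch using a hand-built play_events frame matching the question's data, rather than the poster's CSV:

```python
import pandas as pd

# Hypothetical play_events frame, mirroring the question's sample rows
play_events = pd.DataFrame({
    'user_id': [102, 103, 102, 102, 104, 102],
    'song_id': [1, 4, 1, 3, 2, 1],
})

# size() counts the rows in each (user_id, song_id) group;
# reset_index(name='count') turns the result back into a flat dataframe
counts = (play_events
          .groupby(['user_id', 'song_id'])
          .size()
          .reset_index(name='count'))
```

This produces the same four rows as the value_counts() version above.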

Then for your second problem (which you have deleted in your edited post, but I will leave here in case it is useful to you), you can loop through counts, grouping by user_id, and save each group as a csv:

for user, data in counts.groupby('user_id', as_index=False):
    data.to_csv(str(user)+'_events.csv')

For your example dataframes, this gives you three csvs: 102_events.csv, 103_events.csv, and 104_events.csv. The first looks like:

   user_id  song_id  count
0      102        1      3
1      102        3      1
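If the leading row-index column in those files is unwanted, passing index=False to to_csv drops it. A self-contained sketch of the loop, assuming the counts frame built above and writing to a temporary directory rather than the working directory:

```python
import os
import tempfile
import pandas as pd

# Hypothetical counts frame, as produced by the groupby above
counts = pd.DataFrame({
    'user_id': [102, 102, 103, 104],
    'song_id': [1, 3, 4, 2],
    'count':   [3, 1, 1, 1],
})

out_dir = tempfile.mkdtemp()
for user, data in counts.groupby('user_id'):
    # index=False keeps the pandas row index out of the csv
    path = os.path.join(out_dir, f'{user}_events.csv')
    data.to_csv(path, index=False)
```

Each file then contains only the user_id, song_id, and count columns.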

1 Comment

I deleted the second part, thinking that would narrow my question to the key problem.
