finding dataframe group count

Question

I have a dataframe like this:

       customer         genre
0      cust1           |BIOPIC|DRAMA|
1      cust2           |COMEDY|DRAMA|ROMANCE|
2      cust1           |DRAMA|THRILLER|
3      cust3           |COMEDY|HORROR|
4      cust4           |HISTORY|ROMANCE|WAR|
5      cust3           |ADVENTURE|COMEDY|
6      cust2           |ACTION|DRAMA|THRILLER|
7      cust1           |CRIME|DRAMA|THRILLER|
8      cust3           |HISTORY|ROMANCE|WAR|
9      cust2           |ADVENTURE|COMEDY|
10     cust4           |BIOPIC|DRAMA|HISTORY|THRILLER|

I need to find how many times each customer made a transaction (customer count) and their respective genre counts, e.g. cust1 DRAMA = 3, cust1 THRILLER = 2, and likewise for every customer.

I found each customer's count with

df = df.groupby(['customer']).size()

I know how to filter out the genres and get the counts if they were in a list, but I am confused about how to proceed for each customer group and get each customer's individual genre counts. That means splitting the genre string on the | delimiter and extracting the fields.

Please suggest.

Chris McDonald · Accepted Answer · 2015-11-16 11:48:18Z

The str.get_dummies method is perfect for this sort of thing! It works just like pd.get_dummies but operates on strings and lets you specify a delimiter. Assuming your dataframe is named df and its columns are named Customers and Genres (rename them to match your own), the code below does what you're after:

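import pandas as pd

# Append one 0/1 indicator column per genre found in the "Genres" strings
df = pd.concat([df, df.Genres.str.get_dummies(sep='|')], axis=1)
# Sum the indicator columns per customer; numeric_only=True leaves out the
# original string column "Genres" instead of trying to add it up
df = df.groupby("Customers").sum(numeric_only=True)

print(df)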

output:

           ACTION  ADVENTURE  BIOPIC  COMEDY  CRIME  DRAMA  HISTORY  HORROR  \
Customers
cust1           0          0       1       0      1      3        0       0
cust2           1          1       0       2      0      2        0       0
cust3           0          1       0       2      0      0        1       1
cust4           0          0       1       0      0      1        2       0

To explain a bit, the str.get_dummies method makes a new column for every distinct value it sees in the specified column, then marks a 1 in the rows where that value is present and a 0 elsewhere. The groupby step then clusters the rows by customer, and the sum adds up each indicator column within a cluster. Older pandas versions silently dropped columns that could not be added, in this case the original Genres column; newer versions concatenate the strings instead unless you pass numeric_only=True to the sum or drop that column first.
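For a self-contained run of the steps described above, here is a minimal sketch that rebuilds the sample data from the question. The column names Customers and Genres follow the answer's code and are an assumption about the real frame, and the variable names (dummies, genre_counts, transactions) are only illustrative. It groups the indicator columns directly, so the string column never enters the sum.

```python
import pandas as pd

# Sample data copied from the question. The column names "Customers" and
# "Genres" are assumed to match the answer's code, not the asker's real frame.
df = pd.DataFrame({
    "Customers": ["cust1", "cust2", "cust1", "cust3", "cust4", "cust3",
                  "cust2", "cust1", "cust3", "cust2", "cust4"],
    "Genres": ["|BIOPIC|DRAMA|", "|COMEDY|DRAMA|ROMANCE|", "|DRAMA|THRILLER|",
               "|COMEDY|HORROR|", "|HISTORY|ROMANCE|WAR|", "|ADVENTURE|COMEDY|",
               "|ACTION|DRAMA|THRILLER|", "|CRIME|DRAMA|THRILLER|",
               "|HISTORY|ROMANCE|WAR|", "|ADVENTURE|COMEDY|",
               "|BIOPIC|DRAMA|HISTORY|THRILLER|"],
})

# One 0/1 indicator column per genre; the empty pieces produced by the
# leading and trailing "|" do not become columns.
dummies = df["Genres"].str.get_dummies(sep="|")

# Sum the indicators within each customer to get per-customer genre counts.
genre_counts = dummies.groupby(df["Customers"]).sum()

# Number of transactions (rows) per customer, as in the question.
transactions = df.groupby("Customers").size()

print(genre_counts.loc["cust1", "DRAMA"])     # 3
print(genre_counts.loc["cust1", "THRILLER"])  # 2
print(transactions["cust1"])                  # 3
```

Looking up genre_counts.loc[customer, genre] gives the "cust1 DRAMA = 3" style of answer the question asks for.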


3 Comments

Satya Over a year ago

Can we make a multi-index kind of thing under a genre field, keeping each genre category (action, adventure, comedy) under that parent genre field?

Satya Over a year ago

Also, I don't want the customer field as the index. If I do a reset_index(), I get an extra field named "unnamed 0" and an index named 0, and then I need to delete that extra "unnamed 0" field. Is there a way to get the normal index, with customer as a field, in that code? It becomes hectic to proceed further with the customer field as the index.

Chris McDonald Over a year ago

Resetting the index after the groupby and aggregate gives you a new column titled 'Customers' with the standard numeric index. I'm not sure how to do what you want with the multi-index; you might consider asking a new question.
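As a rough illustration of the reset_index() behaviour mentioned in this comment, here is a small sketch on a three-row subset of the question's data. The names counts, flat and nested are hypothetical. The last step, which nests the genre columns under a parent "Genres" level through the keys of pd.concat, is only one possible way to get the multi-index layout asked about and is not something the answer proposes.

```python
import pandas as pd

# Three-row subset of the question's data, with the answer's column names.
df = pd.DataFrame({
    "Customers": ["cust1", "cust2", "cust1"],
    "Genres": ["|BIOPIC|DRAMA|", "|COMEDY|DRAMA|ROMANCE|", "|DRAMA|THRILLER|"],
})

# Per-customer genre counts; "Customers" ends up as the index.
counts = df["Genres"].str.get_dummies(sep="|").groupby(df["Customers"]).sum()

# reset_index() moves "Customers" back into an ordinary column and restores
# the default 0..n-1 integer index.
flat = counts.reset_index()
print(flat.columns[0])    # Customers
print(list(flat.index))   # [0, 1]

# Assumption: one way to put every genre under a parent "Genres" level is to
# concatenate along the columns with a dict key, giving two-level columns.
nested = pd.concat({"Genres": counts}, axis=1)
print(nested[("Genres", "DRAMA")]["cust1"])  # 2
```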
