0

i have a dataframe like

       customer         genre
0      cust1           |BIOPIC|DRAMA|
1      cust2           |COMEDY|DRAMA|ROMANCE|
2      cust1           |DRAMA|THRILLER|
3      cust3           |COMEDY|HORROR|
4      cust4           |HISTORY|ROMANCE|WAR|
5      cust3           |ADVENTURE|COMEDY|
6      cust2           |ACTION|DRAMA|THRILLER|
7      cust1           |CRIME|DRAMA|THRILLER|
8      cust3           |HISTORY|ROMANCE|WAR|
9      cust2           |ADVENTURE|COMEDY|
10     cust4           |BIOPIC|DRAMA|HISTORY|THRILLER|

I need = how many times each customer did transaction(customer count) and their respective genre count.Eg. cust1 DRAMA = 3, cust1 THRILLER = 2,like wise for each customer's.

I did found the each customer count by

df = df.groupby(['cust']).size()

then i know how to filter out the genres and getting the count if it was within a LIST , but i am getting confused with how to proceed with each group of customer and getting the count for each customer's indivisual genre count.

filtering(|) from genre expression and getting the fields out.

please suggest.

1 Answer 1

1

The feature str.get_dummies is perfect for this sort of thing! It works just like the dataframe version but on strings and allows you to specify a delimiter. Assuming your dataframe is named df, then the below code does what you're after:

import pandas as pd
import numpy as np   
df = pd.concat([df, df.Genres.str.get_dummies(sep='|')], axis=1)
df = df.groupby("Customers").aggregate(np.sum)

print(df)

output:

           ACTION  ADVENTURE  BIOPIC  COMEDY  CRIME  DRAMA  HISTORY  HORROR  \
Customers
cust1           0          0       1       0      1      3        0       0
cust2           1          1       0       2      0      2        0       0
cust3           0          1       0       2      0      0        1       1
cust4           0          0       1       0      0      1        2       0

To explain a bit, str.get_dummies method makes a new column for every value it sees in the column specified and then marks a 1 for the values present and a 0 elsewhere. The GroupBy and Aggregate methods make clusters according to the customers and add up the columns. Aggregate will silently drop columns which it can't add, in this case the original Genres column.

Sign up to request clarification or add additional context in comments.

3 Comments

can we make a multi-index kind of thing under the genre field and keep each genre category(action, adventure,comedy) under that parent field genre field?
And also i don't want customer field as index. If i am doing a reset_index(), i am getting an extra field unnamed o as field and an index named o.then i need to delete that extra unnamed 0 field. Is there a way to get the normal index ,customer as a field in that code. Because it becomes hectic to proceed further with a customer field as index.
Resetting the index after the groupby and aggregate gives you a new column titled 'Customers' with the standard numeric index. I'm not sure how to do what you want with the multi-index, you might consider asking a new question.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.