I have columns like those shown below and am trying to return the most frequent rating for each gender, but my code only returns the most frequent rating overall, without considering the gender.

DATA :

print (df)

   AGE GENDER rating
0   10      M     PG
1   10      M      R
2   10      M      R
3    4      F   PG13
4    4      F   PG13

CODE :

s = (df.groupby(['AGE', 'GENDER'])['rating']
       .apply(lambda x: x.value_counts().head(2))
       .rename_axis(('a','b', 'c'))
       .reset_index(level=2)['c'])

OUTPUT :

print (s[F])
('PG')

print (s[M])
('PG', 'R')
  • I don't see a question here. Commented Feb 8, 2018 at 8:16
  • What is your question? Commented Feb 8, 2018 at 8:20
  • I am not able to return the highest rating for male and female separately. Commented Feb 8, 2018 at 8:27
  • Update your question. Commented Feb 8, 2018 at 8:27
  • Thanks, I did it @espoir. Commented Feb 8, 2018 at 8:28

2 Answers

Here is a standard library solution for this file:

%%file "test.txt"
gender  rating
M   PG
M   R
F   NR
M   R
F   PG13
F   PG13

Given

import collections as ct


def read_file(fname):
    """Yield (gender, rating) pairs from a whitespace-delimited file."""
    with open(fname, "r") as f:
        next(f)  # skip the header row
        for line in f:
            gender, rating = line.strip().split()
            yield gender, rating

Code

filename = "test.txt"

dd = ct.defaultdict(ct.Counter)
for k, v in sorted(read_file(filename), key=lambda x: x[0]):
    dd[k][v] += 1 

{k: v.most_common(1) for k, v in dd.items()}
# {'F': [('PG13', 2)], 'M': [('R', 2)]}

Details

Each line of the file is parsed and added to a defaultdict. The keys are genders, and the values are Counter objects that count each rating per gender. Counter.most_common() is called to retrieve the top occurrences.

Since the data is grouped by gender, you can explore more information. For example, unique ratings of each gender:

{k: set(v.elements()) for k, v in dd.items()}
# {'F': {'NR', 'PG13'}, 'M': {'PG', 'R'}}
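The same pattern might extend to more grouping columns by keying the defaultdict on a tuple. The following is a minimal sketch, assuming a hypothetical `age_range` column and made-up in-memory rows in place of the file; the name `top` is also just illustrative.

```python
import collections as ct

# Hypothetical rows: (age_range, gender, rating). Not from the original file.
rows = [
    ("young", "M", "PG"),
    ("young", "M", "R"),
    ("young", "M", "R"),
    ("adult", "M", "PG"),
    ("adult", "F", "PG13"),
    ("adult", "F", "PG13"),
    ("young", "F", "NR"),
]

# Key each Counter on the (age_range, gender) combination.
dd = ct.defaultdict(ct.Counter)
for age_range, gender, rating in rows:
    dd[(age_range, gender)][rating] += 1

# Most frequent rating for each combination.
top = {key: counts.most_common(1)[0][0] for key, counts in dd.items()}

print(top[("young", "M")])  # R
print(top[("adult", "F")])  # PG13
```

Using a tuple as the key keeps the counting loop unchanged; only the unpacking and the key expression differ from the single-column version.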

3 Comments

Hey, what if I have an additional column, say age_range, with values like 'young' and 'adult', and I want the top rating for a combination such as young and male?
This code would have to be modified to handle extra columns. If you post a new question, I could address it. I will leave this answer as is.

I think you need groupby + value_counts + head to get the counts per gender and rating:

df1 = (df.groupby('gender')['rating']
         .apply(lambda x: x.value_counts().head(1))
         .rename_axis(('gender','rating'))
         .reset_index(name='val'))
print (df1)
  gender rating  val
0      F   PG13    2
1      M      R    2

If you want only the top rating, select the first value of the index per group:

s = df.groupby('gender')['rating'].apply(lambda x: x.value_counts().index[0])
print (s)
gender
F    PG13
M       R
Name: rating, dtype: object

print (s['M'])
R
print (s['F'])
PG13

Or, for only the top counts, select the first value of the Series per group:

s = df.groupby('gender')['rating'].apply(lambda x: x.value_counts().iat[0])
print (s)
gender
F    2
M    2
Name: rating, dtype: int64

print (s['M'])
2
print (s['F'])
2

EDIT:

s = df.groupby('gender')['rating'].apply(lambda x: x.value_counts().index[0])

def gen_mpaa(gender):
    return s[gender]

print (gen_mpaa('M'))

print (gen_mpaa('F'))
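For reference, here is a self-contained sketch of the lookup function. The sample DataFrame is an assumption: it reuses the six gender/rating rows from the test.txt file in the other answer, which happen to match the counts printed earlier.

```python
import pandas as pd

# Assumed sample data (same rows as test.txt in the other answer).
df = pd.DataFrame({
    "gender": ["M", "M", "F", "M", "F", "F"],
    "rating": ["PG", "R", "NR", "R", "PG13", "PG13"],
})

# value_counts() sorts by frequency, so index[0] is the most frequent rating.
s = df.groupby("gender")["rating"].apply(lambda x: x.value_counts().index[0])

def gen_mpaa(gender):
    """Return the most frequent rating for the given gender."""
    return s[gender]

print(gen_mpaa("M"))  # R
print(gen_mpaa("F"))  # PG13
```

If two ratings are tied for first place within a group, which one comes first is not something I would rely on; printing the full `value_counts()` per group shows whether a tie exists.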

EDIT:

Solution if genre id values are strings:

print (type(df.loc[0, 'genre id']))
<class 'str'>

df = df.set_index('gender')['genre id'].str.split(',', expand=True).stack()
print (df)
gender   
M       0    11
        1    22
        2    33
        0    22
        1    44
        2    55
        0    33
        1    44
        2    55
F       0    11
        1    22
        0    22
        1    55
        0    55
        1    44
dtype: object

d = df.groupby(level=0).apply(lambda x: x.value_counts().index[0]).to_dict()
print (d)
{'M': '55', 'F': '55'}
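As an alternative sketch for the string case, `Series.explode` can stand in for `expand=True` plus `stack`. The data below is hypothetical and chosen so that no group has a tie for the top value; it is not the data from the question.

```python
import pandas as pd

# Hypothetical data: comma-separated genre ids stored as strings.
df = pd.DataFrame({
    "gender": ["M", "M", "M", "F", "F"],
    "genre id": ["11,22,33", "22,44", "22,55", "11,55", "55,44"],
})

# Split each string into a list, then give every id its own row.
genres = df.set_index("gender")["genre id"].str.split(",").explode()

# Most frequent genre id per gender, collected into a plain dict.
d = {gender: ids.value_counts().index[0]
     for gender, ids in genres.groupby(level=0)}

print(d["M"])  # 22
print(d["F"])  # 55
```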

EDIT1:

print (df)
   AGE GENDER rating
0   10      M     PG
1   10      M      R
2   10      M      R
3    4      F   PG13
4    4      F   PG13

s = (df.groupby(['AGE', 'GENDER'])['rating']
       .apply(lambda x: x.value_counts().head(2))
       .rename_axis(('a','b', 'c'))
       .reset_index(level=2)['c'])
print (s)

a   b
4   F    PG13
10  M       R
    M      PG
Name: c, dtype: object
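If a plain tuple of the top ratings per group is wanted instead, for example ('R', 'PG') when two ratings exist and ('PG13',) when there is only one, a dict comprehension over the groups is one possible sketch. The name `top` is just illustrative, and the data is the sample from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "AGE": [10, 10, 10, 4, 4],
    "GENDER": ["M", "M", "M", "F", "F"],
    "rating": ["PG", "R", "R", "PG13", "PG13"],
})

# For each (AGE, GENDER) group, keep up to two ratings, most frequent first.
top = {
    key: tuple(ratings.value_counts().head(2).index)
    for key, ratings in df.groupby(["AGE", "GENDER"])["rating"]
}

print(top[(10, "M")])  # ('R', 'PG')
print(top[(4, "F")])   # ('PG13',)
```

Note that a one-element tuple prints with a trailing comma, ('PG13',), since ('PG13') without the comma is just a string in Python.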

44 Comments

When I insert " s = df.groupby('gender')['rating'].apply(lambda x: x.value_counts().index[0]) " and return s, it throws an error. I just need to pass gender as the input to the function, and it should give me the most frequent rating directly.
It's just giving me the same rating for both genders, like F - PG13 and M - PG13.
Hmm, maybe the top counts are the same; you can check with print(df.groupby('gender')['rating'].value_counts())
This is my input: gen_mpaa('F'). The output is GENDERCODE F PG13 M PG13 U PG13
Jez, I guess this will solve my problem. Instead of 10 M R M PG as my output, can't it return R and PG if top 1 and top 2 are both there, and PG if only top 1 is there? That is, ('R', 'PG') if two, ('PG') if only top 1.