function to return the highest count value using a rule

Question

I have columns like the ones shown below and I am trying to return the rating with the highest count for each gender, but my code just returns the highest count on rating without considering the gender.

DATA :

print (df)

   AGE GENDER rating
0   10      M     PG
1   10      M      R
2   10      M      R
3    4      F   PG13
4    4      F   PG13

CODE :

s = (df.groupby(['AGE', 'GENDER'])['rating']
       .apply(lambda x: x.value_counts().head(2))
       .rename_axis(('a','b', 'c'))
       .reset_index(level=2)['c'])

OUTPUT :

print (s['F'])
('PG')

print (s['M'])
('PG', 'R')

I am not able to return the highest rating for male and female separately.
– pylearner, Commented Feb 8, 2018 at 8:27

pylang · Accepted Answer · 2018-02-10 17:39:45Z

2

Here is a standard library solution for this file:

%%file "test.txt"
gender  rating
M   PG
M   R
F   NR
M   R
F   PG13
F   PG13

Given

import collections as ct


def read_file(fname):
    with open(fname, "r") as f:
        header = next(f)
        for line in f:
            gender, rating = line.strip().split()
            yield gender, rating

Code

filename = "test.txt"

dd = ct.defaultdict(ct.Counter)
for k, v in sorted(read_file(filename), key=lambda x: x[0]):
    dd[k][v] += 1 

{k: v.most_common(1) for k, v in dd.items()}
# {'F': [('PG13', 2)], 'M': [('R', 2)]}

Details

Each line of the file is parsed and added to a defaultdict. The keys are genders, and the values are Counter objects that tally each rating per gender. Counter.most_common() is called to retrieve the top occurrences.

Since the data is grouped by gender, you can explore more information. For example, unique ratings of each gender:

{k: set(v.elements()) for k, v in dd.items()}
# {'F': {'NR', 'PG13'}, 'M': {'PG', 'R'}}
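The same defaultdict-of-Counter pattern can group on more than one column by using a tuple as the key. The following is only a sketch: the rows are copied by hand from the question's DataFrame instead of being read from a file, and `top_ratings` is a hypothetical helper name, not part of the original answer.

```python
import collections as ct

# Rows copied from the question's data: (AGE, GENDER, rating)
rows = [
    (10, "M", "PG"),
    (10, "M", "R"),
    (10, "M", "R"),
    (4, "F", "PG13"),
    (4, "F", "PG13"),
]

# One Counter of ratings per (AGE, GENDER) combination
counts = ct.defaultdict(ct.Counter)
for age, gender, rating in rows:
    counts[(age, gender)][rating] += 1


def top_ratings(counter, n=2):
    """Return up to n ratings as a tuple, most frequent first (hypothetical helper)."""
    return tuple(rating for rating, _ in counter.most_common(n))


top = {key: top_ratings(counter) for key, counter in counts.items()}
print(top)
# {(10, 'M'): ('R', 'PG'), (4, 'F'): ('PG13',)}
```

A group with only one distinct rating yields a one-element tuple, so the caller does not need a special case for it.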

edited Feb 10, 2018 at 17:39

answered Feb 8, 2018 at 10:31

pylang


3 Comments

pylearner Over a year ago

Hey, what if I have an additional column, say age_range, with values like 'young' and 'adult', and I want the top rating for a combination such as young and male?

pylang Over a year ago

This code would have to be modified to handle extra columns. If you post a new question, I could address it. I will leave this answer as is.

pylearner Over a year ago

stackoverflow.com/questions/48719674/… ...can you look in here

jezrael · Accepted Answer · 2018-02-13 12:49:50Z

1

I think you need groupby + value_counts + head to get the counts per gender and rating:

df1 = (df.groupby('gender')['rating']
         .apply(lambda x: x.value_counts().head(1))
         .rename_axis(('gender','rating'))
         .reset_index(name='val'))
print (df1)
  gender rating  val
0      F   PG13    2
1      M      R    2

If you want only the top ratings, select the first value of the index per group:

s = df.groupby('gender')['rating'].apply(lambda x: x.value_counts().index[0])
print (s)
gender
F    PG13
M       R
Name: rating, dtype: object

print (s['M'])
R
print (s['F'])
PG13

Or, for only the top counts, select the first value of the Series per group:

s = df.groupby('gender')['rating'].apply(lambda x: x.value_counts().iat[0])
print (s)
gender
F    2
M    2
Name: rating, dtype: int64

print (s['M'])
2
print (s['F'])
2

EDIT:

s = df.groupby('gender')['rating'].apply(lambda x: x.value_counts().index[0])

def gen_mpaa(gender):
    return s[gender]

print (gen_mpaa('M'))

print (gen_mpaa('F'))

EDIT:

Solution if genre id values are strings:

print (type(df.loc[0, 'genre id']))
<class 'str'>

df = df.set_index('gender')['genre id'].str.split(',', expand=True).stack()
print (df)
gender   
M       0    11
        1    22
        2    33
        0    22
        1    44
        2    55
        0    33
        1    44
        2    55
F       0    11
        1    22
        0    22
        1    55
        0    55
        1    44
dtype: object

d = df.groupby(level=0).apply(lambda x: x.value_counts().index[0]).to_dict()
print (d)
{'M': '55', 'F': '55'}

EDIT1:

print (df)
   AGE GENDER rating
0   10      M     PG
1   10      M      R
2   10      M      R
3    4      F   PG13
4    4      F   PG13

s = (df.groupby(['AGE', 'GENDER'])['rating']
       .apply(lambda x: x.value_counts().head(2))
       .rename_axis(('a','b', 'c'))
       .reset_index(level=2)['c'])
print (s)

a   b
4   F    PG13
10  M       R
    M      PG
Name: c, dtype: object
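If one tuple of ratings per group is easier to work with than the multi-row Series above, a possible variation is to build the tuple inside apply. This is a sketch, assuming pandas is installed and the same five rows as in the question; `top2` is just an illustrative variable name.

```python
import pandas as pd

# Same rows as the DataFrame printed in the question
df = pd.DataFrame({
    "AGE": [10, 10, 10, 4, 4],
    "GENDER": ["M", "M", "M", "F", "F"],
    "rating": ["PG", "R", "R", "PG13", "PG13"],
})

# value_counts() sorts by descending count, so index[:2] holds
# at most the two most frequent ratings of each group
top2 = (df.groupby(["AGE", "GENDER"])["rating"]
          .apply(lambda x: tuple(x.value_counts().index[:2])))

print(top2.loc[(10, "M")])
print(top2.loc[(4, "F")])
```

With this data the group (10, 'M') should give both of its ratings with 'R' first, and (4, 'F') should give a one-element tuple containing 'PG13'.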

edited Feb 13, 2018 at 12:49

answered Feb 8, 2018 at 8:37

jezrael


44 Comments

pylearner Over a year ago

When I insert "s = df.groupby('gender')['rating'].apply(lambda x: x.value_counts().index[0])" and return s, it throws an error. I just need to pass gender as the input to the function and it should give me the most frequent rating directly.

pylearner Over a year ago

It's just giving me the same rating for both genders, like F - PG13 and M - PG13.

jezrael Over a year ago

Hmm, maybe several ratings share the same top count; you can check it with print(df.groupby('gender')['rating'].value_counts())

pylearner Over a year ago

This is my input gen_mpaa('F') , the out put is GENDERCODE F PG13 M PG13 U PG13

pylearner Over a year ago

Jez, I guess this will solve my problem: instead of "10 M R, M PG" as my output, can it return R and PG if top 1 and top 2 are there, and PG if only top 1 is there? That is, ('R', 'PG') if two, ('PG') if only top 1.
