Writing own custom aggregation function for groupby

Question

I have a Data Set that is available here

It gives us a DataFrame like

df=pd.read_csv('https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user', sep='|')
df.head()

    user_id age  gender occupation        zip_code
    1       24   M        technician        85711
    2       53   F        other             94043
    3       23   M        writer            32067
    4       24   M        technician        43537
    5       33   F        other             15213

I want to find out what is the ratio of Males:Females in each occupation

I have used the given function below but this is not the most optimal approach.

df.groupby(['occupation', 'gender']).agg({'gender':'count'}).div(df.groupby('occupation').agg('count'), level='occupation')['gender']*100

That gives us the result something like

occupation     gender
administrator  F          45.569620
               M          54.430380
artist         F          46.428571
               M          53.571429

The above answer is in a very different format as I want something like: (demo)

occupation      M:F

programmer      2:3
farmer          7:2

Can somebody please tell me how to make own aggregation functions?

you can add unstack at the end of your function (df.groupby(["occupation", "gender"]).agg({"gender": "count"}).div( df.groupby("occupation").agg("count"), level="occupation" ).unstack('gender')["gender"] * 100) but I don't understand how you get 2:3 and 7:2? — Umar.H
– Umar.H, Commented Oct 3, 2019 at 12:53
This is the processing I want to achieve. It doesn't have to be 2:3 but can be 0.5 or [30/70] or <5/7>. Anything would do just fi e. — Deshwal
– Deshwal, Commented Oct 3, 2019 at 13:39
the others beat me too it, best of luck, I think Quang's answer is bang on the mark. — Umar.H
– Umar.H, Commented Oct 3, 2019 at 14:07

Quang Hoang · Accepted Answer · 2019-10-03 14:34:14Z

2

Actually, pandas has built-in value_counts(normalized=True) for computing the value count. Then you can play with the number a bit:

new_df = (df.groupby('occupation')['gender']
            .value_counts(normalize=True)     # this gives normalized counts: 0.45
            .unstack('gender', fill_value=0)
            .round(2)                         # get two significant digits
            .mul(100)                         # get the percentage
            .astype(int)                      # get rid of .0000
            .astype(str)                      # turn to string
         )

new_df['F:M'] = new_df['F'] + ':' + new_df['M']

new_df.head()

Output:

gender          F    M    F:M
occupation                   
administrator  46   54  46:54
artist         46   54  46:54
doctor          0  100  0:100
educator       27   73  27:73
engineer        3   97   3:97

edited Oct 3, 2019 at 14:34

answered Oct 3, 2019 at 14:06

Quang Hoang

151k11 gold badges64 silver badges86 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

user10325516 · Accepted Answer · 2019-10-03 14:00:05Z

1

It is pretty easy actually. Every group after groupby is a dataframe (a part of initial dataframe) so you can apply your own functions to process this partial dataframe. You may add print statements inside compute_gender_ratio and see what df is.

import pandas as pd

data = pd.read_csv(
    'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user',
    sep='|')


def compute_gender_ratio(df):
    gender_count = df['gender'].value_counts()
    return f"{gender_count.get('M', 0)}:{gender_count.get('F', 0)}"


result = data.groupby('occupation').apply(compute_gender_ratio)
result_df = result.to_frame(name='M:F')

result_df is:

                  M:F
occupation           
administrator   43:36
artist          15:13
doctor            7:0
educator        69:26
engineer         65:2
entertainment    16:2
executive        29:3
healthcare       5:11
homemaker         1:6
lawyer           10:2
librarian       22:29
marketing       16:10
none              5:4
other           69:36
programmer       60:6
retired          13:1
salesman          9:3
scientist        28:3
student        136:60
technician       26:1
writer          26:19

answered Oct 3, 2019 at 14:00

user10325516

1 Comment

jtlz2 Over a year ago

"Every group after groupby is a dataframe (a part of initial dataframe) so you can apply your own functions to process this partial dataframe." - This is a key sentence and opens up a whole new world of understanding. Thank you so so much @Poolka!

Pavan Kumar Polavarapu · Accepted Answer · 2019-10-03 14:23:29Z

Does this work for you

df_g = df.groupby(['occupation', 'gender']).count().user_id/df.groupby(['occupation']).count().user_id
df_g = df_g.reset_index()
df_g['ratio'] = df_g['user_id'].apply(lambda x: str(Fraction(x).limit_denominator()).replace('/',':'))

Output

       occupation gender   user_id  ratio
0   administrator      F  0.455696  36:79
1   administrator      M  0.544304  43:79
2          artist      F  0.464286  13:28
3          artist      M  0.535714  15:28
4          doctor      M  1.000000      1
5        educator      F  0.273684  26:95
6        educator      M  0.726316  69:95
7        engineer      F  0.029851   2:67
8        engineer      M  0.970149  65:67
9   entertainment      F  0.111111    1:9
10  entertainment      M  0.888889    8:9
11      executive      F  0.093750   3:32
12      executive      M  0.906250  29:32
13     healthcare      F  0.687500  11:16
14     healthcare      M  0.312500   5:16
15      homemaker      F  0.857143    6:7
16      homemaker      M  0.142857    1:7
17         lawyer      F  0.166667    1:6
18         lawyer      M  0.833333    5:6
19      librarian      F  0.568627  29:51
20      librarian      M  0.431373  22:51
21      marketing      F  0.384615   5:13
22      marketing      M  0.615385   8:13
23           none      F  0.444444    4:9
24           none      M  0.555556    5:9
25          other      F  0.342857  12:35
26          other      M  0.657143  23:35
27     programmer      F  0.090909   1:11
28     programmer      M  0.909091  10:11
29        retired      F  0.071429   1:14
30        retired      M  0.928571  13:14
31       salesman      F  0.250000    1:4
32       salesman      M  0.750000    3:4
33      scientist      F  0.096774   3:31
34      scientist      M  0.903226  28:31
35        student      F  0.306122  15:49
36        student      M  0.693878  34:49
37     technician      F  0.037037   1:27
38     technician      M  0.962963  26:27
39         writer      F  0.422222  19:45
40         writer      M  0.577778  26:45

Collectives™ on Stack Overflow

Writing own custom aggregation function for groupby

3 Answers 3

Comments

1 Comment

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

1 Comment

Comments

Your Answer

Sign up or log in

Post as a guest

Related