5

I have a Data Set that is available here

It gives us a DataFrame like

df=pd.read_csv('https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user', sep='|')
df.head()

    user_id age  gender occupation        zip_code
    1       24   M        technician        85711
    2       53   F        other             94043
    3       23   M        writer            32067
    4       24   M        technician        43537
    5       33   F        other             15213

I want to find out what is the ratio of Males:Females in each occupation

I have used the given function below but this is not the most optimal approach.

df.groupby(['occupation', 'gender']).agg({'gender':'count'}).div(df.groupby('occupation').agg('count'), level='occupation')['gender']*100

That gives us the result something like

occupation     gender
administrator  F          45.569620
               M          54.430380
artist         F          46.428571
               M          53.571429

The above answer is in a very different format as I want something like: (demo)

occupation      M:F

programmer      2:3
farmer          7:2

Can somebody please tell me how to make own aggregation functions?

3
  • you can add unstack at the end of your function (df.groupby(["occupation", "gender"]).agg({"gender": "count"}).div( df.groupby("occupation").agg("count"), level="occupation" ).unstack('gender')["gender"] * 100) but I don't understand how you get 2:3 and 7:2? Commented Oct 3, 2019 at 12:53
  • 1
    This is the processing I want to achieve. It doesn't have to be 2:3 but can be 0.5 or [30/70] or <5/7>. Anything would do just fi e. Commented Oct 3, 2019 at 13:39
  • the others beat me too it, best of luck, I think Quang's answer is bang on the mark. Commented Oct 3, 2019 at 14:07

3 Answers 3

2

Actually, pandas has built-in value_counts(normalized=True) for computing the value count. Then you can play with the number a bit:

new_df = (df.groupby('occupation')['gender']
            .value_counts(normalize=True)     # this gives normalized counts: 0.45
            .unstack('gender', fill_value=0)
            .round(2)                         # get two significant digits
            .mul(100)                         # get the percentage
            .astype(int)                      # get rid of .0000
            .astype(str)                      # turn to string
         )

new_df['F:M'] = new_df['F'] + ':' + new_df['M']

new_df.head()

Output:

gender          F    M    F:M
occupation                   
administrator  46   54  46:54
artist         46   54  46:54
doctor          0  100  0:100
educator       27   73  27:73
engineer        3   97   3:97
Sign up to request clarification or add additional context in comments.

Comments

1

It is pretty easy actually. Every group after groupby is a dataframe (a part of initial dataframe) so you can apply your own functions to process this partial dataframe. You may add print statements inside compute_gender_ratio and see what df is.

import pandas as pd

data = pd.read_csv(
    'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user',
    sep='|')


def compute_gender_ratio(df):
    gender_count = df['gender'].value_counts()
    return f"{gender_count.get('M', 0)}:{gender_count.get('F', 0)}"


result = data.groupby('occupation').apply(compute_gender_ratio)
result_df = result.to_frame(name='M:F')

result_df is:

                  M:F
occupation           
administrator   43:36
artist          15:13
doctor            7:0
educator        69:26
engineer         65:2
entertainment    16:2
executive        29:3
healthcare       5:11
homemaker         1:6
lawyer           10:2
librarian       22:29
marketing       16:10
none              5:4
other           69:36
programmer       60:6
retired          13:1
salesman          9:3
scientist        28:3
student        136:60
technician       26:1
writer          26:19

1 Comment

"Every group after groupby is a dataframe (a part of initial dataframe) so you can apply your own functions to process this partial dataframe." - This is a key sentence and opens up a whole new world of understanding. Thank you so so much @Poolka!
0

Does this work for you

df_g = df.groupby(['occupation', 'gender']).count().user_id/df.groupby(['occupation']).count().user_id
df_g = df_g.reset_index()
df_g['ratio'] = df_g['user_id'].apply(lambda x: str(Fraction(x).limit_denominator()).replace('/',':'))

Output

       occupation gender   user_id  ratio
0   administrator      F  0.455696  36:79
1   administrator      M  0.544304  43:79
2          artist      F  0.464286  13:28
3          artist      M  0.535714  15:28
4          doctor      M  1.000000      1
5        educator      F  0.273684  26:95
6        educator      M  0.726316  69:95
7        engineer      F  0.029851   2:67
8        engineer      M  0.970149  65:67
9   entertainment      F  0.111111    1:9
10  entertainment      M  0.888889    8:9
11      executive      F  0.093750   3:32
12      executive      M  0.906250  29:32
13     healthcare      F  0.687500  11:16
14     healthcare      M  0.312500   5:16
15      homemaker      F  0.857143    6:7
16      homemaker      M  0.142857    1:7
17         lawyer      F  0.166667    1:6
18         lawyer      M  0.833333    5:6
19      librarian      F  0.568627  29:51
20      librarian      M  0.431373  22:51
21      marketing      F  0.384615   5:13
22      marketing      M  0.615385   8:13
23           none      F  0.444444    4:9
24           none      M  0.555556    5:9
25          other      F  0.342857  12:35
26          other      M  0.657143  23:35
27     programmer      F  0.090909   1:11
28     programmer      M  0.909091  10:11
29        retired      F  0.071429   1:14
30        retired      M  0.928571  13:14
31       salesman      F  0.250000    1:4
32       salesman      M  0.750000    3:4
33      scientist      F  0.096774   3:31
34      scientist      M  0.903226  28:31
35        student      F  0.306122  15:49
36        student      M  0.693878  34:49
37     technician      F  0.037037   1:27
38     technician      M  0.962963  26:27
39         writer      F  0.422222  19:45
40         writer      M  0.577778  26:45

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.