0

bear with me here, i'm new to coding and stack overflow - so i may not be able to describe this succinctly.

i have 2 dataframes, df1 which is being read from a CSV, and has a column (named 'total_income' with NaN values that i want to fill in with a mean based on categorical values in another column (named 'income type').

i created a second data frame with just those two columns, to do a groupby on the 'income_type' column and then found the mean of the 'total_income' column for each unique value in the 'income type' column

i used this:

#groupby types of employment, calculate/fill in the mean
df2 = df1[['income_type','total_income']]
df2 = df2.sort_values('income_type').groupby('income_type').mean().reset_index()

my output is exactly what i am looking for:

                       income_type  total_income
0                     business  32386.793835
1                civil servant  27343.729582
2                     employee  25820.841683
3                 entrepreneur  79866.103000
4  paternity / maternity leave   8612.661000
5                      retiree  21940.394503
6                      student  15712.260000
7                   unemployed  21014.360500

however, now that i have these outputs, i am confused as to how to apply them back to df1 in the same associated columns - so, for all rows that have 'employee' in df1's 'income_type' i want to apply the value 25820.84 (index #2 from df2: 2 employee 25820.841683

I know i can do this one by one by setting variables for each income type,total_income pair, but if i could do it in a loop or function that would be much more tidy.

here is the first 10 columns of df1:

children days_employed dob   education      edu_id family_status family_sts_id gender income_type  debt  total_income
   1     -8437.673028   42  bachelor's degree      0       married          0      F    employee     0     40620.102   
   1     -4024.803754   36  secondary education    1       married          0      F    employee     0     17932.802   
   0     -5623.422610   33  Secondary Education    1       married          0      M    employee     0     23341.752   
   3     -4124.747207   32  secondary education    1       married          0      M    employee     0     42820.568   
   0      340266.072047 53  secondary education    1       civil partnership    1      F     retiree     0     25378.572   
   0     -926.185831    27  bachelor's degree      0       civil partnership    1      M    business     0     40922.170   
   0     -2879.202052   43  bachelor's degree      0       married          0      F    business     0     38484.156   
   0     -152.779569    50  SECONDARY EDUCATION    1       married          0      M    employee     0     21731.829   
   2     -6929.865299   35  BACHELOR'S DEGREE      0       civil partnership    1      F    employee     0     15337.093   
   0     -2188.756445   41  secondary education    1       married          0      M    employee     0     23108.150 

I asked another question the other day that i tried to figure out how to apply to this:Link Here

But i am still struggling with how to take it from one df to the other in a loop with a replace. any assistance would be greatly appreciated!

1 Answer 1

1

You can take a way simpler approach like:

df['total_income'] = df.groupby('income_type')['total_income'].transform(lambda x: x.fillna(x.mean()))

That said, if you really want to create a second dataframe as you've done, you can leave it as pd.Series and use it to update the original df1 like this:

df2 = df1[['income_type','total_income']].copy()
income_mapper = df2.sort_values('income_type').groupby('income_type')['total_income'].mean()
df1.loc[df1.total_income.isna(), 'total_income'] = df1.loc[df1.total_income.isna(), 'income_type'].map(income_mapper)
Sign up to request clarification or add additional context in comments.

2 Comments

wow! thanks! i am all for simpler. i voted your answer up, but i don't have enough posts for it to show.
thanks for the additional info - but the only reason i created the second data frame in this instance was because i was thinking that having those values separate would make it easier to backfull into the first dataframe (obviously, that was not true. :D)

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.