0

I have some nan cells in the year column. I thought it'll be better to set the Mode of the year column grouped-By Industry type.

df
ID  Name     Industry    Year   Employees Expenses   Profit
1   E-Zim    Health      2019   320       1,130,700  8553827
2   Daltfase Software    NaN    78        804,035    13212508
3   Hotlane  Government  2012   87        1,044,375  8701897
4   Latho    Health      NaN    103       4,631,808  10727561
5   Lambam   IT Services 2015   210       4,374,841  4193069
6   Quozap   Health      2008   21        4,626,275  8179177
7   Tampware Health      2008   13        2,127,984  3259485

For mode values, i did:

df_mode_year = df.groupby('Industry')['Year'].apply(lambda x: x.mode().iloc[0])
df_mode_year

Industry
Government             2012
Health                 2008
IT Services            2015
Software

And then to modify my df, i tried

  1. df['Year'].fillna(df_mode_year)
  2. df['Year'] = df['Year'].fillna(df_mode_year[df['Industry']=='Health'])

But none of these two are affecting the final df .

Expected Output:

df
ID  Name     Industry    Year   Employees Expenses   Profit
1   E-Zim    Health      2019   320       1,130,700  8553827
2   Daltfase Software    NaN    78        804,035    13212508
3   Hotlane  Government  2012   87        1,044,375  8701897
4   Latho    Health      2008   103       4,631,808  10727561
5   Lambam   IT Services 2012   210       4,374,841  4193069
6   Quozap   Health      2008   21        4,626,275  8179177
7   Tampware Health      2008   13        2,127,984  3259485

What am i doing wrong? Thanks a lot.

2 Answers 2

1

You can use .transform() instead of .apply(), which will return a series of the same length as df, with the mode of each group

mode = df.groupby('Industry').Year
mode = mode.transform(lambda x: x.mode().squeeze())
df.update(mode, overwrite=False)

.update() operations are always done inplace and return None


Edit

The code in the first section works fine if all Industries have at least one year, but it seems .transform(lambda x: x.mode().squeeze()) raises an error when it encounters a group with only NaN as values. If this is a possibility in the data, then the easiest is to take your approach to summarize mode by Industry:

mode = df.groupby('Industry').Year.apply(lambda x: x.mode().squeeze())

The problem now is that .squeeze() returns an empty Series and .iloc[0] (as in your original code) raises an exception.

>>> print(mode.apply(type))

Industry
Government                 <class 'numpy.float64'>
Health                     <class 'numpy.float64'>
IT Services                <class 'numpy.float64'>
Software       <class 'pandas.core.series.Series'>
Name: Year, dtype: object

>>> mode = df.groupby('Industry').Year.apply(lambda x: x.mode().iloc[0])
Traceback ...
IndexError: single positional indexer is out-of-bounds

So now we need to fix that Software value. Since it is an iterable, it cannot be replaced, but it can be iterated upon:

mode = mode.explode()

now mode is just like we need it

>>> print(mode)
Industry
Government     2012
Health         2008
IT Services    2015
Software        NaN
Name: Year, dtype: object

And for the final trick of the evening:

df = df.merge(mode, on='Industry') # default suffixes: ['_x', '_y']
mode = df.pop('Year_y').rename('Year')
df.rename({'Year_x': 'Year'}, axis=1, inplace=True)
df.update(mode, overwrite=False)

Final Output

   ID      Name     Industry  Year  Employees   Expenses    Profit
0   1     E-Zim       Health  2019        320  1,130,700   8553827
1   4     Latho       Health  2008        103  4,631,808  10727561
2   6    Quozap       Health  2008         21  4,626,275   8179177
3   7  Tampware       Health  2008         13  2,127,984   3259485
4   2  Daltfase     Software   NaN         78    804,035  13212508
5   3   Hotlane   Government  2012         87  1,044,375   8701897
6   5    Lambam  IT Services  2015        210  4,374,841   4193069

Notice how Software remains NaN and health updated only NaNs

Sign up to request clarification or add additional context in comments.

1 Comment

Thanks for the insight. The inner groupby is throwing an Error saying: Length of passed values is 1, index implies 50
0

I would try pivot and melt. (You may need to convert df_mode_year to a map first)

df1 = df.pivot(index = 'ID', columns = 'Industry', values = 'Year').reset_index(drop = True)
df1 = df1.fillna(df_mode_year)
df1 = df1.melt(id_vars = 'ID', var_name='Industry', value_name='Year')

You can then concat this df1 with other columns from df to get the desired table.

2 Comments

Thanks for the reply. Running this i do get an error: KeyError: 'ID' . What map are you referring to!
wouldn't this create a full all Industries by all IDs matrix and increase the number of rows on melt?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.