Pandas Multiindex Rows and Columns: Replace NaN with Value from Matching Row

Question

Given the following:

import pandas as pd
import numpy as np
df=pd.DataFrame({'County':['A','B','A','B','A','B','A','B','A','B'],
                'Hospital':['a','b','e','f','i','j','m','n','b','r'],
                'Enrollment':[44,55,95,54,81,54,89,76,1,67],
                'Year':['2012','2012','2012','2012','2012','2013',
                        '2013','2013','2013','2013']})
d2=pd.pivot_table(df,index=['County','Hospital'],columns=['Year'])#.sort_columns

d2

        Enrollment
       Year     2012    2013
County Hospital         
A       a       44.0    NaN
        b       NaN     1.0
        e       95.0    NaN
        i       81.0    NaN
        m       NaN     89.0
B       b       55.0    NaN
        f       54.0    NaN
        j       NaN     54.0
        n       NaN     76.0
        r       NaN     67.0

If a hospital such as 'b' exists more than once and it has no data for the previous year (the first occurrence of 'b'), I'd like to assign the previous year value of Enrollment for the other row ('b') and delete the row of 'b' that does not contain data for the first year like this:

        Enrollment
       Year     2012    2013
County Hospital         
A       a       44.0    NaN
        b       55.0    1.0
        e       95.0    NaN
        i       81.0    NaN
        m       NaN     89.0
B       f       54.0    NaN
        j       NaN     54.0
        n       NaN     76.0
        r       NaN     67.0

So far, I can identify the duplicate row and delete, but I'm just stuck on replacing NaN with a value where needed:

Identify duplicate hospitals after resetting the index:

d2=d2.reset_index()    
d2['dup']=d2.duplicated('Hospital',keep=False)

Flag, for deletion, the duplicate hospital with no data for the most recent year:

 Hospital=d2.columns.levels[0][1]

 Y1=d2.columns.levels[1][0]

 Y2=d2.columns.levels[1][1]

 d2['Delete']=np.nan

 d2.loc[(pd.isnull(d2.Enrollment[Y2]))&(d2['dup']==True),'Delete']='Yes'

Keep all but rows to delete:
```
 d2=d2.loc[d2['Delete']!='Yes']
```

unutbu · Accepted Answer · 2016-06-21 17:43:06Z

3

If I understand correctly, the problem is to copy values from County B into County A when the Hospital matches. That can be done with groupby/fillna(method='bfill'). The bfill method backfills NaNs with the closest succeeding non-NaN value.

Then, you can use d2.drop_duplicates(subset=['Hospital'], keep='first') to keep the first row when the Hospital matches.

For example,

import pandas as pd

df = pd.DataFrame({'County': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
                   'Hospital': ['a', 'b', 'e', 'f', 'i', 'j', 'm', 'n', 'b', 'r'],
                   'Enrollment': [44, 55, 95, 54, 81, 54, 89, 76, 1, 67],
                   'Year': ['2012', '2012', '2012', '2012', '2012', '2013',
                            '2013', '2013', '2013', '2013']})
d2 = pd.pivot_table(df, index=['County', 'Hospital'], columns=['Year'])  
d2 = d2.groupby(level='Hospital').fillna(method='bfill')
d2 = d2.reset_index()
d2 = d2.drop_duplicates(subset=['Hospital'], keep='first')

yields

     County Hospital Enrollment      
Year                       2012  2013
0         A        a       44.0   NaN
1         A        b       55.0   1.0
2         A        e       95.0   NaN
3         A        i       81.0   NaN
4         A        m        NaN  89.0
6         B        f       54.0   NaN
7         B        j        NaN  54.0
8         B        n        NaN  76.0
9         B        r        NaN  67.0

answered Jun 21, 2016 at 17:43

unutbu

886k197 gold badges1.9k silver badges1.7k bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Dance Party2 Over a year ago

Close, but the criteria for the hospital row to keep is based on which one lacks data for the first year and has data for the second year (the one meeting those two criteria is the one to keep).

piRSquared · Accepted Answer · 2016-06-21 16:40:42Z

1

manipulate d2 to put A and B side by side.

e = d2.unstack(0).swaplevel(1, 2, 1).sort_index(1).Enrollment
print e

County       A           B      
Year      2012  2013  2012  2013
Hospital                        
a         44.0   NaN   NaN   NaN
b          NaN   1.0  55.0   NaN
e         95.0   NaN   NaN   NaN
f          NaN   NaN  54.0   NaN
i         81.0   NaN   NaN   NaN
j          NaN   NaN   NaN  54.0
m          NaN  89.0   NaN   NaN
n          NaN   NaN   NaN  76.0
r          NaN   NaN   NaN  67.0

create an apply function to assign values from B and nullify B afterwards.

def manipulate_rows(row):
    if pd.notnull(row.loc['A'].iloc[1]) & pd.isnull(row.loc['A'].iloc[0]):
        row.A = row.A.combine_first(row.B)
        row.B = np.nan
    return row

d3 = e.apply(manipulate_rows, axis=1).stack(0).swaplevel(0, 1).sort_index()

Stacking will naturally drop the missing values after the pivot.

Reassign the d2 columns

d3.columns = d2.columns

print d3

                Enrollment      
Year                  2012  2013
County Hospital                 
A      a              44.0   NaN
       b              55.0   1.0
       e              95.0   NaN
       i              81.0   NaN
       m               NaN  89.0
B      f              54.0   NaN
       j               NaN  54.0
       n               NaN  76.0
       r               NaN  67.0

answered Jun 21, 2016 at 16:40

piRSquared

296k68 gold badges509 silver badges654 bronze badges

4 Comments

piRSquared Over a year ago

that almost certainly is coming from the .swaplevel(1, 2, 1) portion. After the .unstack(0). It implies that the dataframe you called this on, only had one level in its index which means you called it on the wrong dataframe or it is different than what was represented in the question.

Dance Party2 Over a year ago

You're right; I forgot that I had added reset_index() to my original code. About row.loc['A'] in the function definition; is there a way to avoid referring to the label itself? The duplicate row may occur in any of n number of groups.

piRSquared Over a year ago

At the moment, I think I'd have to rework the logic. row.loc['A'] is taking advantage of how loc processes MultiIndex. An immediate work around would be to pass the column names: def manipulate_row(row, to='A', from='B') or something similar.

Dance Party2 Over a year ago

Is it possible to just do a groupby and max on the original DF to replace the NaN?

Collectives™ on Stack Overflow

Pandas Multiindex Rows and Columns: Replace NaN with Value from Matching Row

2 Answers 2

1 Comment

4 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

1 Comment

4 Comments

Your Answer

Sign up or log in

Post as a guest

Related