Pandas: How to fill in null values from columns in another dataframe?

Question

I have a dataframe in which some essential columns contain nulls (I need those columns for further machine learning work). I have another dataframe with similar data, from which I want to pull in the missing values.

For example, df1 is the main dataframe:

id     col1    col2     col3     col4    col5
1      A       AA       100      5.0     0.9
2      A       BB       150      4.2     0.5
3      A       CC       100      NaN     NaN
4      B       AA       300      NaN     NaN
5      B       BB       100      NaN     NaN
6      C       BB       50       3.4     0.6

The dataframe from which I want to fill the NaN values in col4 and col5 could look like this:

id     col1    col3     col4    col5
100      A     100      4.5     1.0
101      A     100      3.5     0.8
103      B     300      5.0     0.5
105      B     300      5.5     0.8
106      B     100      5.3     0.2
107      C     100      3.0     1.2

So, I don't have col2 in the second df, and there are duplicates for the col1 and col3 columns that I can merge by. I therefore have to choose the row with the maximum col4 value to fill the corresponding values in df1.

For example, the correct value for df1 after filling in the data would be:

id     col1    col2     col3     col4    col5
1      A       AA       100      5.0     0.9
2      A       BB       150      4.2     0.5
3      A       CC       100      4.5     1.0
4      B       AA       300      5.5     0.8
5      B       BB       100      5.3     0.2
6      C       BB       50       3.4     0.6

How would I do that?

Does the maximum value in col5 always occur in the same rows as the maximum values in col4?
– unutbu, Commented Oct 27, 2016 at 8:33

piRSquared · Accepted Answer · 2016-10-27 07:38:06Z


If I understand correctly:

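# Keep, for each (col1, col3) pair, the df2 row with the largest col4.
best = df2.drop(columns='id').loc[df2.groupby(['col1', 'col3'])['col4'].idxmax()]

# Left-merge onto df1, keep only the looked-up col4/col5, fill df1's NaNs,
# then restore df1's column order (combine_first can reorder columns).
df1.combine_first(
    df1.merge(best, on=['col1', 'col3'], how='left', suffixes=('_', ''))[['col4', 'col5']]
).reindex(columns=df1.columns)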
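
The one-liner above can be unpacked into separate steps. The following is a minimal sketch on the question's data, assuming a recent pandas (where `.loc` replaces the older `.ix`); the intermediate names `best_idx`, `best_rows`, `lookup`, and `filled` are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'col1': ['A', 'A', 'A', 'B', 'B', 'C'],
    'col2': ['AA', 'BB', 'CC', 'AA', 'BB', 'BB'],
    'col3': [100, 150, 100, 300, 100, 50],
    'col4': [5.0, 4.2, np.nan, np.nan, np.nan, 3.4],
    'col5': [0.9, 0.5, np.nan, np.nan, np.nan, 0.6],
})
df2 = pd.DataFrame({
    'id': [100, 101, 103, 105, 106, 107],
    'col1': ['A', 'A', 'B', 'B', 'B', 'C'],
    'col3': [100, 100, 300, 300, 100, 100],
    'col4': [4.5, 3.5, 5.0, 5.5, 5.3, 3.0],
    'col5': [1.0, 0.8, 0.5, 0.8, 0.2, 1.2],
})

# Step 1: for each (col1, col3) group, keep the whole df2 row with the largest col4.
best_idx = df2.groupby(['col1', 'col3'])['col4'].idxmax()
best_rows = df2.loc[best_idx, ['col1', 'col3', 'col4', 'col5']]

# Step 2: left-merge on the keys; df1's own col4/col5 receive the '_df1' suffix,
# so the unsuffixed col4/col5 are the values looked up from df2.
lookup = df1.merge(best_rows, on=['col1', 'col3'], how='left',
                   suffixes=('_df1', ''))[['col4', 'col5']]

# Step 3: fill only the NaNs in df1, then restore df1's column order.
filled = df1.combine_first(lookup)[df1.columns]
print(filled)
```

Because `best_rows` holds at most one row per (col1, col3) pair, the left merge returns exactly one row per df1 row, which is what lets `combine_first` line the two frames up by index. This relies on df1 having its default 0..n-1 index.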



unutbu · Accepted Answer · 2016-10-27 11:10:10Z

import numpy as np
import pandas as pd
nan = np.nan

df1 = pd.DataFrame({
    'col1': ['A', 'A', 'A', 'B', 'B', 'C'],
    'col2': ['AA', 'BB', 'CC', 'AA', 'BB', 'BB'],
    'col3': [100, 150, 100, 300, 100, 50],
    'col4': [5.0, 4.2, nan, nan, nan, 3.4],
    'col5': [0.9, 0.5, nan, nan, nan, 0.6],
    'id': [1, 2, 3, 4, 5, 6]})
df2 = pd.DataFrame({
    'col1': ['A', 'A', 'B', 'B', 'B', 'C'],
    'col3': [100, 100, 300, 300, 100, 100],
    'col4': [4.5, 3.5, 5.0, 5.5, 5.3, 3.0],
    'col5': [1.0, 99, 0.5, 0.8, 0.2, 1.2],
    'id': [100, 101, 103, 105, 106, 107]})

df2_max = df2.drop('id', axis=1).groupby(['col1','col3']).max()
df3 = pd.merge(df1[['col1','col3']], df2_max, 
               left_on=['col1','col3'], right_index=True, how='left')
result = df1.combine_first(df3)

yields

  col1 col2  col3  col4  col5  id
0    A   AA   100   5.0   0.9   1
1    A   BB   150   4.2   0.5   2
2    A   CC   100   4.5  99.0   3
3    B   AA   300   5.5   0.8   4
4    B   BB   100   5.3   0.2   5
5    C   BB    50   3.4   0.6   6

First, find the max of df2's col4 and col5 columns for each (col1, col3) pair:

df2_max = df2.drop('id', axis=1).groupby(['col1','col3']).max()
#            col4  col5
# col1 col3            
# A    100    4.5  99.0
# B    100    5.3   0.2
#      300    5.5   0.8
# C    100    3.0   1.2

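Notice the 99.0 in the first row. I changed the example slightly (col5 for id 101 is 99 instead of 0.8) to show that the col4 maximum need not occur in the same row as the col5 maximum. Because max() is computed per column, the 4.5 and the 99.0 come from different df2 rows.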
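
To make that difference concrete, here is a small sketch comparing the column-wise `max()` used in this answer with a row-wise selection via `idxmax()`, which keeps col5 from the same row as the largest col4. The names `per_column_max` and `row_of_max_col4` are illustrative; which behaviour is wanted depends on the answer to the comment under the question.

```python
import pandas as pd

# df2 as modified in this answer: col5 for id 101 is 99 instead of 0.8.
df2 = pd.DataFrame({
    'id': [100, 101, 103, 105, 106, 107],
    'col1': ['A', 'A', 'B', 'B', 'B', 'C'],
    'col3': [100, 100, 300, 300, 100, 100],
    'col4': [4.5, 3.5, 5.0, 5.5, 5.3, 3.0],
    'col5': [1.0, 99, 0.5, 0.8, 0.2, 1.2],
})
keys = ['col1', 'col3']

# Column-wise maxima: col4 and col5 may come from different rows.
per_column_max = df2.drop(columns='id').groupby(keys).max()

# Row-wise choice: the single row holding the largest col4 in each group.
row_of_max_col4 = (df2.loc[df2.groupby(keys)['col4'].idxmax()]
                      .drop(columns='id')
                      .set_index(keys))

print(per_column_max.loc[('A', 100)].tolist())   # [4.5, 99.0]
print(row_of_max_col4.loc[('A', 100)].tolist())  # [4.5, 1.0]
```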

Next, merge df1 and df2_max, on df1's col1,col3 columns, and df2_max's index:

df3 = pd.merge(df1[['col1','col3']], df2_max, 
               left_on=['col1','col3'], right_index=True, how='left')
#   col1  col3  col4  col5
# 0    A   100   4.5  99.0
# 1    A   150   NaN   NaN
# 2    A   100   4.5  99.0
# 3    B   300   5.5   0.8
# 4    B   100   5.3   0.2
# 5    C    50   NaN   NaN

This provides us with the maximums in col4 and col5 in a DataFrame whose index matches df1's index. This allows us to use df1.combine_first to fill in the NaNs with values from df3:

result = df1.combine_first(df3)
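
The index alignment matters here, so a tiny illustration may help. This sketch uses hypothetical toy frames (`left` and `right`, not the data from the question) to show that `combine_first` matches values by index label rather than by position, and that values already present in the calling frame are kept.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames, not the data from the question.
left = pd.DataFrame({'b': [1.0, np.nan, np.nan], 'a': ['x', 'y', 'z']},
                    index=[10, 11, 12])
right = pd.DataFrame({'b': [7.0, 8.0, 9.0]}, index=[12, 11, 10])

out = left.combine_first(right)

# Label 10 keeps left's 1.0 even though right has 9.0 there;
# labels 11 and 12 were NaN in left and are filled from right by label.
print(out.loc[[10, 11, 12], 'b'].tolist())  # [1.0, 8.0, 7.0]

# combine_first may return columns in a different order, so reselect
# with the original columns if the order matters.
out = out[left.columns]
print(list(out.columns))  # ['b', 'a']
```

If df1 had a non-default index, the merge result (which gets a fresh 0..n-1 index) would no longer line up with it; resetting df1's index first, or assigning `df3.index = df1.index`, would be one way to keep the two aligned.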
