I have a dataframe in which some essential columns, which I need for further machine learning work, contain NULL values. I have another dataframe with similar data, from which I want to pull in the missing values.

For example, df1 is the main dataframe

id     col1    col2     col3     col4    col5
1      A       AA       100      5.0     0.9
2      A       BB       150      4.2     0.5
3      A       CC       100      NaN     NaN
4      B       AA       300      NaN     NaN
5      B       BB       100      NaN     NaN
6      C       BB       50       3.4     0.6

The second dataframe (df2), from which I want to fill the NaN values in col4 and col5, could look like this:

id     col1    col3     col4    col5
100      A     100      4.5     1.0
101      A     100      3.5     0.8
103      B     300      5.0     0.5
105      B     300      5.5     0.8
106      B     100      5.3     0.2
107      C     100      3.0     1.2

So, I don't have col2 in the second df, and there are duplicates for the col1 and col3 columns that I can merge on. I therefore have to choose the row with the maximum col4 value to fill the corresponding values in df1.

For example, the correct value for df1 after filling in the data would be:

id     col1    col2     col3     col4    col5
1      A       AA       100      5.0     0.9
2      A       BB       150      4.2     0.5
3      A       CC       100      4.5     1.0
4      B       AA       300      5.5     0.8
5      B       BB       100      5.3     0.2
6      C       BB       50       3.4     0.6

How would I do that?

  • Does the maximum value in col5 always occur in the same rows as the maximum values in col4? Commented Oct 27, 2016 at 8:33
  • @unutbu no not necessarily Commented Oct 27, 2016 at 8:35

2 Answers

IIUC

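# For each (col1, col3) group in df2, keep the row with the largest col4,
# merge it onto df1, and use it to fill only the NaNs in df1.
# (.ix, positional axis arguments, and reindex_axis no longer exist in current pandas.)
df1.combine_first(
    df1.merge(
        df2.drop(columns='id').loc[df2.groupby(['col1', 'col3']).col4.idxmax()],
        on=['col1', 'col3'], how='left', suffixes=('_', '')
    )[['col4', 'col5']]
).reindex(columns=df1.columns)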
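
The one-liner above packs several steps together. As a rough sketch, it can be unrolled as follows, using the data from the question; the intermediate names `best_idx`, `best`, `merged` and `filled` are mine, not the answerer's, and the code assumes df1 has its default 0..n-1 index so that the merged rows line up with it.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'col1': ['A', 'A', 'A', 'B', 'B', 'C'],
    'col2': ['AA', 'BB', 'CC', 'AA', 'BB', 'BB'],
    'col3': [100, 150, 100, 300, 100, 50],
    'col4': [5.0, 4.2, np.nan, np.nan, np.nan, 3.4],
    'col5': [0.9, 0.5, np.nan, np.nan, np.nan, 0.6],
})
df2 = pd.DataFrame({
    'id': [100, 101, 103, 105, 106, 107],
    'col1': ['A', 'A', 'B', 'B', 'B', 'C'],
    'col3': [100, 100, 300, 300, 100, 100],
    'col4': [4.5, 3.5, 5.0, 5.5, 5.3, 3.0],
    'col5': [1.0, 0.8, 0.5, 0.8, 0.2, 1.2],
})

# Step 1: for each (col1, col3) group, find the index label of the row
# holding the largest col4, then keep just those rows (without df2's id).
best_idx = df2.groupby(['col1', 'col3'])['col4'].idxmax()
best = df2.loc[best_idx].drop(columns='id')

# Step 2: left-merge so there is exactly one result row per df1 row, in
# df1's order. df1's own col4/col5 get the '_' suffix; df2's keep their names.
merged = df1.merge(best, on=['col1', 'col3'], how='left', suffixes=('_', ''))

# Step 3: combine_first keeps df1's values and fills only its NaNs;
# reindex restores df1's original column order.
filled = df1.combine_first(merged[['col4', 'col5']]).reindex(columns=df1.columns)
print(filled)
```

Because each (col1, col3) key appears only once in `best`, the left merge cannot duplicate df1 rows, which is what keeps the positional alignment in step 3 valid.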

import numpy as np
import pandas as pd
nan = np.nan

df1 = pd.DataFrame({'col1': ['A', 'A', 'A', 'B', 'B', 'C'], 'col2': ['AA', 'BB', 'CC', 'AA', 'BB', 'BB'], 'col3': [100, 150, 100, 300, 100, 50], 'col4': [5.0, 4.2, nan, nan, nan, 3.4], 'col5': [0.9, 0.5, nan, nan, nan, 0.6], 'id': [1, 2, 3, 4, 5, 6]})
df2 = pd.DataFrame({'col1': ['A', 'A', 'B', 'B', 'B', 'C'], 'col3': [100, 100, 300, 300, 100, 100], 'col4': [4.5, 3.5, 5.0, 5.5, 5.3, 3.0], 'col5': [1.0, 99, 0.5, 0.8, 0.2, 1.2], 'id': [100, 101, 103, 105, 106, 107]})

df2_max = df2.drop('id', axis=1).groupby(['col1','col3']).max()
df3 = pd.merge(df1[['col1','col3']], df2_max, 
               left_on=['col1','col3'], right_index=True, how='left')
result = df1.combine_first(df3)

yields

  col1 col2  col3  col4  col5  id
0    A   AA   100   5.0   0.9   1
1    A   BB   150   4.2   0.5   2
2    A   CC   100   4.5  99.0   3
3    B   AA   300   5.5   0.8   4
4    B   BB   100   5.3   0.2   5
5    C   BB    50   3.4   0.6   6

First, find the max of df2's col4 and col5 columns for each value of col1,col3:

df2_max = df2.drop('id', axis=1).groupby(['col1','col3']).max()
#            col4  col5
# col1 col3            
# A    100    4.5  99.0
# B    100    5.3   0.2
#      300    5.5   0.8
# C    100    3.0   1.2

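Notice the 99.0 in the first row. I changed the example slightly (df2's col5 value for id 101 is 99 instead of 0.8) to show that the col4 maximum need not occur in the same row as the col5 maximum. Because max() is applied to each column independently, the 4.5 in col4 and the 99.0 in col5 come from different rows of df2.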
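
If col5 should instead come from the same row as the col4 maximum, which is how the expected output in the question reads, one possible variant is to select whole rows with idxmax rather than taking column-wise maxima. This is a sketch under that assumption, not part of the original answer; `df2_best` is a hypothetical name for a drop-in replacement for df2_max.

```python
import pandas as pd

# df2 as modified in this answer (col5 is 99 for id 101).
df2 = pd.DataFrame({
    'col1': ['A', 'A', 'B', 'B', 'B', 'C'],
    'col3': [100, 100, 300, 300, 100, 100],
    'col4': [4.5, 3.5, 5.0, 5.5, 5.3, 3.0],
    'col5': [1.0, 99, 0.5, 0.8, 0.2, 1.2],
    'id': [100, 101, 103, 105, 106, 107],
})

# Column-wise maxima: col4 and col5 may come from different rows.
df2_max = df2.drop(columns='id').groupby(['col1', 'col3']).max()

# Row-wise selection: keep the entire row that holds each group's largest col4,
# then index it by (col1, col3) so it has the same shape as df2_max.
best_rows = df2.loc[df2.groupby(['col1', 'col3'])['col4'].idxmax()]
df2_best = best_rows.drop(columns='id').set_index(['col1', 'col3'])

print(df2_max.loc[('A', 100), 'col5'])   # 99.0 (largest col5 in the group)
print(df2_best.loc[('A', 100), 'col5'])  # 1.0  (col5 of the row where col4 is 4.5)
```

Since `df2_best` is indexed by (col1, col3) just like df2_max, it can be passed to the same pd.merge call with right_index=True.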

Next, merge df1 and df2_max, on df1's col1,col3 columns and df2_max's index:

df3 = pd.merge(df1[['col1','col3']], df2_max, 
               left_on=['col1','col3'], right_index=True, how='left')
#   col1  col3  col4  col5
# 0    A   100   4.5  99.0
# 1    A   150   NaN   NaN
# 2    A   100   4.5  99.0
# 3    B   300   5.5   0.8
# 4    B   100   5.3   0.2
# 5    C    50   NaN   NaN

This provides us with the maximums in col4 and col5 in a DataFrame whose index matches df1's index. This allows us to use df1.combine_first to fill in the NaNs with values from df3:

result = df1.combine_first(df3)
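
To make the behaviour of combine_first concrete, here is a tiny standalone sketch with made-up frames (`left` and `right` are illustrative names, not from the question). It shows that existing values win, that only NaNs are filled, and that a reindex can restore the caller's column order if combine_first returns the columns in a different one.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'col4': [5.0, np.nan, np.nan]})
right = pd.DataFrame({'col4': [9.9, 4.5, np.nan]})

# Row 0: left already has 5.0, so right's 9.9 is ignored.
# Row 1: left is NaN, so right's 4.5 is used.
# Row 2: both are NaN, so the result stays NaN.
out = left.combine_first(right).reindex(columns=left.columns)
print(out)
```

The two frames are aligned on their index, so this only does what is intended when both share the same row labels, as df1 and df3 do here.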
