30

I am trying to impute (fill) missing values using other rows that share the same values in the key columns.

For example, I have this dataframe:

one | two | three
1      1     10
1      1     nan
1      1     nan
1      2     nan
1      2     20
1      2     nan
1      3     nan
1      3     nan

I want to group rows by the keys in columns one and two. If column three is not entirely NaN within a group, the missing values should be filled with the existing value from a row with the same keys.

Here is my desired result:

one | two | three
1      1     10
1      1     10
1      1     10
1      2     20
1      2     20
1      2     20
1      3     nan
1      3     nan

You can see that the group with keys 1 and 3 stays NaN because it has no existing value to fill from.

I have tried using groupby+fillna():

df['three'] = df.groupby(['one','two'])['three'].fillna()

which gave me an error.

I have also tried a forward fill, which gave me a rather strange result: it forward-filled column two instead. This is the code I used for the forward fill:

df['three'] = df.groupby(['one','two'], sort=False)['three'].ffill()

2 Answers

72

If there is only one non-NaN value per group, use ffill (forward fill) and bfill (backward fill) per group, which requires apply with a lambda:

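# group_keys=False keeps the original index so the result aligns on assignment (pandas 2.x)
df['three'] = (df.groupby(['one','two'], sort=False, group_keys=False)['three']
                 .apply(lambda x: x.ffill().bfill()))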
print (df)
   one  two  three
0    1    1   10.0
1    1    1   10.0
2    1    1   10.0
3    1    2   20.0
4    1    2   20.0
5    1    2   20.0
6    1    3    NaN
7    1    3    NaN
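
As a hedged alternative to apply, the same per-group fill can be sketched with transform, which always returns a result indexed like the input. This is a minimal sketch, not part of the original answer; the sample frame is rebuilt from the question and the name `filled` is only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "one": [1] * 8,
    "two": [1, 1, 1, 2, 2, 2, 3, 3],
    "three": [10, np.nan, np.nan, np.nan, 20, np.nan, np.nan, np.nan],
})

# transform runs the lambda once per group and keeps the original index,
# so both ffill and bfill stay inside each (one, two) group
filled = (
    df.groupby(["one", "two"], sort=False)["three"]
      .transform(lambda s: s.ffill().bfill())
)
df["three"] = filled
print(df)
```

The all-NaN group (1, 3) has nothing to fill from, so it stays NaN.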

If a group has multiple values and the NaNs should be replaced by a single per-group constant, e.g. the group mean:

print (df)
   one  two  three
0    1    1   10.0
1    1    1   40.0
2    1    1    NaN
3    1    2    NaN
4    1    2   20.0
5    1    2    NaN
6    1    3    NaN
7    1    3    NaN

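# group_keys=False keeps the original index so the result aligns on assignment (pandas 2.x)
df['three'] = (df.groupby(['one','two'], sort=False, group_keys=False)['three']
                 .apply(lambda x: x.fillna(x.mean())))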
print (df)
   one  two  three
0    1    1   10.0
1    1    1   40.0
2    1    1   25.0
3    1    2   20.0
4    1    2   20.0
5    1    2   20.0
6    1    3    NaN
7    1    3    NaN
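
The group-mean fill can also be sketched without a lambda, by computing the per-group mean with transform('mean') and passing it to fillna. This is an illustrative variant under the same sample data, not the original answer's code; `group_mean` is a name introduced here.

```python
import numpy as np
import pandas as pd

# Second sample frame from the answer: group (1, 1) has two known values
df = pd.DataFrame({
    "one": [1] * 8,
    "two": [1, 1, 1, 2, 2, 2, 3, 3],
    "three": [10, 40, np.nan, np.nan, 20, np.nan, np.nan, np.nan],
})

# One mean per row, broadcast from the row's (one, two) group;
# an all-NaN group has a NaN mean, so its rows stay NaN
group_mean = df.groupby(["one", "two"])["three"].transform("mean")

# fillna with a Series fills each NaN from the matching index label
df["three"] = df["three"].fillna(group_mean)
print(df)
```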

12 Comments

@jezrael: is there a reason your answer needs apply? I ask because I tried chaining ffill and bfill directly and it returned the correct result: df['three'] = df.groupby(['one', 'two'])['three'].ffill().bfill()
@Andy L. It works here only because the all-NaN group is the last one. If you change the sample data so that the first group is all NaN (10 to NaN), your solution fails. The reason is that the trailing bfill runs not per group but on the whole Series returned by groupby + ffill.
Ah, I forgot that bfill back-fills the output Series from ffill, not the groupby. Thanks for the answer.
May I ask how I can apply df['three'] = df.groupby(['one','two'], sort=False)['three'].apply(lambda x: x.ffill().bfill()) to multiple columns (three, four, five, etc.) instead of only three, still grouping by one and two?
@ahbon - Use cols = ['three','four','five'] and df[cols] = df.groupby(['one','two'], sort=False)[cols].apply(lambda x: x.ffill().bfill())
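
The two points raised in the comments above can be checked with a small sketch: the chained ffill().bfill() leaking across groups when the first group is all NaN, and a per-group fill over several columns. The column `four` and its values are made up for illustration, and transform is used here in place of the apply call from the comment.

```python
import numpy as np
import pandas as pd

# First group (1, 1) is all NaN in 'three'; 'four' is a hypothetical extra column
df = pd.DataFrame({
    "one": [1] * 8,
    "two": [1, 1, 1, 2, 2, 2, 3, 3],
    "three": [np.nan, np.nan, np.nan, np.nan, 20, np.nan, np.nan, np.nan],
    "four": [np.nan, 5, np.nan, np.nan, np.nan, np.nan, 7, np.nan],
})
keys = ["one", "two"]
cols = ["three", "four"]

# Chained version: ffill is per group, but bfill runs on the whole
# resulting Series, so 20 from group (1, 2) leaks into group (1, 1)
chained = df.groupby(keys, sort=False)["three"].ffill().bfill()

# Per-group version over several columns: both fills stay inside each group
per_group = df.groupby(keys, sort=False)[cols].transform(lambda s: s.ffill().bfill())

print(chained)
print(per_group)
```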
2

You can sort the data by the column with missing values, then group by and forward fill:

df.sort_values('three', inplace=True)
df['three'] = df.groupby(['one','two'])['three'].ffill()
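
A variant of this idea that leaves the row order untouched might look like the sketch below. It relies on sort_values placing NaN last by default and on column assignment aligning by index; the name `filled` is illustrative. Note the assumption that each group has at most one known value: with several, the NaN rows receive the largest one, since it comes last before the NaNs after sorting.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "one": [1] * 8,
    "two": [1, 1, 1, 2, 2, 2, 3, 3],
    "three": [10, np.nan, np.nan, np.nan, 20, np.nan, np.nan, np.nan],
})

# Sort a temporary copy (NaN goes last), then forward fill within each group
filled = df.sort_values("three").groupby(["one", "two"])["three"].ffill()

# Assignment aligns on the index, so df keeps its original row order
df["three"] = filled
print(df)
```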
