
I'm trying to get the first non-null value from multiple pandas Series in a DataFrame.

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [2, np.nan, np.nan, np.nan],
                   'b': [np.nan, 5, np.nan, np.nan],
                   'c': [np.nan, 55, 13, 14],
                   'd': [np.nan, np.nan, np.nan, 4],
                   'e': [12, np.nan, np.nan, 22],
                   })

     a    b     c    d     e
0  2.0  NaN   NaN  NaN  12.0
1  NaN  5.0  55.0  NaN   NaN
2  NaN  NaN  13.0  NaN   NaN
3  NaN  NaN  14.0  4.0  22.0

In this df I want to create a new column 'f' and set it equal to 'a' where 'a' is not null, otherwise 'b' where 'b' is not null, and so on down to 'e'.

I could write a bunch of nested np.where statements, which is inefficient:

df['f'] = np.where(df.a.notnull(), df.a,
              np.where(df.b.notnull(), df.b,
                   np.where(df.c.notnull(), df.c,
                        np.where(df.d.notnull(), df.d, df.e))))

I looked into doing df.a or df.b or df.c etc.
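
That fails outright, because Python's or needs a single truth value per operand and a Series doesn't have one; a quick sketch of the error:

try:
    df['f'] = df.a or df.b or df.c
except ValueError as e:
    print(e)  # "The truth value of a Series is ambiguous. ..."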

The result should look like this:

     a    b     c    d     e   f
0  2.0  NaN   NaN  NaN  12.0   2
1  NaN  5.0  55.0  NaN   NaN   5
2  NaN  NaN  13.0  NaN   NaN  13
3  NaN  NaN  14.0  4.0  22.0  14

3 Answers


One solution:

df.groupby(['f']*df.shape[1], axis=1).first()
Out[385]: 
      f
0   2.0
1   5.0
2  13.0
3  14.0
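
To get the new column onto the original frame, the result can be assigned back; the repeated group key 'f' becomes the resulting column name (a sketch; note that axis=1 in groupby is deprecated in recent pandas versions):

df['f'] = df.groupby(['f'] * df.shape[1], axis=1).first()['f']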

The other:

df.bfill(1)['a']
Out[388]: 
0     2.0
1     5.0
2    13.0
3    14.0
Name: a, dtype: float64
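
To land that in a new column 'f' as asked, just assign it back; using iloc instead of ['a'] avoids hard-coding the first column's name (a small sketch):

df['f'] = df.bfill(axis=1).iloc[:, 0]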

2 Comments

Nice, but no need for a numpy array: df.groupby(['f']*df.shape[1], axis=1).first(); I also name axis=1 because I don't know the order of the arguments of groupby by heart :)
df.bfill(1)['a'] seems to be the most efficient though!!

You could also use first_valid_index

In [336]: df.apply(lambda x: x.loc[x.first_valid_index()], axis=1)
Out[336]:
0     2.0
1     5.0
2    13.0
3    14.0
dtype: float64

Or, stack and groupby

In [359]: df.stack().groupby(level=0).first()
Out[359]:
0     2.0
1     5.0
2    13.0
3    14.0
dtype: float64
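
This works because stack drops NaN by default, so grouping on the original row index (level=0) and taking first() returns the leftmost surviving value per row. The stacked intermediate looks roughly like this:

df.stack()
0  a     2.0
   e    12.0
1  b     5.0
   c    55.0
2  c    13.0
3  c    14.0
   d     4.0
   e    22.0
dtype: float64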

Or, first_valid_index with lookup

In [355]: df.lookup(df.index, df.apply(pd.Series.first_valid_index, axis=1))
Out[355]: array([ 2.,  5., 13., 14.])
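
Note that DataFrame.lookup was deprecated and has been removed in newer pandas releases; a roughly equivalent sketch using plain numpy indexing (the variable name cols is just illustrative):

cols = df.apply(pd.Series.first_valid_index, axis=1)
df.to_numpy()[np.arange(len(df)), df.columns.get_indexer(cols)]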

1 Comment

Note that df.first_valid_index() is to be used with df.loc (and not df.iloc).

You can also use numpy for this:

# position of the first non-NaN value in each row
first_valid = (~np.isnan(df.values)).argmax(1)

Then use indexing:

df.assign(valid=df.values[range(len(first_valid)), first_valid])

     a    b     c    d     e  valid
0  2.0  NaN   NaN  NaN  12.0    2.0
1  NaN  5.0  55.0  NaN   NaN    5.0
2  NaN  NaN  13.0  NaN   NaN   13.0
3  NaN  NaN  14.0  4.0  22.0   14.0

