10

I have a DataFrame with column multi-index:

System   A                B
Trial    Exp1    Exp2     Exp1    Exp2
1        NaN     1        2       3
2        4       5        NaN     NaN
3        6       NaN      7       8

Turns out for each system (A, B) and each measurement (1, 2, 3 in index), results from Exp1 is always superior to Exp2. So I want to generate a 3rd column for each system, call it Final, that should take Exp1 whenever available, and default to Exp2 otherwise. The desired result is

System   A                       B
Trial    Exp1    Exp2    Final   Exp1    Exp2    Final
1        NaN     1       1       2       3       2
2        4       5       4       NaN     NaN     NaN
3        6       NaN     6       7       8       7

What is the best way to do this?

I've tried to use groupby on the columns:

grp = df.groupby(level=0, axis=1)

And was thinking of using either transform or apply combined by assign to achieve it. But am not able to find either a working or an efficient way of doing it. Specifically I am avoiding native python for loops for efficiency reasons (else the problem is trivial).

3 Answers 3

7

Use stack for reshape, add column with fillna and then reshape back by unstack with swaplevel + sort_index:

df = df.stack(level=0)
df['Final'] = df['Exp1'].fillna(df['Exp1'])
df = df.unstack().swaplevel(0,1,axis=1).sort_index(axis=1)
print (df)
System    A               B           
Trial  Exp1 Exp2 Final Exp1 Exp2 Final
1       NaN  1.0   NaN  2.0  3.0   2.0
2       4.0  5.0   4.0  NaN  NaN   NaN
3       6.0  NaN   6.0  7.0  8.0   7.0

Another solution with xs for select DataFrames, create new DataFrame by combine_first, but there is missing second level - was added by MultiIndex.from_product and last concat both DataFrames together:

a = df.xs('Exp1', axis=1, level=1)
b = df.xs('Exp2', axis=1, level=1)
df1 =  a.combine_first(b)
df1.columns = pd.MultiIndex.from_product([df1.columns, ['Final']])
df = pd.concat([df, df1], axis=1).sort_index(axis=1)
print (df)
System    A               B           
Trial  Exp1 Exp2 Final Exp1 Exp2 Final
1       NaN  1.0   1.0  2.0  3.0   2.0
2       4.0  5.0   4.0  NaN  NaN   NaN
3       6.0  NaN   6.0  7.0  8.0   7.0

Similar solution with rename:

a = df.xs('Exp1', axis=1, level=1, drop_level=False)
b = df.xs('Exp2', axis=1, level=1, drop_level=False)
df1 = a.rename(columns={'Exp1':'Final'}).combine_first(b.rename(columns={'Exp2':'Final'}))
df = pd.concat([df, df1], axis=1).sort_index(axis=1)
print (df)
System    A               B           
Trial  Exp1 Exp2 Final Exp1 Exp2 Final
1       NaN  1.0   1.0  2.0  3.0   2.0
2       4.0  5.0   4.0  NaN  NaN   NaN
3       6.0  NaN   6.0  7.0  8.0   7.0
Sign up to request clarification or add additional context in comments.

2 Comments

This is pure genius! Thank you!
Glad can help, I add explanation in second.
2
  • stack with your first level of the column index stack(0) leaving ['Exp1', 'Exp2'] in the column index
  • Use a lambda function that gets applied to the whole dataframe within an assign call.
  • Finally, unstack, swaplevel, sort_index to clean it up and put everything where it belongs.

f = lambda x: x.Exp1.fillna(x.Exp2)
df.stack(0).assign(Final=f).unstack() \
    .swaplevel(0, 1, 1).sort_index(1)

     A               B           
  Exp1 Exp2 Final Exp1 Exp2 Final
1  NaN  1.0   1.0  2.0  3.0   2.0
2  4.0  5.0   4.0  NaN  NaN   NaN
3  6.0  NaN   6.0  7.0  8.0   7.0

Another concept using xs

d1 = df.xs('Exp1', 1, 1).fillna(df.xs('Exp2', 1, 1))
d1.columns = [d1.columns, ['Final'] * len(d1.columns)]
pd.concat([df, d1], axis=1).sort_index(1)


     A               B           
  Exp1 Exp2 Final Exp1 Exp2 Final
1  NaN  1.0   1.0  2.0  3.0   2.0
2  4.0  5.0   4.0  NaN  NaN   NaN
3  6.0  NaN   6.0  7.0  8.0   7.0

2 Comments

Thank you! I had to choose one answer...sorry. And I like your assign() very much! Really appreciate it.
stack / unstack is inefficient for large dataframes
0

doesnt feel super optimal but try this :

for system in df.columns.levels[0]:
    df[(system, 'final')] = df[(system, 'Exp1')].fillna(df[(system, 'Exp2')])

2 Comments

Thx @Steven, though I forgot to mention I was able to accomplish this using loops. But I'm trying to explicitly shun away from for statements as it beats the purpose of Pandas...
Don't forget sort_index, as the new columns will be appended to the very last.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.