MultiIndex and operations between particular groups in pandas and missing values

Question

Consider the following one-column table indexed by pairs (A, B):

       Value
A  B        
a1 b1      1
   b2      3
a2 b1     10
   b2     30
a3 b1    100

For a particular value of B, like b2, I would like to get a table that lists all its values together with an extra column that gives the difference between b2 and b1 if the corresponding value for b1 or b2 exists. So the result should be:

      b2  b2-b1
a1   3.0    2.0
a2  30.0   20.0
a3   NaN    NaN
a4  40.0    NaN

where NaN in both columns for a3 tells that values for both b1 and b2 are missing while NaN for a4 tells that only b1 is missing.

Initially I tried the solution based on Calculation between groups in a Pandas multiindex dataframe :

import numpy as np
import pandas as pd
import itertools

index = pd.MultiIndex.from_tuples([
    ('a1', 'b1'), 
    ('a1', 'b2'), 
    ('a2', 'b1'),
    ('a2', 'b2'),
    ('a2', 'b3'),
    ('a3', 'b1'),
    ('a4', 'b2'),
], names=['A', 'B'])

table = pd.DataFrame({'Value': [1,3, 10, 30, 31, 100, 40]}, index=index)
tmp = pd.concat([table, table.groupby(level='A')['Value'].transform(lambda x: x.loc[:, 'b2'] - x.loc[:, 'b1'])], axis=1)

b2 = tmp.groupby(level='B').get_group('b2')
b2.index = b2.index.droplevel(level='B')

print(b2)

But that gives a table without the a3 row (column names can be fixed with renames):

    Value  Value
A               
a1      3    2.0
a2     30   20.0
a4     40    NaN

The reason for this is that it seems the transform preserves the original index and the extra NaN are not inserted into the table, right? Is it possible to fix this?

So I tried an alternative with explicit diff between columns in groups:

import numpy as np
import pandas as pd
import itertools

index = pd.MultiIndex.from_tuples([
    ('a1', 'b1'), 
    ('a1', 'b2'), 
    ('a2', 'b1'),
    ('a2', 'b2'),
    ('a3', 'b1'),
    ('a4', 'b2'),
], names=['A', 'B'])

input_table = pd.DataFrame({'Value': [1,3, 10, 30, 100, 40]}, index=index)

grouped = input_table.groupby(level='B')
b1 = grouped.get_group('b1')
b1.index = b1.index.droplevel(level='B')

b2 = grouped.get_group('b2')
b2.index = b2.index.droplevel(level='B')
b2 = b2.rename(columns={'Value': 'b2'})

b2 = pd.concat([b2, b2['b2'] - b1['Value']], axis=1)
b2.rename(columns={0: 'b2-b1'}, inplace=True)

print(b2)

This works but it seems rather complicated with all that code to drop levels so the diff between columns work and separated column renames. Is it possible to make it simpler?

Zero · Accepted Answer · 2017-09-15 10:11:35Z

1

Using unstack to get B onto columns and assign to create b2-b1 and drop to remove unneeded columns.

In [1120]: table.unstack()['Value'].assign(b2b1=lambda x: x.b2-x.b1).drop(['b1', 'b3'], 1)
Out[1120]:
B     b2  b2b1
A
a1   3.0   2.0
a2  30.0  20.0
a3   NaN   NaN
a4  40.0   NaN

To get rid of index names use rename_axis

In [1123]: (table.unstack()['Value'].assign(b2b1=lambda x: x.b2-x.b1).drop(['b1', 'b3'], 1)
                 .rename_axis(None).rename_axis(None, 1))
Out[1123]:
      b2  b2b1
a1   3.0   2.0
a2  30.0  20.0
a3   NaN   NaN
a4  40.0   NaN

Or, you can store the unstack result

In [1127]: dff = table.unstack()['Value'].rename_axis(None).rename_axis(None, 1)

In [1128]: dff['b2-b1'] = dff['b2'] - dff['b1']

In [1129]: dff
Out[1129]:
       b1    b2    b3  b2-b1
a1    1.0   3.0   NaN    2.0
a2   10.0  30.0  31.0   20.0
a3  100.0   NaN   NaN    NaN
a4    NaN  40.0   NaN    NaN

In [1189]: dff[['b1', 'b2-b1']]  # get specific columns
Out[1189]:
       b1  b2-b1
a1    1.0    2.0
a2   10.0   20.0
a3  100.0    NaN
a4    NaN    NaN

Details

In [1124]: table.unstack()
Out[1124]:
    Value
B      b1    b2    b3
A
a1    1.0   3.0   NaN
a2   10.0  30.0  31.0
a3  100.0   NaN   NaN
a4    NaN  40.0   NaN

edited Sep 15, 2017 at 10:11

answered Sep 15, 2017 at 9:00

Zero

77.4k22 gold badges153 silver badges153 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

Igor Bukanov Over a year ago

Is it possible to avoid hardcoding the names of columns to drop? I.e. drop all but 'b1' columns? In my case the table can potentially have quite a few B values.

Igor Bukanov Over a year ago

Thanks! Now I got the usability of unstack :)

Collectives™ on Stack Overflow

MultiIndex and operations between particular groups in pandas and missing values

1 Answer 1

2 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

2 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related