Consider the following one-column table indexed by pairs (A, B):
Value
A B
a1 b1 1
b2 3
a2 b1 10
b2 30
a3 b1 100
For a particular value of B, like b2, I would like to get a table that lists all its values together with an extra column that gives the difference between b2 and b1 if the corresponding value for b1 or b2 exists. So the result should be:
b2 b2-b1
a1 3.0 2.0
a2 30.0 20.0
a3 NaN NaN
a4 40.0 NaN
where NaN in both columns for a3 tells that values for both b1 and b2 are missing while NaN for a4 tells that only b1 is missing.
Initially I tried the solution based on Calculation between groups in a Pandas multiindex dataframe :
import numpy as np
import pandas as pd
import itertools
index = pd.MultiIndex.from_tuples([
('a1', 'b1'),
('a1', 'b2'),
('a2', 'b1'),
('a2', 'b2'),
('a2', 'b3'),
('a3', 'b1'),
('a4', 'b2'),
], names=['A', 'B'])
table = pd.DataFrame({'Value': [1,3, 10, 30, 31, 100, 40]}, index=index)
tmp = pd.concat([table, table.groupby(level='A')['Value'].transform(lambda x: x.loc[:, 'b2'] - x.loc[:, 'b1'])], axis=1)
b2 = tmp.groupby(level='B').get_group('b2')
b2.index = b2.index.droplevel(level='B')
print(b2)
But that gives a table without the a3 row (column names can be fixed with renames):
Value Value
A
a1 3 2.0
a2 30 20.0
a4 40 NaN
The reason for this is that it seems the transform preserves the original index and the extra NaN are not inserted into the table, right? Is it possible to fix this?
So I tried an alternative with explicit diff between columns in groups:
import numpy as np
import pandas as pd
import itertools
index = pd.MultiIndex.from_tuples([
('a1', 'b1'),
('a1', 'b2'),
('a2', 'b1'),
('a2', 'b2'),
('a3', 'b1'),
('a4', 'b2'),
], names=['A', 'B'])
input_table = pd.DataFrame({'Value': [1,3, 10, 30, 100, 40]}, index=index)
grouped = input_table.groupby(level='B')
b1 = grouped.get_group('b1')
b1.index = b1.index.droplevel(level='B')
b2 = grouped.get_group('b2')
b2.index = b2.index.droplevel(level='B')
b2 = b2.rename(columns={'Value': 'b2'})
b2 = pd.concat([b2, b2['b2'] - b1['Value']], axis=1)
b2.rename(columns={0: 'b2-b1'}, inplace=True)
print(b2)
This works but it seems rather complicated with all that code to drop levels so the diff between columns work and separated column renames. Is it possible to make it simpler?