I want to efficiently use values from a DataFrame's MultiIndex in some calculations. For example, starting with:
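import random
import numpy as np
import pandas as pd
random.seed(456)     # seeds random.sample below
np.random.seed(456)  # seeds np.random.randn below
j = [(a, b) for a in ['A','B','C'] for b in random.sample(pd.date_range('2017-01-01', periods=50, freq='W').tolist(), 5)]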
i = pd.MultiIndex.from_tuples(j, names=['Name','Num'])
df = pd.DataFrame(np.random.randn(15), i, columns=['Vals'])
df['SmallestNum'] = df.reset_index(level=1).groupby('Name')['Num'].transform('min').values
Suppose I want to calculate a new column Diff = Num - SmallestNum. An efficient but, I assume, kludgy way is to copy the Index level I want to reference into a bona fide column and then do the difference:
df['NumCol'] = df.index.get_level_values(1)
df['Diff'] = df['NumCol'] - df['SmallestNum']
But I feel like I'm still not understanding the proper way to work with DataFrames if I'm doing this. I thought the "correct" solution would look like either of the following, which don't create and store a full copy of the index values:
df['Diff'] = df.transform(lambda x: x.index.get_level_values(1) - x['SmallestNum'])
df['Diff'] = df.reset_index(level=1).apply(lambda x: x['Num'] - x['SmallestNum'])
... however, not only does neither of these expressions work*, but my understanding is also that DataFrame operations like .transform or .apply are bound to be significantly slower than ones that operate on explicit "vectorized" row references.
So what is the "correct and efficient" way to write the calculation for the new Diff column in this example?
* Update: This problem was compounded by the fact (possibly a bug) that the index level 1 values were not unique. Formulas that work when the index values are unique then fail with NotImplementedError: Index._join_level on non-unique index is not implemented. Fortunately, jezrael's answer contains workarounds that appear to be as efficient as an explicitly vectorized calculation.
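For anyone landing here, this is a minimal sketch of the kind of vectorized form I was after. It is my own illustration and not necessarily what the answer does. It assumes the setup above (the names num and df are just for this example) and subtracts the index level from the column positionally via .to_numpy(), so no index alignment or join is attempted and the non-unique level values should not matter:

```python
import random
import numpy as np
import pandas as pd

# Rebuild the example frame from the question (both RNGs seeded).
random.seed(456)
np.random.seed(456)
dates = pd.date_range('2017-01-01', periods=50, freq='W').tolist()
j = [(a, b) for a in ['A', 'B', 'C'] for b in random.sample(dates, 5)]
i = pd.MultiIndex.from_tuples(j, names=['Name', 'Num'])
df = pd.DataFrame(np.random.randn(15), i, columns=['Vals'])
df['SmallestNum'] = (
    df.reset_index(level=1).groupby('Name')['Num'].transform('min').values
)

# Pull the index level as a DatetimeIndex; it is not stored as a column.
num = df.index.get_level_values('Num')

# Positional subtraction against a plain NumPy array, so pandas does not
# try to align on the (possibly non-unique) index.
df['Diff'] = num - df['SmallestNum'].to_numpy()
```

The result is a timedelta column. The intermediate num still exists while the expression is evaluated, but nothing extra is kept in the DataFrame.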