Pandas MultiIndex DataFrame reference index value in column calculation

Question

I want to efficiently use values from a DataFrame's MultiIndex in some calculations. For example, starting with:

import random
import numpy as np
import pandas as pd
random.seed(456)     # seeds random.sample, used to pick the dates
np.random.seed(456)  # seeds np.random.randn, used for Vals
j = [(a, b) for a in ['A','B','C'] for b in random.sample(pd.date_range('2017-01-01', periods=50, freq='W').tolist(), 5)]
i = pd.MultiIndex.from_tuples(j, names=['Name','Num'])
df = pd.DataFrame(np.random.randn(15), i, columns=['Vals'])
df['SmallestNum'] = df.reset_index(level=1).groupby('Name')['Num'].transform('min').values

Suppose I want to calculate a new column Diff = Num - SmallestNum. An efficient but, I assume, kludgy way is to copy the Index level I want to reference into a bona fide column and then do the difference:

df['NumCol'] = df.index.get_level_values(1)
df['Diff'] = df['NumCol'] - df['SmallestNum']

But I feel like I'm still not understanding the proper way to work with DataFrames if I'm doing this. I thought the "correct" solution would look like either of the following, which don't create and store a full copy of the index values:

df['Diff'] = df.transform(lambda x: x.index.get_level_values(1) - x['SmallestNum'])
df['Diff'] = df.reset_index(level=1).apply(lambda x: x['Num'] - x['SmallestNum'])

However, neither of these expressions works*, and my understanding is that DataFrame operations like .transform or .apply are bound to be significantly slower than explicitly vectorized column operations.

So what is the "correct and efficient" way to write the calculation for the new Diff column in this example?

* Update: This problem was compounded by the fact (possibly a bug) that the index level 1 values were not unique. Formulas that work when the index values are unique then fail with NotImplementedError: Index._join_level on non-unique index is not implemented. Fortunately jezrael's answer contains workarounds that appear to be as efficient as explicitly vectorized calculation.
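Separately from the non-unique index issue, one likely reason the two lambda expressions fail is that .transform and .apply default to axis=0, so the lambda receives one column at a time and x['SmallestNum'] is not a valid lookup on it. A minimal sketch, assuming a small made-up frame with integer Num values, that compares a row-wise apply (axis=1) against the plain array subtraction:

```python
import numpy as np
import pandas as pd

# Small illustrative frame; names mirror the question, values are made up.
idx = pd.MultiIndex.from_tuples(
    [("A", 28), ("A", 44), ("B", 38), ("B", 46)], names=["Name", "Num"]
)
df = pd.DataFrame(
    {"Vals": [1.0, 2.0, 3.0, 4.0], "SmallestNum": [28, 28, 38, 38]}, index=idx
)

flat = df.reset_index(level=1)  # 'Num' becomes an ordinary column

# Row-wise: axis=1 hands each row to the lambda, so x['Num'] is a scalar lookup.
row_wise = flat.apply(lambda x: x["Num"] - x["SmallestNum"], axis=1)

# Vectorized: a single array subtraction, with no Python-level loop over rows.
vectorized = (
    df.index.get_level_values("Num").to_numpy() - df["SmallestNum"].to_numpy()
)

# Both routes should agree on the values.
same = bool(np.array_equal(row_wise.to_numpy(), vectorized))
```

The row-wise version calls the lambda once per row, which is why it tends to be much slower on large frames than the array subtraction.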

jezrael · Accepted Answer · 2018-02-25 20:35:15Z

I think you simply need to subtract:

df['Diff'] = df.index.get_level_values(1) - df['SmallestNum']
print (df)

              Vals  SmallestNum  Diff
Name Num                             
A    28   1.180140           28     0
     44   0.984257           28    16
     90   1.835646           28    62
     43  -1.886823           28    15
     29   0.424763           28     1
B    80  -0.433105           38    42
     61  -0.166838           38    23
     46   0.754634           38     8
     38   1.966975           38     0
     93   0.200671           38    55
C    40   0.742752           12    28
     82  -1.264271           12    70
     12  -0.112787           12     0
     78   0.667358           12    66
     70   0.357900           12    58

EDIT: For a non-unique DatetimeIndex in the second level, subtracting the NumPy arrays obtained with .values works:

np.random.seed(456)
a = pd.date_range('2015-01-01', periods=6).values
j = [['A'] * 5 + ['B'] * 5 + ['C'] * 5, pd.to_datetime(np.random.choice(a, size=15))]
i = pd.MultiIndex.from_arrays(j, names=['Name','Num'])
df = pd.DataFrame(np.random.randn(15), i, columns=['Vals'])
df['SmallestNum'] = df.reset_index(level=1).groupby('Name')['Num'].transform('min').values
df['Diff'] = df.index.get_level_values(1).values - df['SmallestNum'].values
print (df)
                     Vals SmallestNum   Diff
Name Num                                    
A    2015-01-04 -1.842419  2015-01-02 2 days
     2015-01-06 -0.786788  2015-01-02 4 days
     2015-01-04  1.180140  2015-01-02 2 days
     2015-01-02  0.984257  2015-01-02 0 days
     2015-01-03  1.835646  2015-01-02 1 days
B    2015-01-05 -1.886823  2015-01-03 2 days
     2015-01-03  0.424763  2015-01-03 0 days
     2015-01-05 -0.433105  2015-01-03 2 days
     2015-01-06 -0.166838  2015-01-03 3 days
     2015-01-05  0.754634  2015-01-03 2 days
C    2015-01-06  1.966975  2015-01-02 4 days
     2015-01-06  0.200671  2015-01-02 4 days
     2015-01-05  0.742752  2015-01-02 3 days
     2015-01-02 -1.264271  2015-01-02 0 days
     2015-01-04 -0.112787  2015-01-02 2 days
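The reason this helps, as far as I can tell, is that two plain NumPy arrays are subtracted element by element by position, so pandas never has to align labels. A self-contained sketch on a small hypothetical frame (the dates and the diff_days helper variable are made up for illustration), using .to_numpy() in place of .values:

```python
import numpy as np
import pandas as pd

# Hypothetical data: dates repeat both within and across names.
dates = pd.to_datetime(
    ["2015-01-04", "2015-01-02", "2015-01-04",
     "2015-01-03", "2015-01-05", "2015-01-03"]
)
idx = pd.MultiIndex.from_arrays(
    [["A"] * 3 + ["B"] * 3, dates], names=["Name", "Num"]
)
df = pd.DataFrame({"Vals": np.arange(6.0)}, index=idx)

# Per-name minimum of the 'Num' level, written back by position.
df["SmallestNum"] = (
    df.reset_index(level=1).groupby("Name")["Num"].transform("min").to_numpy()
)

# Both operands are plain datetime64 arrays, so nothing is aligned on labels.
diff = df.index.get_level_values("Num").to_numpy() - df["SmallestNum"].to_numpy()
df["Diff"] = diff

# Whole days per row, only to make the result easy to inspect.
diff_days = (diff / np.timedelta64(1, "D")).astype(int).tolist()
```

Because the assignment is positional, it relies on the row order of df staying unchanged between computing SmallestNum and Diff.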

Another solution:

df['Diff'] = (df.reset_index(level=1)
                .groupby('Name')['Num']
                .transform(lambda x: x - x.min())
                .values)
print (df)
                     Vals   Diff
Name Num                        
A    2015-01-04 -1.842419 2 days
     2015-01-06 -0.786788 4 days
     2015-01-04  1.180140 2 days
     2015-01-02  0.984257 0 days
     2015-01-03  1.835646 1 days
B    2015-01-05 -1.886823 2 days
     2015-01-03  0.424763 0 days
     2015-01-05 -0.433105 2 days
     2015-01-06 -0.166838 3 days
     2015-01-05  0.754634 2 days
C    2015-01-06  1.966975 4 days
     2015-01-06  0.200671 4 days
     2015-01-05  0.742752 3 days
     2015-01-02 -1.264271 0 days
     2015-01-04 -0.112787 2 days
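A possible variant of the same idea, sketched here on made-up data, skips reset_index by wrapping the level values in a Series and grouping on the index level directly with groupby(level='Name'). The names num and smallest are only illustrative:

```python
import pandas as pd

# Hypothetical data with a repeated date inside group A.
dates = pd.to_datetime(
    ["2015-01-04", "2015-01-02", "2015-01-04", "2015-01-05", "2015-01-03"]
)
idx = pd.MultiIndex.from_arrays(
    [["A", "A", "A", "B", "B"], dates], names=["Name", "Num"]
)
df = pd.DataFrame({"Vals": [0.1, 0.2, 0.3, 0.4, 0.5]}, index=idx)

# Wrap the level's values in a Series that keeps the original MultiIndex.
num = pd.Series(df.index.get_level_values("Num").to_numpy(), index=df.index)

# Group on the 'Name' level; transform('min') returns one value per row.
smallest = num.groupby(level="Name").transform("min")

# Subtract and assign by position so duplicate labels never need aligning.
df["Diff"] = num.to_numpy() - smallest.to_numpy()
```

Using the built-in 'min' with transform, rather than a lambda, keeps the aggregation on pandas' optimized path.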

edited Feb 25, 2018 at 20:35

answered Feb 25, 2018 at 20:09

7 Comments

feetwet Over a year ago

That works on the example in the OP, but it's failing on a bigger test case where the index level 1 is of type datetime64 with NotImplementedError: Index._join_level on non-unique index is not implemented. Let me see if I can generate a test case showing that....

feetwet Over a year ago

Yeah, this doesn't seem to work with time indices. I'm updating my example with the following, which causes what you show here to break:

j = [(a, b) for a in ['A','B','C'] for b in random.sample(pd.date_range('2017-01-01', periods=50, freq='W').tolist(), 5)]

jezrael Over a year ago

I reproduced it: there are duplicated datetimes. Please give me some time.

jezrael Over a year ago

It seems to be a bug, but it works if you convert to NumPy arrays.

feetwet Over a year ago

Ah ha: I see that in my example. So it gets trickier when the index level has non-unique values! I eagerly await your insight!
