7

I have a pandas DataFrame with a MultiIndex of columns:

columns=pd.MultiIndex.from_tuples(
    [(c, i) for c in ['a', 'b'] for i in range(3)])
df = pd.DataFrame(np.random.randn(4, 6),
                  index=[0, 0, 1, 1],
                  columns=columns)
print(df)

          a                             b                    
          0         1         2         0         1         2
0  0.582804  0.753118 -0.900950 -0.914657 -0.333091 -0.965912
0  0.498002 -0.842624  0.155783  0.559730 -0.300136 -1.211412
1  0.727019  1.522160  1.679025  1.738350  0.593361  0.411907
1  1.253759 -0.806279 -2.177582 -0.099210 -0.839822 -0.211349

I want to group by the index, and use the 'min' aggregation on the a columns, and the 'sum' aggregation on the b columns.

I know I can do this by creating a dict that specifies the agg function for each column:

agg_dict = {'a': 'min', 'b': 'sum'}
full_agg_dict = {(c, i): agg_dict[c] for c in ['a', 'b'] for i in range(3)}
print(df.groupby(level=0).agg(full_agg_dict))

          a                             b                    
          0         1         2         0         1         2
0  0.498002 -0.842624 -0.900950 -0.354927 -0.633227 -2.177324
1  0.727019 -0.806279 -2.177582  1.639140 -0.246461  0.200558

Is there a simpler way? It seems like there should be a way to do this with agg_dict without using full_agg_dict.

1
  • 3
    I dont know if there's anything simpler. Probably just make the dictionary more flexible (and easier to read) if it doesn't follow a perfect pattern: {x: agg_dict[x[0]] for x in df.columns} Commented Sep 5, 2019 at 18:50

1 Answer 1

2

I would use your approach as well. But here's another way that (should) work:

(df.stack(level=1)
   .groupby(level=[0,1])
   .agg({'a':'min','b':'sum'})
   .unstack(-1)
)

For some reason groupby(level=[0,1] doesn't work for me, so I came up with:

(df.stack(level=1)
   .reset_index()
   .groupby(['level_0','level_1'])
   .agg({'a':'min','b':'sum'})
   .unstack('level_1')
)
Sign up to request clarification or add additional context in comments.

3 Comments

Reason the first solution does not work for this dataframe is because of the original dataframe's index is a 1D array of shape (4,). If instead a 2D array (index=[[0, 0, 1, 1]]) was passed, it would work just fine.
@ALollz agreed. That's what I said at the very beginning as well.
@SIA it's a bug with stack. Codes are created incorrectly (which are then used in the groupby) when the index has duplicate values. stack currently just uses new_codes = [np.arange(N).repeat(levsize)] to generate new codes, which ignores dupes.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.