Python/Pandas: Unexpected indices when doing a groupby-apply

Question

I'm using Pandas and Numpy on Python3 with the following versions:

Python 3.5.1 (via Anaconda 2.5.0) 64 bits
Pandas 0.19.1
Numpy 1.11.2 (probably not relevant here)

Here is the minimal code producing the problem:

import pandas as pd
import numpy as np

a = pd.DataFrame({'i' : [1,1,1,1,1], 'a': [1,2,5,6,100], 'b': [2, 4,10, np.nan, np.nan]})
a.set_index(keys='a', inplace=True)
v = a.groupby(level=0).apply(lambda x: x.sort_values(by='i')['b'].rolling(2, min_periods=0).mean())
v.index.names

This code is a simple groupby-apply, but I don't understand the outcome:

FrozenList(['a', 'a'])

For some reason, the index names of the result are ['a', 'a'], which seems like a questionable choice on pandas' part. I would have expected simply ['a'].

Does anyone have some idea about why Pandas chooses to duplicate the column in the index?

Thanks in advance.

I think it's because of the call to sort_values: it returns a new DataFrame, so its index is concatenated with the existing groupby index. You could argue that it shouldn't do this, but apply normally expects a scalar to be returned; because a Series or DataFrame is returned here, it looks like pandas is aligning and concatenating.
– EdChum, Commented Feb 7, 2017 at 14:14
What are you trying to group by? Within your a.groupby() call you should pass the parameter as_index=False.
– A.Kot, Commented Feb 7, 2017 at 14:14
@A.Kot you would get [None, 'a'] for the index names here when you pass as_index=False. I think the OP is asking why there are 2 levels of indices here, as well as why the index is duplicated.
– EdChum, Commented Feb 7, 2017 at 14:16

EdChum · Accepted Answer · 2017-02-07 14:27:02Z

This is happening because sort_values returns a DataFrame or Series, so its index is concatenated onto the existing groupby index. The same thing happens if you call shift on the 'b' column:

In [99]:
v = a.groupby(level=0).apply(lambda x: x['b'].shift())
v

Out[99]:
a    a  
1    1     NaN
2    2     NaN
5    5     NaN
6    6     NaN
100  100   NaN
Name: b, dtype: float64
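
To see where the second level comes from, the effect can be reproduced by hand with pd.concat. This is only a sketch of the idea, not a description of pandas' internals; the names pieces and stacked are made up for illustration.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({'i': [1, 1, 1, 1, 1],
                  'a': [1, 2, 5, 6, 100],
                  'b': [2, 4, 10, np.nan, np.nan]}).set_index('a')

# One shifted Series per group, keyed by the group label.
# Each piece keeps its own index, which is also named 'a'.
pieces = {key: grp['b'].shift() for key, grp in a.groupby(level=0)}

# Stacking the pieces puts the group keys in front of each piece's own index,
# so the result carries two index levels.
stacked = pd.concat(pieces, names=['a'])
print(stacked.index.names)
```

The outer level holds the group keys and the inner level is the index each piece already had, which is why the same name shows up twice.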

Even with as_index=False, it still produces a MultiIndex:

In [102]:
v = a.groupby(level=0, as_index=False).apply(lambda x: x['b'].shift())
v

Out[102]:
   a  
0  1     NaN
1  2     NaN
2  5     NaN
3  6     NaN
4  100   NaN
Name: b, dtype: float64

If the lambda returns a plain scalar value, no duplicated index level is created:

In [104]:
v = a.groupby(level=0).apply(lambda x: x['b'].max())
v

Out[104]:
a
1       2.0
2       4.0
5      10.0
6       NaN
100     NaN
dtype: float64

I don't think this is a bug; rather, it is a semantic to be aware of: some methods return an object whose index is aligned with the pre-existing index.
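
If a single-level index is what you are after, two options that should work are passing group_keys=False to groupby, or calling the groupby method (here shift) directly instead of going through apply. A minimal sketch, assuming a reasonably recent pandas; the variable names flat and direct are only for illustration:

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({'i': [1, 1, 1, 1, 1],
                  'a': [1, 2, 5, 6, 100],
                  'b': [2, 4, 10, np.nan, np.nan]}).set_index('a')

# group_keys=False asks groupby not to add the group labels
# as an extra index level on the result of apply.
flat = a.groupby(level=0, group_keys=False).apply(lambda x: x['b'].shift())
print(flat.index.names)

# Calling shift on the grouped column directly keeps the original index.
direct = a.groupby(level=0)['b'].shift()
print(direct.index.names)
```

Both results keep only the original 'a' index. Which one fits better depends on whether the per-group logic can be expressed with a built-in groupby method.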
