I'm using Pandas and Numpy on Python3 with the following versions:

  • Python 3.5.1 (via Anaconda 2.5.0) 64 bits
  • Pandas 0.19.1
  • Numpy 1.11.2 (probably not relevant here)

Here is the minimal code producing the problem:

import pandas as pd
import numpy as np

a = pd.DataFrame({'i' : [1,1,1,1,1], 'a': [1,2,5,6,100], 'b': [2, 4,10, np.nan, np.nan]})
a.set_index(keys='a', inplace=True)
v = a.groupby(level=0).apply(lambda x: x.sort_values(by='i')['b'].rolling(2, min_periods=0).mean())
v.index.names

This code is a simple groupby-apply, but I don't understand the outcome:

FrozenList(['a', 'a'])

For some reason, the index names of the result are ['a', 'a'], which seems a questionable choice on pandas' part. I would have expected a simple ['a'].

Does anyone have some idea about why Pandas chooses to duplicate the column in the index?

Thanks in advance.

  • I think it's because of the call to sort_values: this returns a new df, so its index is being concatenated with the existing groupby index. You could argue that it shouldn't do this, but normally apply expects a scalar value to be returned. As a Series or DataFrame is being returned, it looks like it's aligning and concatenating here. Commented Feb 7, 2017 at 14:14
  • What are you trying to group by? Within your a.groupby() you should pass the parameter as_index=False. Commented Feb 7, 2017 at 14:14
  • @A.Kot you would get [None, 'a'] for the index names here when you pass as_index=False. I think the OP is asking why there are 2 index levels here, as well as why the index is duplicated. Commented Feb 7, 2017 at 14:16

1 Answer

This happens because sort_values returns a DataFrame or Series, so the function passed to apply returns an indexed object rather than a scalar, and that object's index is concatenated onto the existing groupby index. The same thing happens if you call shift on the 'b' column:

In [99]:
v = a.groupby(level=0).apply(lambda x: x['b'].shift())
v

Out[99]:
a    a  
1    1     NaN
2    2     NaN
5    5     NaN
6    6     NaN
100  100   NaN
Name: b, dtype: float64
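
As a rough illustration of the concatenation described above, the duplicated level name can be reproduced with pd.concat alone. This is only a sketch of the effect, not pandas' actual internal code path; piece1 and piece2 are made-up stand-ins for what the lambda returns for the first two groups.

```python
import pandas as pd

# Hypothetical per-group results: each piece keeps its own index named 'a'.
piece1 = pd.Series([2.0], index=pd.Index([1], name='a'), name='b')
piece2 = pd.Series([4.0], index=pd.Index([2], name='a'), name='b')

# Prepending the group keys as an outer level, also named 'a',
# stacks a second 'a' level on top of the pieces' own index.
glued = pd.concat([piece1, piece2], keys=[1, 2], names=['a'])
names = list(glued.index.names)
print(names)
```

The outer level comes from the group keys and the inner one from the index each piece already carried, which is why both end up with the same name.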

Even with as_index=False, the result still has a MultiIndex:

In [102]:
v = a.groupby(level=0, as_index=False).apply(lambda x: x['b'].shift())
v

Out[102]:
   a  
0  1     NaN
1  2     NaN
2  5     NaN
3  6     NaN
4  100   NaN
Name: b, dtype: float64

If the lambda returns a plain scalar value, no duplicate index level is created:

In [104]:
v = a.groupby(level=0).apply(lambda x: x['b'].max())
v

Out[104]:
a
1       2.0
2       4.0
5      10.0
6       NaN
100     NaN
dtype: float64

I don't think this is a bug, but rather semantics to be aware of: some methods return an object with its own index, and that index is then aligned with the pre-existing groupby index.
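
If a single 'a' level is what you want, here are two possible ways to get it. This is a sketch that reuses the shift example from above; the names v1 and v2 are just for illustration, and the behaviour of group_keys may differ between pandas versions, so check it against the version you run.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({'i': [1, 1, 1, 1, 1],
                  'a': [1, 2, 5, 6, 100],
                  'b': [2, 4, 10, np.nan, np.nan]}).set_index('a')

# Option 1: ask groupby not to prepend the group keys to the result.
v1 = a.groupby(level=0, group_keys=False).apply(lambda x: x['b'].shift())

# Option 2: keep the default behaviour and drop the outer level afterwards.
# The level is given by position because both levels are named 'a'.
v2 = (a.groupby(level=0)
       .apply(lambda x: x['b'].shift())
       .reset_index(level=0, drop=True))

names1 = list(v1.index.names)
names2 = list(v2.index.names)
print(names1, names2)
```

Both variants leave a Series indexed by 'a' only, so v.index.names no longer shows the duplicate.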