I noticed that groupby().apply() produces different results for two DataFrames whose contents are identical, except that one has duplicate index values.
Here is a minimal reproducible example:
import pandas as pd
df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B', 'B'],
    'value': [1, 2, 1, 2, 2]
}, index=[0, 1, 1, 2, 3])  # note the duplicate index: 1 appears twice
result = df.groupby('group').apply(lambda g: g)
print(result)
Output (the group label becomes level 0 of the index, the original index becomes level 1):
         group  value
group
A     0      A      1
      1      A      2
B     1      B      1
      2      B      2
      3      B      2
But when I reset the index so it becomes unique:
df2 = df.reset_index(drop=True)
result2 = df2.groupby('group').apply(lambda g: g)
print(result2)
I get a different structure (especially inside the B group).
Why does the presence of duplicate index values change how groupby().apply() constructs the returned index? What is the correct way to preserve the original rows and avoid unexpected index nesting when applying functions?
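One workaround, sketched below under the assumption that you simply want the original rows back with their original index: pass group_keys=False so pandas does not prepend the group label as an extra index level. Selecting the non-grouping columns explicitly also sidesteps the deprecation warning about apply operating on the grouping column.

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B', 'B'],
    'value': [1, 2, 1, 2, 2],
}, index=[0, 1, 1, 2, 3])  # duplicate index: 1 appears twice

# group_keys=False: do not add the group label as an outer index level,
# so the result keeps the original (possibly duplicated) index.
result = df.groupby('group', group_keys=False)[['value']].apply(lambda g: g)
print(result.index.tolist())  # [0, 1, 1, 2, 3]
```

Because the groups are already contiguous and in sorted order here, the concatenated result comes back in the original row order; with interleaved groups the rows would be regrouped even though the index labels are preserved.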
The groupby column becomes the level-0 index and the original index becomes the level-1 index.

.apply(lambda g: g) doesn't do anything interesting, obviously, so were you trying to do something more useful when you noticed this behaviour?

DeprecationWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.

Are you getting that too? If so, you should address it explicitly, e.g. add include_groups=False as it says.