aggregation with indices not present in dataframe

Question

import pandas as pd

df = pd.DataFrame({'x':[1,2,3,4,5,6],'y':[7,8,9,10,11,12],'z':['a','a','a','b','b','b']})
i = pd.Index([0,3,5,10,20])

The indices in i are from a larger dataframe, and df is a subset of that larger dataframe. So there will be indices in i that will not be in df. When I do

df.groupby('z').aggregate({'y':lambda x: sum(x.loc[i])}) #I know I can just use .aggregate({'y':sum}), this is just an example to illustrate my problem

I get this output

   y
z    
a NaN
b NaN

as well as a warning message

__main__:1: FutureWarning: 
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

How can I avoid this warning message and get the correct output? In my example the only valid indices for df are [0,3,5] so the expected output is:

   y
z    
a  7 #"sum" of index 0 
b  22 #sum of index [3,5]

EDIT

The answers here work great but they do not allow different types of aggregation of x and y columns. For example, let's say I want to sum all elements of x, but for y only sum the elements in index i:

df.groupby('z').aggregate({'x':sum, 'y': lambda x: sum(x.loc[i])})

This is the desired output:

    x   y
z        
a   6   7
b  15  22

@jezrael: I have updated my question with the expected output.
– HappyPy, Commented Feb 27, 2019 at 14:09
@ScottBoston: Sorry, you are right. I have updated my question.
– HappyPy, Commented Feb 27, 2019 at 14:12

Scott Boston · Accepted Answer · 2020-06-20 09:12:55Z

3

Edit for updated question:

df.groupby('z').agg({'x':'sum','y':lambda r: r.reindex(i).sum()})
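A minimal sketch of why the per-group reindex works, assuming the df and i from the question. reindex(i) inserts NaN for every label in i that the group lacks, and Series.sum skips NaN by default, whereas the built-in sum propagates it. The names group_a, aligned, pandas_total and builtin_total are only for illustration.

```python
import math
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6],
                   'y': [7, 8, 9, 10, 11, 12],
                   'z': ['a', 'a', 'a', 'b', 'b', 'b']})
i = pd.Index([0, 3, 5, 10, 20])

# The 'y' values of group 'a', as the lambda would receive them
group_a = df.loc[df['z'] == 'a', 'y']

aligned = group_a.reindex(i)      # labels 3, 5, 10, 20 are missing -> NaN
pandas_total = aligned.sum()      # Series.sum skips NaN (skipna=True by default)
builtin_total = sum(aligned)      # built-in sum adds NaN in, so the total is NaN

# Mixed aggregation: plain sum for x, sum restricted to labels in i for y
result = df.groupby('z').agg({'x': 'sum', 'y': lambda r: r.reindex(i).sum()})
print(result)
```

This is also why the original lambda returned NaN: it used the built-in sum on values that already contained NaN.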


Use reindex to select only the labels in i, then dropna to remove the all-NaN rows created for labels in i that aren't in df. Then groupby and agg:

df.reindex(i).dropna(how='all').groupby('z').agg({'y':'sum'})

Or skip dropna, because groupby already excludes rows whose group key is NaN:

df.reindex(i).groupby('z').agg({'y':'sum'})
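As a quick check, assuming the df and i from the question, the two variants can be compared directly. The names reindexed, with_dropna and without_dropna are illustrative only. Note that reindex introduces NaN, so y is stored as float here.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6],
                   'y': [7, 8, 9, 10, 11, 12],
                   'z': ['a', 'a', 'a', 'b', 'b', 'b']})
i = pd.Index([0, 3, 5, 10, 20])

reindexed = df.reindex(i)   # rows labelled 10 and 20 are entirely NaN

# Variant 1: drop the all-NaN rows explicitly before grouping
with_dropna = reindexed.dropna(how='all').groupby('z').agg({'y': 'sum'})

# Variant 2: rely on groupby ignoring rows whose key 'z' is NaN
without_dropna = reindexed.groupby('z').agg({'y': 'sum'})

print(without_dropna)
```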


edited Jun 20, 2020 at 9:12


answered Feb 27, 2019 at 14:06

Scott Boston


4 Comments

HappyPy Over a year ago

This is great, thanks, but I don't really understand why this works. df.reindex(i) adds a row with 10 NaN NaN NaN, so why doesn't the aggregation return NaN as in my original problem?

Scott Boston Over a year ago

groupby doesn't form a group for NaN keys, so the rows that reindex filled with NaN are left out of the aggregation. However, you could use dropna with how='all' to remove those NaN records, as I have shown in the first statement.

Scott Boston Over a year ago

See SO Post stackoverflow.com/a/18431417/6361531 about missing values in groupby.
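A small sketch of the behaviour described in these comments, assuming the df and i from the question: the rows that reindex adds have NaN in z, and groupby leaves them out by default. The names n_rows, n_nan_keys and sizes are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6],
                   'y': [7, 8, 9, 10, 11, 12],
                   'z': ['a', 'a', 'a', 'b', 'b', 'b']})
i = pd.Index([0, 3, 5, 10, 20])

reindexed = df.reindex(i)

n_rows = len(reindexed)                          # every label in i gets a row
n_nan_keys = int(reindexed['z'].isna().sum())    # rows added for labels 10 and 20

# Group sizes: only rows with a real key are counted
sizes = reindexed.groupby('z').size()
print(sizes)
```

The group sizes add up to fewer rows than the reindexed frame contains, which is the gap left by the NaN keys.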

HappyPy Over a year ago

Thank you for the link. This solution works well, but it won't work for the edited problem in my question... Is there a workaround?

jezrael · Accepted Answer · 2019-02-27 14:38:36Z

3

Use intersection on df.index and i to keep only the matching labels, then process the data as needed:

print (df.loc[df.index.intersection(i)])
   x   y  z
0  1   7  a
3  4  10  b
5  6  12  b

# assigned to df2 so that the original df stays available for the EDIT below
df2 = df.loc[df.index.intersection(i)].groupby('z').agg({'y':'sum'})
# alternative: boolean mask with isin
#df2 = df.loc[df.index.isin(i)].groupby('z').agg({'y':'sum'})
print (df2)
    y
z    
a   7
b  22
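The two selection styles above can be compared side by side. This is a sketch assuming the df and i from the question, with illustrative names common, mask, by_labels and by_mask. intersection returns the shared labels, while isin returns a boolean mask over df's own index; they pick the same rows here, though they could differ in order or with duplicate labels.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6],
                   'y': [7, 8, 9, 10, 11, 12],
                   'z': ['a', 'a', 'a', 'b', 'b', 'b']})
i = pd.Index([0, 3, 5, 10, 20])

common = df.index.intersection(i)   # labels present in both df.index and i
mask = df.index.isin(i)             # one boolean per row of df

by_labels = df.loc[common]          # label-based selection, no missing labels
by_mask = df.loc[mask]              # mask-based selection, keeps df's row order

print(by_labels)
```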

EDIT:

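# 'sum' as a string avoids the warning newer pandas versions raise for the built-in sum
df1 = df.groupby('z').aggregate({'x':'sum', 'y': lambda x: sum(x.loc[x.index.intersection(i)])})
# alternative: boolean mask with isin
#df1 = df.groupby('z').aggregate({'x':'sum', 'y': lambda x: sum(x.loc[x.index.isin(i)])})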
print (df1)
    x   y
z        
a   6   7
b  15  22
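As an alternative that is not part of the answer above, the restriction to i can be applied before grouping, which avoids the lambda. This sketch assumes the df and i from the question and only works for sums, because rows outside i are replaced by 0; the name masked is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6],
                   'y': [7, 8, 9, 10, 11, 12],
                   'z': ['a', 'a', 'a', 'b', 'b', 'b']})
i = pd.Index([0, 3, 5, 10, 20])

# Keep y where the row label is in i, otherwise use 0 (neutral for a sum)
masked = df.assign(y=df['y'].where(df.index.isin(i), 0))

# Both columns can now use the plain string aggregation
result = masked.groupby('z').agg({'x': 'sum', 'y': 'sum'})
print(result)
```

For aggregations other than sum, such as mean or min, the per-group intersection shown in the answer is the safer choice, since a filler value of 0 would distort the result.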

edited Feb 27, 2019 at 14:38

answered Feb 27, 2019 at 14:09

jezrael
