import pandas as pd
df = pd.DataFrame({'x':[1,2,3,4,5,6],'y':[7,8,9,10,11,12],'z':['a','a','a','b','b','b']})
i = pd.Index([0,3,5,10,20])

The indices in i are from a larger dataframe, and df is a subset of that larger dataframe. So there will be indices in i that will not be in df. When I do

df.groupby('z').aggregate({'y':lambda x: sum(x.loc[i])}) #I know I can just use .aggregate({'y':sum}), this is just an example to illustrate my problem

I get this output

   y
z    
a NaN
b NaN

as well as a warning message

__main__:1: FutureWarning: 
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

How can I avoid this warning message and get the correct output? In my example the only valid indices for df are [0,3,5] so the expected output is:

   y
z    
a  7 #"sum" of index 0 
b  22 #sum of index [3,5]

EDIT

The answers here work great but they do not allow different types of aggregation of x and y columns. For example, let's say I want to sum all elements of x, but for y only sum the elements in index i:

df.groupby('z').aggregate({'x':sum, 'y': lambda x: sum(x.loc[i])})

this is the desired output:

    y   x
z
a   7   6
b  22  15
  • What is expected output? Commented Feb 27, 2019 at 14:05
  • @jezrael: I have updated my question with the expected output Commented Feb 27, 2019 at 14:09
  • Index 5 is not in i. Commented Feb 27, 2019 at 14:10
  • df[df.index.isin(i)].groupby('z')['y'].sum() Commented Feb 27, 2019 at 14:12
  • @ScottBoston: Sorry, you are right. I have updated my question Commented Feb 27, 2019 at 14:12

2 Answers


Edit for updated question:

df.groupby('z').agg({'x':'sum','y':lambda r: r.reindex(i).sum()})

Output:

    x   y
z        
a   6   7
b  15  22

Use reindex to select only the labels in i, then dropna to remove the all-NaN rows that reindex creates for labels in i that aren't in df. Then groupby and agg:

df.reindex(i).dropna(how='all').groupby('z').agg({'y':'sum'})

Or skip dropna altogether, because groupby ignores rows whose group key is NaN:

df.reindex(i).groupby('z').agg({'y':'sum'})

Output:

      y
z      
a   7.0
b  22.0
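
To see why the version without dropna still gives the right answer, here is a minimal sketch using the df and i from the question. The names reindexed and result are only for illustration. It shows that reindex adds all-NaN rows for the labels missing from df, and that groupby never forms a group from them because their 'z' key is NaN:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6],
                   'y': [7, 8, 9, 10, 11, 12],
                   'z': ['a', 'a', 'a', 'b', 'b', 'b']})
i = pd.Index([0, 3, 5, 10, 20])

# reindex keeps labels 0, 3, 5 and adds all-NaN rows for the missing labels 10 and 20
reindexed = df.reindex(i)
print(reindexed)

# groupby skips rows whose key ('z') is NaN, so the added rows never form a group
result = reindexed.groupby('z').agg({'y': 'sum'})
print(result)
```

The y column comes back as float because the NaN rows introduced by reindex force an upcast from int.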

4 Comments

This is great, thanks, but I don't really understand why this works. df.reindex(i) adds a row with 10 NaN NaN NaN, so why doesn't the aggregation return NaN as in my original problem?
Yes, groupby doesn't form groups for NaN keys, so those rows are skipped. However, you could use dropna with how='all' to remove those NaN records explicitly, as I have shown in the first statement.
See SO Post stackoverflow.com/a/18431417/6361531 about missing values in groupby.
Thank you for the link. This solution works well, but it won't work for the edited problem in my question... Is there a workaround?

Use intersection of df.index and i to get only the matching labels, then process the data as needed:

print (df.loc[df.index.intersection(i)])
   x   y  z
0  1   7  a
3  4  10  b
5  6  12  b

# assign to a new name so the original df stays available for the EDIT below
out = df.loc[df.index.intersection(i)].groupby('z').agg({'y':'sum'})
# alternative from the comments:
#out = df.loc[df.index.isin(i)].groupby('z').agg({'y':'sum'})
print (out)
    y
z    
a   7
b  22

EDIT:

df1 = df.groupby('z').aggregate({'x':'sum', 'y': lambda x: sum(x.loc[x.index.intersection(i)])})
# alternative from the comments:
#df1 = df.groupby('z').aggregate({'x':'sum', 'y': lambda x: sum(x.loc[x.index.isin(i)])})
print (df1)
    x   y
z        
a   6   7
b  15  22
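
As an alternative that avoids the lambda, the filtering can be done before grouping. This is only a sketch, not part of the answer above: the helper column y_in_i and the names masked and out are made up for illustration, and it assumes a pandas version that supports named aggregation (0.25 or later). Rows whose label is not in i get NaN in the helper column, and sum skips NaN by default:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6],
                   'y': [7, 8, 9, 10, 11, 12],
                   'z': ['a', 'a', 'a', 'b', 'b', 'b']})
i = pd.Index([0, 3, 5, 10, 20])

# Hypothetical helper column: y where the row label is in i, NaN elsewhere
masked = df.assign(y_in_i=df['y'].where(df.index.isin(i)))

# Sum all of x, but only the masked values of y (NaN is skipped by sum)
out = masked.groupby('z').agg(x=('x', 'sum'), y=('y_in_i', 'sum'))
print(out)
```

Here y is float rather than int, because masking with NaN upcasts the column.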
