Here is my pandas.DataFrame:

    a  b
0   1  5
1   1  7
2   2  3
3   1  3
4   2  5
5   2  6
6   1  4
7   1  3
8   2  7
9   2  4
10  2  5
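
For anyone who wants to reproduce this, the frame can be rebuilt like so (a minimal sketch, assuming plain integer columns):

    import pandas as pd

    # Sample data from the table above
    df = pd.DataFrame({'a': [1, 1, 2, 1, 2, 2, 1, 1, 2, 2, 2],
                       'b': [5, 7, 3, 3, 5, 6, 4, 3, 7, 4, 5]})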

I want to create a new DataFrame that contains the data grouped by 'a', with the sum of the 3 largest values of 'b' for each group.

Here is the output I expect. The 3 largest values of 'b' for group 1 are 7, 5 and 4, and for group 2 they are 7, 6 and 5.

a
1  16
2  18

df.groupby('a')['b'].nlargest(3)

gives me this output,

 a    
 1  1     7
    0     5
    6     4
 2  8     7
    5     6
    10    5

and

df.groupby('a')['b'].nlargest(3).sum()

gives me the total sum 34 (16+18).

How can I get the expected output with pandas.DataFrame?

Thank you!

2 Answers

Using apply is one way to do it.

In [41]: df.groupby('a')['b'].apply(lambda x: x.nlargest(3).sum())
Out[41]:
a
1    16
2    18
Name: b, dtype: int64
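
The same callable also works with agg instead of apply (a minimal sketch, assuming a reducing lambda that returns one scalar per group; the result matches the Series above):

# each group's 'b' values are passed to the lambda as a Series
df.groupby('a')['b'].agg(lambda x: x.nlargest(3).sum())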

Timings

In [42]: dff = pd.concat([df]*1000).reset_index(drop=True)

In [43]: dff.shape
Out[43]: (11000, 2)

In [44]: %timeit dff.groupby('a')['b'].apply(lambda x: x.nlargest(3).sum())
100 loops, best of 3: 2.44 ms per loop

In [45]: %timeit dff.groupby('a')['b'].nlargest(3).groupby(level='a').sum()
100 loops, best of 3: 3.44 ms per loop

5 Comments

Interesting, it seems the timings depend on the size of the groups; see my timings. Can you test it on your PC, please?
It seems that if the groups are large (you have only 2 large groups, while I have 15), apply is faster.
You're right. I got 138 vs 140 ms, with apply performing marginally better for your test case.
138 vs 140 is basically the same timing ;) So both solutions are nice, I think.
Thank you, John. If I have a longer output list, how can I sort it and select the top three?

Use a double groupby - the second one by level 'a' of the MultiIndex:

s = df.groupby('a')['b'].nlargest(3).groupby(level='a').sum()
print (s)
a
1    16
2    18
Name: b, dtype: int64

But for me this is nicer:

df.groupby('a')['b'].nlargest(3).sum(level=0)

Thank you, Nickil Maveli.
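
Note: in newer pandas versions the level argument to Series.sum was deprecated (around 1.3) and later removed, so there the MultiIndex form is the one to keep (a sketch, equivalent to the groupby(level='a') version above):

# on pandas without sum(level=...), group on the index level instead
df.groupby('a')['b'].nlargest(3).groupby(level=0).sum()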

EDIT: If you need only the top 3 of the output again, use Series.nlargest:

df = pd.DataFrame({'a': [1, 1, 2, 3, 2, 2, 1, 3, 4, 3, 4],
                   'b': [5, 7, 3, 3, 5, 6, 4, 3, 7, 4, 5]})

print (df)
    a  b
0   1  5
1   1  7
2   2  3
3   3  3
4   2  5
5   2  6
6   1  4
7   3  3
8   4  7
9   3  4
10  4  5


df = df.groupby('a')['b'].nlargest(3).sum(level=0).nlargest(3)
print (df)
a
1    16
2    14
4    12
Name: b, dtype: int64

Timings:

np.random.seed(123)
N = 1000000

L2 = np.arange(100)

df = pd.DataFrame({'b':np.random.randint(20, size=N), 
                   'a': np.random.choice(L2, N)})

print (df)

In [22]: %timeit df.groupby('a')['b'].apply(lambda x: x.nlargest(3).sum())
10 loops, best of 3: 125 ms per loop

In [23]: %timeit df.groupby('a')['b'].nlargest(3).groupby(level='a').sum()
10 loops, best of 3: 121 ms per loop

In [29]: %timeit df.groupby('a')['b'].nlargest(3).sum(level=0)
10 loops, best of 3: 121 ms per loop

np.random.seed(123)
N = 1000000

L2 = list('abcdefghijklmno')

df = pd.DataFrame({'b':np.random.randint(20, size=N), 
                   'a': np.random.choice(L2, N)})

print (df)

In [19]: %timeit df.groupby('a')['b'].apply(lambda x: x.nlargest(3).sum())
10 loops, best of 3: 97.9 ms per loop

In [20]: %timeit df.groupby('a')['b'].nlargest(3).groupby(level='a').sum()
10 loops, best of 3: 96.5 ms per loop

In [31]: %timeit df.groupby('a')['b'].nlargest(3).sum(level=0)
10 loops, best of 3: 97.9 ms per loop

np.random.seed(123)
N = 1000000

L2 = list('abcde')

df = pd.DataFrame({'b':np.random.randint(20, size=N), 
                   'a': np.random.choice(L2, N)})

print (df)


In [25]: %timeit df.groupby('a')['b'].apply(lambda x: x.nlargest(3).sum())
10 loops, best of 3: 82 ms per loop

In [26]: %timeit df.groupby('a')['b'].nlargest(3).groupby(level='a').sum()
10 loops, best of 3: 81.9 ms per loop

In [33]: %timeit df.groupby('a')['b'].nlargest(3).sum(level=0)
10 loops, best of 3: 82.5 ms per loop

4 Comments

This is useful. (You might be interested in the timings...) Ah, you already have them. Neat.
df.groupby('a')['b'].nlargest(3).sum(level=0) - slightly faster
@NickilMaveli - very nice code, I think; I had never seen it before. Thanks.
@Ivo - I added a solution for the top 3 of the output, please check it.
