Here is my pandas.DataFrame:

    a  b
0   1  5
1   1  7
2   2  3
3   1  3
4   2  5
5   2  6
6   1  4
7   1  3
8   2  7
9   2  4
10  2  5
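
For anyone who wants to reproduce this, the frame can be rebuilt like so (a minimal sketch, assuming plain integer columns):

    import pandas as pd

    # Sample data from the table above
    df = pd.DataFrame({'a': [1, 1, 2, 1, 2, 2, 1, 1, 2, 2, 2],
                       'b': [5, 7, 3, 3, 5, 6, 4, 3, 7, 4, 5]})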

I want to create a new DataFrame that contains the data grouped by 'a', with the sum of the 3 largest values of 'b' for each group.

Here is the output I expect. The 3 largest values of 'b' for group 1 are 7, 5 and 4, and for group 2 they are 7, 6 and 5.

a
1  16
2  18

df.groupby('a')['b'].nlargest(3)

gives me this output,

 a    
 1  1     7
    0     5
    6     4
 2  8     7
    5     6
    10    5

and

df.groupby('a')['b'].nlargest(3).sum()

gives me the total sum 34 (16+18).

How can I get the expected output with pandas.DataFrame?

Thank you!

2 Answers

Using apply is one way to do it.

In [41]: df.groupby('a')['b'].apply(lambda x: x.nlargest(3).sum())
Out[41]:
a
1    16
2    18
Name: b, dtype: int64
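
The same callable also works with agg instead of apply (a minimal sketch, assuming a reducing lambda that returns one scalar per group; the result matches the Series above):

# each group's 'b' values are passed to the lambda as a Series
df.groupby('a')['b'].agg(lambda x: x.nlargest(3).sum())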

Timings

In [42]: dff = pd.concat([df]*1000).reset_index(drop=True)

In [43]: dff.shape
Out[43]: (11000, 2)

In [44]: %timeit dff.groupby('a')['b'].apply(lambda x: x.nlargest(3).sum())
100 loops, best of 3: 2.44 ms per loop

In [45]: %timeit dff.groupby('a')['b'].nlargest(3).groupby(level='a').sum()
100 loops, best of 3: 3.44 ms per loop

5 Comments

Interesting, it seems the timings depend on the size of the groups; see my timings. Can you test it on your PC, please?
It seems that if the groups are large (you have only 2 large groups, while I have 15), apply is faster.
You're right. I got 138 vs 140 ms, with apply performing marginally better for your test case.
138 vs 140 is basically the same timing ;) So both solutions are nice, I think.
Thank you, John. If I have a longer output list, how can I sort it and select the top three?

Use a double groupby - the second one by level 'a' of the MultiIndex:

s = df.groupby('a')['b'].nlargest(3).groupby(level='a').sum()
print (s)
a
1    16
2    18
Name: b, dtype: int64

But for me this is nicer:

df.groupby('a')['b'].nlargest(3).sum(level=0)

Thank you, Nickil Maveli.
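
Note: in newer pandas versions the level argument to Series.sum was deprecated (around 1.3) and later removed, so there the MultiIndex form is the one to keep (a sketch, equivalent to the groupby(level='a') version above):

# on pandas without sum(level=...), group on the index level instead
df.groupby('a')['b'].nlargest(3).groupby(level=0).sum()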

EDIT: If you need only the top 3 of the output again, use Series.nlargest:

df = pd.DataFrame({'a': [1, 1, 2, 3, 2, 2, 1, 3, 4, 3, 4],
                   'b': [5, 7, 3, 3, 5, 6, 4, 3, 7, 4, 5]})

print (df)
    a  b
0   1  5
1   1  7
2   2  3
3   3  3
4   2  5
5   2  6
6   1  4
7   3  3
8   4  7
9   3  4
10  4  5


df = df.groupby('a')['b'].nlargest(3).sum(level=0).nlargest(3)
print (df)
a
1    16
2    14
4    12
Name: b, dtype: int64

Timings:

np.random.seed(123)
N = 1000000

L2 = np.arange(100)

df = pd.DataFrame({'b':np.random.randint(20, size=N), 
                   'a': np.random.choice(L2, N)})

print (df)

In [22]: %timeit df.groupby('a')['b'].apply(lambda x: x.nlargest(3).sum())
10 loops, best of 3: 125 ms per loop

In [23]: %timeit df.groupby('a')['b'].nlargest(3).groupby(level='a').sum()
10 loops, best of 3: 121 ms per loop

In [29]: %timeit df.groupby('a')['b'].nlargest(3).sum(level=0)
10 loops, best of 3: 121 ms per loop

np.random.seed(123)
N = 1000000

L2 = list('abcdefghijklmno')

df = pd.DataFrame({'b':np.random.randint(20, size=N), 
                   'a': np.random.choice(L2, N)})

print (df)

In [19]: %timeit df.groupby('a')['b'].apply(lambda x: x.nlargest(3).sum())
10 loops, best of 3: 97.9 ms per loop

In [20]: %timeit df.groupby('a')['b'].nlargest(3).groupby(level='a').sum()
10 loops, best of 3: 96.5 ms per loop

In [31]: %timeit df.groupby('a')['b'].nlargest(3).sum(level=0)
10 loops, best of 3: 97.9 ms per loop

np.random.seed(123)
N = 1000000

L2 = list('abcde')

df = pd.DataFrame({'b':np.random.randint(20, size=N), 
                   'a': np.random.choice(L2, N)})

print (df)


In [25]: %timeit df.groupby('a')['b'].apply(lambda x: x.nlargest(3).sum())
10 loops, best of 3: 82 ms per loop

In [26]: %timeit df.groupby('a')['b'].nlargest(3).groupby(level='a').sum()
10 loops, best of 3: 81.9 ms per loop

In [33]: %timeit df.groupby('a')['b'].nlargest(3).sum(level=0)
10 loops, best of 3: 82.5 ms per loop

4 Comments

This is useful. (You might be interested in the timings...) Ah, you already have them. Neat.
df.groupby('a')['b'].nlargest(3).sum(level=0) - slightly faster
@NickilMaveli - very nice code, I think; I had never seen it before. Thanks.
@Ivo - I added a solution for the top 3 of the output, please check it.
