pandas groupby, then sort within groups

Question

I want to group my dataframe by two columns and then sort the aggregated results within those groups.

In [167]: df

Out[167]:
   count     job source
0      2   sales      A
1      4   sales      B
2      6   sales      C
3      3   sales      D
4      7   sales      E
5      5  market      A
6      3  market      B
7      2  market      C
8      4  market      D
9      1  market      E


In [168]: df.groupby(['job','source']).agg({'count':sum})

Out[168]:
               count
job    source       
market A           5
       B           3
       C           2
       D           4
       E           1
sales  A           2
       B           4
       C           6
       D           3
       E           7

I would now like to sort the 'count' column in descending order within each of the groups, and then take only the top three rows. To get something like:

                count
job     source
market  A           5
        D           4
        B           3
sales   E           7
        C           6
        B           4

The reason this is tricky in pandas is that when you group by more than one column, the intermediate (grouper) object gets a MultiIndex containing those columns, and the original index is dropped, unless you override the default groupby(..., as_index=True).
– smci, Commented Jun 16, 2022 at 0:39

tvashtar · Accepted Answer · 2017-07-05 10:54:22Z

333

You could also just do it in one go, by doing the sort first and using head to take the first 3 of each group.

In [34]: df.sort_values(['job','count'],ascending=False).groupby('job').head(3)

Out[34]: 
   count     job source
4      7   sales      E
2      6   sales      C
1      4   sales      B
5      5  market      A
8      4  market      D
6      3  market      B
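
For reference, here is a self-contained sketch of the same sort-then-head idea using the question's data. One assumption differs from the snippet above: `ascending` is given as a list so that jobs stay in alphabetical order while counts are descending, which matches the layout the question asks for. The variable names `ordered` and `top3` are illustrative only.

```python
import pandas as pd

# The question's dataframe, rebuilt so the example runs on its own.
df = pd.DataFrame({
    "count": [2, 4, 6, 3, 7, 5, 3, 2, 4, 1],
    "job": ["sales"] * 5 + ["market"] * 5,
    "source": list("ABCDE") * 2,
})

# Sort so that, inside each job, the largest counts come first ...
ordered = df.sort_values(["job", "count"], ascending=[True, False])
# ... then keep the first three rows of every job group.
top3 = ordered.groupby("job").head(3)
print(top3)
```

Note that this does not aggregate anything; it only works as is when each (job, source) pair appears once.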

edited Jul 5, 2017 at 10:54

answered Mar 18, 2016 at 1:20

tvashtar


7 Comments

toto_tico Over a year ago

Does groupby guarantee that the order is preserved?

toto_tico Over a year ago

It seems it does; from the documentation of groupby: "groupby preserves the order of rows within each group".

brian_ds Over a year ago

@toto_tico That is correct, but care is needed in interpreting that statement. The order of rows within a single group is preserved; however, groupby has sort=True by default, which means the groups themselves may be sorted by key. In other words, if my dataframe has keys (on input) 3 2 2 1, ... the groupby object will show the 3 groups in the order 1 2 3 (sorted). Use sort=False to make sure both group order and row order are preserved.
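
A small sketch of the point made in this comment, using a made-up four-row frame (the column names `key` and `val` are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"key": [3, 2, 2, 1], "val": [10, 20, 30, 40]})

# Default sort=True: groups come back ordered by key.
sorted_keys = df.groupby("key")["val"].sum().index.tolist()
# sort=False: groups come back in order of first appearance.
unsorted_keys = df.groupby("key", sort=False)["val"].sum().index.tolist()

# Either way, rows inside a group keep their original relative order.
rows_in_group_2 = df.groupby("key").get_group(2)["val"].tolist()

print(sorted_keys)      # [1, 2, 3]
print(unsorted_keys)    # [3, 2, 1]
print(rows_in_group_2)  # [20, 30]
```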

Nabin Over a year ago

head(3) gives more than 3 results?

Zvi Over a year ago

I don't understand why this got most of the votes, while it did not take care of the sum() of the 'count'. If one adds an extra line with the values ('sales', 'A', 6) one can see that this solution does not add the 2 + 6 of ('sales', 'A') which is 8 and should be the first line of the result.
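
To illustrate this comment, here is one possible sketch that sums first and only then sorts and takes the top three. It adds the extra ('sales', 'A', 6) row mentioned above; the names `totals` and `top3` are illustrative, and `as_index=False` is just one way to keep `job` and `source` as ordinary columns.

```python
import pandas as pd

# The question's data plus the extra ('sales', 'A', 6) row from the comment.
df = pd.DataFrame({
    "count": [2, 4, 6, 3, 7, 5, 3, 2, 4, 1, 6],
    "job": ["sales"] * 5 + ["market"] * 5 + ["sales"],
    "source": list("ABCDE") * 2 + ["A"],
})

# Step 1: aggregate, so ('sales', 'A') becomes 2 + 6 = 8.
totals = df.groupby(["job", "source"], as_index=False)["count"].sum()

# Step 2: order by count within each job, then keep three rows per job.
top3 = (totals.sort_values(["job", "count"], ascending=[True, False])
              .groupby("job")
              .head(3))
print(top3)
```

With the extra row, ('sales', 'A') now leads the sales group with a total of 8.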


joris · Accepted Answer · 2020-10-28 07:22:29Z

236

What you want to do is actually again a groupby (on the result of the first groupby): sort and take the first three elements per group.

Starting from the result of the first groupby:

In [60]: df_agg = df.groupby(['job','source']).agg({'count':sum})

We group by the first level of the index:

In [63]: g = df_agg['count'].groupby('job', group_keys=False)

Then we want to sort ('order') each group and take the first three elements:

In [64]: res = g.apply(lambda x: x.sort_values(ascending=False).head(3))

There is, however, a shortcut for this, nlargest:

In [65]: g.nlargest(3)
Out[65]:
job     source
market  A         5
        D         4
        B         3
sales   E         7
        C         6
        B         4
dtype: int64

So in one go, this looks like:

df_agg['count'].groupby('job', group_keys=False).nlargest(3)
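
Put together as a runnable script, the steps above might look like the sketch below. It rebuilds the question's data, passes the aggregation as the string "sum", and groups with `level="job"` to make explicit that the grouping is on an index level; those choices are assumptions of this sketch rather than part of the answer.

```python
import pandas as pd

df = pd.DataFrame({
    "count": [2, 4, 6, 3, 7, 5, 3, 2, 4, 1],
    "job": ["sales"] * 5 + ["market"] * 5,
    "source": list("ABCDE") * 2,
})

# First groupby: one summed count per (job, source) pair.
df_agg = df.groupby(["job", "source"]).agg({"count": "sum"})

# Second groupby: on the 'job' index level, keeping the 3 largest per job.
# group_keys=False avoids prepending 'job' to the index a second time.
top3 = df_agg["count"].groupby(level="job", group_keys=False).nlargest(3)
print(top3)
```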

edited Oct 28, 2020 at 7:22

answered Jan 8, 2015 at 15:46

joris


7 Comments

JoeDanger Over a year ago

Would there be a way to sum up everything that isn't contained in the top three results per group and add them to a source group called "other" for each job?

Bowen Liu Over a year ago

Thanks for the great answer. For a further step, would there be a way to assign the sorting order based on values in the groupby column? For instance, sort ascending if the value is 'Buy' and sort descending if the value is 'Sell'.

mcp Over a year ago

It might be easier to just use as_index=False to create a normal data frame and then sort as normal.

joris Over a year ago

@young_souvlaki you still need a groupby operation to take only the first 3 per group, that's not possible with a normal sort

mcp Over a year ago

@joris as_index is a groupby parameter. Are we on the same page?


Surya Chhetri · Accepted Answer · 2017-06-11 23:28:16Z

Here's another example of taking the top 3 in sorted order, and of sorting within the groups:

In [43]: import pandas as pd                                                                                                                                                       

In [44]: df = pd.DataFrame({"name":["Foo", "Foo", "Baar", "Foo", "Baar", "Foo", "Baar", "Baar"], "count_1":[5,10,12,15,20,25,30,35], "count_2" :[100,150,100,25,250,300,400,500]})

In [45]: df                                                                                                                                                                        
Out[45]: 
   count_1  count_2  name
0        5      100   Foo
1       10      150   Foo
2       12      100  Baar
3       15       25   Foo
4       20      250  Baar
5       25      300   Foo
6       30      400  Baar
7       35      500  Baar


### Top 3 on sorted order:
In [46]: df.groupby(["name"])["count_1"].nlargest(3)                                                                                                                               
Out[46]: 
name   
Baar  7    35
      6    30
      4    20
Foo   5    25
      3    15
      1    10
dtype: int64


### Sorting within groups based on column "count_1":
In [48]: df.groupby(["name"]).apply(lambda x: x.sort_values(["count_1"], ascending = False)).reset_index(drop=True)
Out[48]: 
   count_1  count_2  name
0       35      500  Baar
1       30      400  Baar
2       20      250  Baar
3       12      100  Baar
4       25      300   Foo
5       15       25   Foo
6       10      150   Foo
7        5      100   Foo
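
As a possible alternative to the apply call above, the same within-group ordering can be sketched with a single two-key sort and no groupby at all. This assumes the groups should appear in alphabetical order of `name`, as they do in the output above; `sorted_within` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Foo", "Foo", "Baar", "Foo", "Baar", "Foo", "Baar", "Baar"],
    "count_1": [5, 10, 12, 15, 20, 25, 30, 35],
    "count_2": [100, 150, 100, 25, 250, 300, 400, 500],
})

# Sort by group key ascending, then by count_1 descending inside each group.
sorted_within = (df.sort_values(["name", "count_1"], ascending=[True, False])
                   .reset_index(drop=True))
print(sorted_within)
```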

Kaveh · Accepted Answer · 2021-08-02 21:21:56Z

34

Try this instead; it is a simple way to group, sum, and sort in descending order:

df.groupby(['companyName'])['overallRating'].sum().sort_values(ascending=False).head(20)

edited Aug 2, 2021 at 21:21

Kaveh


answered Mar 6, 2020 at 9:54

sscswapnil


Ted Petrou · Accepted Answer · 2017-11-04 16:17:01Z

18

If you don't need to sum a column, then use @tvashtar's answer. If you do need to sum, then you can use @joris' answer or this one which is very similar to it.

df.groupby(['job']).apply(lambda x: (x.groupby('source')
                                      .sum()
                                      .sort_values('count', ascending=False))
                                     .head(3))

answered Nov 4, 2017 at 16:17

Ted Petrou


smci · Accepted Answer · 2022-06-16 00:18:50Z

3

When the grouped dataframe has more than one grouping column (a MultiIndex), the other methods drop the remaining columns:

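import pandas as pd

edf = pd.DataFrame({"job":["sales", "sales", "sales", "sales", "sales",
                           "market", "market", "market", "market", "market"],
                    "source":["A", "B", "C", "D", "E", "A", "B", "C", "D", "E"],
                    "count":[2, 4, 6, 3, 7, 5, 3, 2, 4, 1],
                    "other_col":[1, 2, 3, 4, 56, 6, 3, 4, 6, 11]})

# Aggregate per (job, source); string names avoid the warning recent pandas
# versions emit when builtins such as sum or np.mean are passed to agg.
gdf = edf.groupby(["job", "source"]).agg({"count": "sum", "other_col": "mean"})
# Sort each job's rows by count (descending) while keeping every column.
gdf.groupby(level=0, group_keys=False).apply(lambda g: g.sort_values("count", ascending=False))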

This keeps other_col and orders the rows by the count column within each group.
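
A variant without apply, offered as a sketch rather than a replacement: `sort_values` accepts index level names alongside column labels in `by`, so the aggregated frame can be ordered by the `job` level and the `count` column in one call. The optional top-three step and the names `ordered` and `top3` are additions of this sketch.

```python
import pandas as pd

edf = pd.DataFrame({
    "job": ["sales"] * 5 + ["market"] * 5,
    "source": list("ABCDE") * 2,
    "count": [2, 4, 6, 3, 7, 5, 3, 2, 4, 1],
    "other_col": [1, 2, 3, 4, 56, 6, 3, 4, 6, 11],
})

gdf = edf.groupby(["job", "source"]).agg({"count": "sum", "other_col": "mean"})

# 'job' is an index level and 'count' is a column; both are valid sort keys.
ordered = gdf.sort_values(["job", "count"], ascending=[True, False])

# Optionally keep only the three largest counts per job; other_col survives.
top3 = ordered.groupby(level="job").head(3)
print(top3)
```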

edited Jun 16, 2022 at 0:18

smci


answered Oct 6, 2021 at 3:12

haneulkim


1 Comment

dspractician Over a year ago

Is there a way to keep the count column as it is, without summing it?

parvaneh shayegh · Accepted Answer · 2021-05-04 12:10:28Z

1

I was getting this error without using "by":

TypeError: sort_values() missing 1 required positional argument: 'by'

So, I changed it to this and now it's working:

df.groupby(['job','source']).agg({'count':sum}).sort_values(by='count',ascending=False).head(20)

answered May 4, 2021 at 12:10

parvaneh shayegh


Gourav Parmar · Accepted Answer · 2021-05-13 12:53:33Z

1

@joris's answer helped a lot. This is what worked for me:

df.groupby(['job'])['count'].nlargest(3)

answered May 13, 2021 at 12:53

Gourav Parmar


pulkit khandelwal · Accepted Answer · 2020-09-27 18:58:00Z

0

You can do it in a single statement:

df.groupby(['job']).apply(lambda x: x.sort_values(['count'], ascending=False).head(3)
                                     .drop('job', axis=1))

apply() takes each group of the groupby in turn and passes it to the lambda function as x.
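
To make that description concrete, here is a rough sketch that spells the per-group processing out as an explicit loop over the question's data. It is an illustration of the idea, not how pandas implements apply; the names `pieces` and `result` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "count": [2, 4, 6, 3, 7, 5, 3, 2, 4, 1],
    "job": ["sales"] * 5 + ["market"] * 5,
    "source": list("ABCDE") * 2,
})

pieces = {}
# Iterating a groupby yields (key, sub-dataframe) pairs, one per group.
for job, group in df.groupby("job"):
    # 'group' plays the role of x in the lambda: sort it and keep 3 rows.
    pieces[job] = group.sort_values("count", ascending=False).head(3)

# Stitch the per-group results back together, keyed by job.
result = pd.concat(pieces)
print(result)
```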

answered Sep 27, 2020 at 18:58

pulkit khandelwal
