Select top n columns based on another column

Question

I have a dataframe like the following:

I would like to obtain a pandas dataframe that keeps only 2 rows per date: the ones with the highest population. The output should look like this:

I know that pandas offers a method called nlargest: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nlargest.html

but I don't think it can be applied directly to this use case. Is there a workaround?

Thanks so much in advance!

Maybe you can use sort_values (by ['Date', 'Population']) and then groupby (by 'Date')?
– doca, Commented Sep 11, 2020 at 15:14
It's better to paste the data into the post as text rather than as an image. That lets people test on your data and give the right answer. Posting code or data as an image is always bad practice.
– Dev, Commented Sep 11, 2020 at 15:17
@iraciv94, if you like the answer, you can also upvote it ✔😊
– Karn Kumar, Commented Sep 12, 2020 at 16:57

halfer · Accepted Answer · 2023-11-30 00:42:04Z

I have recreated your dataframe below and show a way to get the desired result.

Your Dataframe:

>>> df
        Date country  population
0 2019-12-31       A         100
1 2019-12-31       B          10
2 2019-12-31       C        1000
3 2020-01-01       A         200
4 2020-01-01       B          20
5 2020-01-01       C        3500
6 2020-01-01       D          12
7 2020-02-01       D        2000
8 2020-02-01       E          54
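
For anyone who wants to reproduce this, the frame above can be built roughly as follows. This is a minimal sketch that assumes the Date column holds datetime values (the printed output is consistent with that, but the original data was only posted as an image):

```python
import pandas as pd

# Rebuild the sample frame shown above.
# Assumption: 'Date' is a datetime column rather than plain strings.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2019-12-31", "2019-12-31", "2019-12-31",
        "2020-01-01", "2020-01-01", "2020-01-01", "2020-01-01",
        "2020-02-01", "2020-02-01",
    ]),
    "country": ["A", "B", "C", "A", "B", "C", "D", "D", "E"],
    "population": [100, 10, 1000, 200, 20, 3500, 12, 2000, 54],
})

print(df)
```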

Your Desired Solution:

You can use the nlargest method together with set_index and groupby.

This is what you will get:

>>> df.set_index('country').groupby('Date')['population'].nlargest(2)
Date        country
2019-12-31  C          1000
            A           100
2020-01-01  C          3500
            A           200
2020-02-01  D          2000
            E            54
Name: population, dtype: int64

To get back a regular DataFrame with the original column layout, reset the index, which gives the following:

>>> df.set_index('country').groupby('Date')['population'].nlargest(2).reset_index()
        Date country  population
0 2019-12-31       C        1000
1 2019-12-31       A         100
2 2020-01-01       C        3500
3 2020-01-01       A         200
4 2020-02-01       D        2000
5 2020-02-01       E          54

Another way:

Use groupby with apply, then call reset_index with the parameters drop=True and level=... to discard the group index:

>>> df.groupby('Date').apply(lambda p: p.nlargest(2, columns='population')).reset_index(level=[0,1], drop=True)
  # df.groupby('Date').apply(lambda p: p.nlargest(2, columns='population')).reset_index(level=['Date',1], drop=True)
        Date country  population
0 2019-12-31       C        1000
1 2019-12-31       A         100
2 2020-01-01       C        3500
3 2020-01-01       A         200
4 2020-02-01       D        2000
5 2020-02-01       E          54
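
The approach suggested in the comments (sort_values, then groupby) can be sketched as below. This is not part of the original answer, just one possible reading of that suggestion: sort by date and by descending population, then keep the first 2 rows of each date with head(2). It also avoids groupby(...).apply, whose handling of the grouping column has, as far as I know, changed in recent pandas releases (2.2 and later). The variable name top2 is made up for illustration.

```python
import pandas as pd

# Same sample data as in the answer (Date assumed to be datetime).
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2019-12-31", "2019-12-31", "2019-12-31",
        "2020-01-01", "2020-01-01", "2020-01-01", "2020-01-01",
        "2020-02-01", "2020-02-01",
    ]),
    "country": ["A", "B", "C", "A", "B", "C", "D", "D", "E"],
    "population": [100, 10, 1000, 200, 20, 3500, 12, 2000, 54],
})

# Sort so the largest populations come first within each date,
# then take the first 2 rows of every date group.
top2 = (
    df.sort_values(["Date", "population"], ascending=[True, False])
      .groupby("Date")
      .head(2)
      .reset_index(drop=True)  # renumber rows 0..n-1
)

print(top2)
```

Because head(2) simply takes the first two rows of each group, ties in population are broken by the sort order rather than by any explicit rule.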
