I have a dataset like the following:

enter image description here

I would like to obtain a pandas DataFrame that keeps, for each date, only the 2 rows with the highest population. The output should look like this:

enter image description here

I know that pandas offers a method called nlargest: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nlargest.html

but I don't think it is usable for this use case. Is there any workaround?

Thanks so much in advance!

  • Maybe you can sort_values (by ['Date', 'Population']) and use groupby (by 'Date') ? Commented Sep 11, 2020 at 15:14
  • It's better to paste the data as part of the post, not as an image. It helps people test on your data and give the right answer. Posting code or data as an image is always bad practice. Commented Sep 11, 2020 at 15:17
  • @iraciv94, if you like the answer, you can also upvote it ✔😊 Commented Sep 12, 2020 at 16:57

1 Answer

I have mimicked your DataFrame below and shown a way to get the desired result.

Your Dataframe:

>>> df
        Date country  population
0 2019-12-31       A         100
1 2019-12-31       B          10
2 2019-12-31       C        1000
3 2020-01-01       A         200
4 2020-01-01       B          20
5 2020-01-01       C        3500
6 2020-01-01       D          12
7 2020-02-01       D        2000
8 2020-02-01       E          54
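
For anyone who wants to reproduce this, here is a minimal sketch that builds the same frame. It assumes the Date column holds real datetimes (parsed here with pd.to_datetime); the original data was only posted as an image, so the values are the mimicked ones shown above.

```python
import pandas as pd

# Rebuild the mimicked frame shown above; values are copied from the printed output.
df = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2019-12-31"] * 3 + ["2020-01-01"] * 4 + ["2020-02-01"] * 2
        ),
        "country": ["A", "B", "C", "A", "B", "C", "D", "D", "E"],
        "population": [100, 10, 1000, 200, 20, 3500, 12, 2000, 54],
    }
)

print(df)
```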

Your Desired Solution:

You can use the nlargest method together with set_index and groupby.

This is what you will get:

>>> df.set_index('country').groupby('Date')['population'].nlargest(2)
Date        country
2019-12-31  C          1000
            A           100
2020-01-01  C          3500
            A           200
2020-02-01  D          2000
            E            54
Name: population, dtype: int64

Since you want the result back in the original DataFrame layout, reset the index, which gives the following:

>>> df.set_index('country').groupby('Date')['population'].nlargest(2).reset_index()
        Date country  population
0 2019-12-31       C        1000
1 2019-12-31       A         100
2 2020-01-01       C        3500
3 2020-01-01       A         200
4 2020-02-01       D        2000
5 2020-02-01       E          54

Another way around:

With groupby and apply, call nlargest on each group, then use reset_index with the level= and drop=True parameters to discard the group index:

>>> df.groupby('Date').apply(lambda p: p.nlargest(2, columns='population')).reset_index(level=[0,1], drop=True)
  # df.groupby('Date').apply(lambda p: p.nlargest(2, columns='population')).reset_index(level=['Date',1], drop=True)
        Date country  population
0 2019-12-31       C        1000
1 2019-12-31       A         100
2 2020-01-01       C        3500
3 2020-01-01       A         200
4 2020-02-01       D        2000
5 2020-02-01       E          54
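
The suggestion from the comments under the question (sort_values, then groupby) can probably be sketched as below. This is not part of the original answer, just one way to read that comment: sort by Date ascending and population descending, then keep the first 2 rows of each Date group with head(2). The frame is rebuilt here so the snippet runs on its own.

```python
import pandas as pd

# Same mimicked data as in the answer above.
df = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2019-12-31"] * 3 + ["2020-01-01"] * 4 + ["2020-02-01"] * 2
        ),
        "country": ["A", "B", "C", "A", "B", "C", "D", "D", "E"],
        "population": [100, 10, 1000, 200, 20, 3500, 12, 2000, 54],
    }
)

# Sort so the largest populations come first within each date,
# then take the first 2 rows of every date group.
top2 = (
    df.sort_values(["Date", "population"], ascending=[True, False])
    .groupby("Date")
    .head(2)
    .reset_index(drop=True)
)

print(top2)
```

Unlike the set_index variant, this keeps every column of the original frame, which may matter if the real data has more than three columns.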