
I have the following dataframe:

import pandas as pd

df = pd.DataFrame({'place': ['A', 'B', 'C', 'D', 'E', 'F'],
                   'population': [10, 20, 30, 15, 25, 35],
                   'region': ['I', 'II', 'III', 'I', 'II', 'III']})

And it looks like this:

  place  population region
0     A          10      I
1     B          20     II
2     C          30    III
3     D          15      I
4     E          25     II
5     F          35    III

I would like to select the place with the smallest population from the region with the highest population.

df.groupby('region').population.sum()

Returns:

region
I      25
II     45
III    65
Name: population, dtype: int64

But I have no clue how to proceed from here (using .groupby / .loc / .iloc).

Any suggestions?

2 Answers


First add a column for region population:

df['region_pop'] = df.groupby('region')['population'].transform('sum')

Then sort your dataframe and extract the first row:

res = df.sort_values(['region_pop', 'population'], ascending=[False, True])\
        .head(1)

Result:

  place  population region  region_pop
2     C          30    III          65
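To make the `transform` step above more concrete, here is a minimal sketch comparing it with the plain `.sum()` from the question. The variable names `totals` and `per_row` are illustrative only; the data is the sample frame from the question.

```python
import pandas as pd

df = pd.DataFrame({'place': ['A', 'B', 'C', 'D', 'E', 'F'],
                   'population': [10, 20, 30, 15, 25, 35],
                   'region': ['I', 'II', 'III', 'I', 'II', 'III']})

grouped = df.groupby('region')['population']

# .sum() collapses each group: one value per region, indexed by region.
totals = grouped.sum()

# .transform('sum') broadcasts the group total back to every original row,
# so the result shares df's index and can be assigned as a new column.
per_row = grouped.transform('sum')

print(totals)
print(per_row)
```

Because `per_row` is aligned with the original index, it can be stored as a column and used as a sort key alongside `population`.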

5 Comments

I believe this would be faster than my solution. +1 :)
Thanks, nice! Is there a way to do it in one line of code (with method chaining)?
@Rene, Probably, but it'll be an unreadable mess I would have difficulty understanding.
df.assign(region_population = df.groupby('region')['population'].transform('sum')).sort_values(['region_population', 'population'], ascending=[False, True]).iloc[0].place
@Rene, Yep, that would be the one-liner. But don't let it make you believe it's more efficient. You are just moving an explicit series definition to pd.DataFrame.assign.
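For readers who still prefer the chained form discussed in these comments, one possible way to keep it readable is to wrap the chain in parentheses and pass a callable to `assign`. This is only a sketch on the question's sample data; the names `region_pop` and `smallest_place` are illustrative, and it does the same work as the two-step version.

```python
import pandas as pd

df = pd.DataFrame({'place': ['A', 'B', 'C', 'D', 'E', 'F'],
                   'population': [10, 20, 30, 15, 25, 35],
                   'region': ['I', 'II', 'III', 'I', 'II', 'III']})

smallest_place = (
    df
    # The lambda receives the frame being chained, so the expression
    # does not need to refer back to the outer name `df`.
    .assign(region_pop=lambda d: d.groupby('region')['population'].transform('sum'))
    # Largest region total first, then smallest place within it.
    .sort_values(['region_pop', 'population'], ascending=[False, True])
    .iloc[0]['place']
)

print(smallest_place)
```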

First find the region with the highest population. Then restrict the data to that region, group by place, and find the place with the lowest population. (This assumes a place may appear more than once in real data.)

high_reg = df.groupby('region')['population'].sum().reset_index(name='count').sort_values('count').iloc[-1]['region']
df.loc[df['region']==high_reg].groupby('place')['population'].sum().reset_index(name='count').sort_values('count').iloc[0]['place']

Output:

'C'
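The same two steps can probably be written more compactly with `idxmax` and `idxmin`, which return the index label of the largest and smallest value and so avoid the `reset_index` / `sort_values` / `iloc` round trip. This is a sketch on the question's sample data, not the answer's original code; `top_region` and `smallest_place` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'place': ['A', 'B', 'C', 'D', 'E', 'F'],
                   'population': [10, 20, 30, 15, 25, 35],
                   'region': ['I', 'II', 'III', 'I', 'II', 'III']})

# Step 1: label of the region whose total population is largest.
top_region = df.groupby('region')['population'].sum().idxmax()

# Step 2: within that region, total per place (in case a place repeats)
# and take the label of the smallest total.
smallest_place = (
    df.loc[df['region'] == top_region]
      .groupby('place')['population']
      .sum()
      .idxmin()
)

print(top_region, smallest_place)
```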
