select index value from groupby on a pandas dataframe in python

Question

I have the following dataframe:

import pandas as pd

df = pd.DataFrame({'place': ['A', 'B', 'C', 'D', 'E', 'F'],
                   'population': [10, 20, 30, 15, 25, 35],
                   'region': ['I', 'II', 'III', 'I', 'II', 'III']})

And it looks like this:

  place  population region
0     A          10      I
1     B          20     II
2     C          30    III
3     D          15      I
4     E          25     II
5     F          35    III

I would like to select the place with the smallest population from the region with the highest population.

df.groupby('region').population.sum()

Returns:

region
I      25
II     45
III    65
Name: population, dtype: int64

But I have no clue how to proceed from here (using .groupby / .loc / .iloc).

Any suggestions?

jpp · Accepted Answer · 2018-06-20 14:59:11Z

6

First add a column for region population:

df['region_pop'] = df.groupby('region')['population'].transform('sum')

Then sort your dataframe and extract the first row:

res = df.sort_values(['region_pop', 'population'], ascending=[False, True])\
        .head(1)

Result:

  place  population region  region_pop
2     C          30    III          65


5 Comments

harpan Over a year ago

I believe this would be faster than my solution. +1 :)

René Over a year ago

Thanks, nice! Is there a way to do it in one line of code (with method chaining)?

jpp Over a year ago

@Rene, Probably, but it'll be an unreadable mess I would have difficulty understanding.

René Over a year ago

df.assign(region_population = df.groupby('region')['population'].transform('sum')).sort_values(['region_population', 'population'], ascending=[False, True]).iloc[0].place

jpp Over a year ago

@Rene, Yep, that would be the one-liner. But don't let it make you believe it's more efficient. You are just moving an explicit series definition to pd.DataFrame.assign.
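
For readability, the one-liner above can also be wrapped in parentheses and split across lines. This is only a sketch of the same idea on the sample data from the question; the column name `region_pop` and the variable `place` are illustrative choices, and the lambda inside `assign` is one way to avoid referring to `df` twice, not something the comments require.

```python
import pandas as pd

df = pd.DataFrame({
    'place': ['A', 'B', 'C', 'D', 'E', 'F'],
    'population': [10, 20, 30, 15, 25, 35],
    'region': ['I', 'II', 'III', 'I', 'II', 'III'],
})

place = (
    # Add the per-region total as a temporary column; df itself is not modified.
    df.assign(region_pop=lambda d: d.groupby('region')['population'].transform('sum'))
      # Largest region first, then smallest place within that region.
      .sort_values(['region_pop', 'population'], ascending=[False, True])
      # First row of the sorted frame, then its 'place' value.
      .iloc[0]['place']
)

print(place)
```

Because `assign` returns a new frame, `df` keeps its original three columns after this runs.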

harpan · Answer · 2018-06-20 14:57:03Z

1

First find the region with the highest total population. Then take the subset of rows for that region, group it by place, and find the place with the lowest total population. (The groupby on place assumes that a place can appear in more than one row in real data.)

high_reg = df.groupby('region')['population'].sum().reset_index(name='count').sort_values('count').iloc[-1]['region']
df.loc[df['region']==high_reg].groupby('place')['population'].sum().reset_index(name='count').sort_values('count').iloc[0]['place']

Output:

'C'
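
The same two steps can probably be written more compactly with `idxmax` and `idxmin`, which return the index label of the largest and smallest value in a Series. This is a sketch on the sample data from the question, not the answer's own code; the intermediate names (`region_totals`, `top_region`, `place_totals`, `smallest_place`) are made up for illustration. Note that on ties both methods return the first matching label.

```python
import pandas as pd

df = pd.DataFrame({
    'place': ['A', 'B', 'C', 'D', 'E', 'F'],
    'population': [10, 20, 30, 15, 25, 35],
    'region': ['I', 'II', 'III', 'I', 'II', 'III'],
})

# Step 1: total population per region, then the label of the largest total.
region_totals = df.groupby('region')['population'].sum()
top_region = region_totals.idxmax()

# Step 2: within that region, total population per place
# (summing covers the case where a place appears in several rows),
# then the label of the smallest total.
place_totals = df.loc[df['region'] == top_region].groupby('place')['population'].sum()
smallest_place = place_totals.idxmin()

print(top_region, smallest_place)
```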
