I have a pandas DataFrame with the following structure, for example:

id,sex,age,rank,skill
1,M,9,1,A
1,M,8,2,G
1,M,10,3,F
2,F,10,3,M
2,F,8,4,W
2,F,6,4,O
3,M,5,1,Q
3,M,4,3,N
3,M,9,4,Y

My desired output after the groupby/apply operation on the DataFrame is:

id,sex,age,rank,skill
1,M,8,1,A
2,F,6,3,M
3,M,4,1,Q

In other words, I want to group by the id field: sex does not change within a group, age takes its min(), rank takes its min(), and skill takes the value that was present at the min() of rank.

I understand that multiple agg functions can be passed to the groupby in a dict, but I do not understand how to handle values that are constant or that depend on the result of a function applied to another field of the groupby.

3 Answers

In your expected output it is the min of rank, but in your explanation you mentioned max.

My answer is based on your expected output:

(df.groupby(['id', 'sex'], as_index=False)
   .agg({'age': 'min', 'rank': 'min'})
   .merge(df.drop(columns='age'), on=['id', 'sex', 'rank'], how='left'))

Out[931]: 
   id sex  age  rank skill
0   1   M    8     1     A
1   2   F    6     3     M
2   3   M    4     1     Q
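
For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the sample data from the question and applies the same groupby/merge. The names csv_text, mins and result are only illustrative, and `drop(columns='age')` assumes a reasonably recent pandas.

```python
import io
import pandas as pd

# Sample data copied from the question
csv_text = """id,sex,age,rank,skill
1,M,9,1,A
1,M,8,2,G
1,M,10,3,F
2,F,10,3,M
2,F,8,4,W
2,F,6,4,O
3,M,5,1,Q
3,M,4,3,N
3,M,9,4,Y
"""
df = pd.read_csv(io.StringIO(csv_text))

# Per-group minimum of age and rank, keeping id and sex as columns
mins = df.groupby(['id', 'sex'], as_index=False).agg({'age': 'min', 'rank': 'min'})

# Pull in the skill from the row whose rank equals the group's minimum rank
result = mins.merge(df.drop(columns='age'), on=['id', 'sex', 'rank'], how='left')
print(result)
```

The merge matches on rank, so skill comes from whichever original row carries the minimum rank of its group.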

2 Comments

This is definitely the best approach. I find kbball's restructured version a little easier to follow.
@Pylander Glad it helped. Step-by-step is always easier to follow; I will take that into consideration in my future posts. Thank you, and have a nice day.

For columns that hold a constant value within each group, you have several options: first, last, etc. For the skill value that corresponds to the minimum rank value, you need to use idxmin. idxmin returns the index label of the row holding the minimum, so skill has to be the index; set it as the index as the first step.

df.set_index('skill').groupby('id').agg({'sex': 'first', 
                                         'age': 'min', 
                                         'rank': ['min', 'idxmin']})
Out: 
     sex age rank       
   first min  min idxmin
id                      
1      M   8    1      A
2      F   6    3      M
3      M   4    1      Q
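
If flat column names are preferred over the two-level header, one possible variation is pandas' named aggregation (available from pandas 0.25, as far as I know). This is only a sketch of the same idxmin idea. The variable name flat is illustrative, and it assumes skill values identify rows well enough for the returned label to be meaningful.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'id':    [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'sex':   list('MMMFFFMMM'),
    'age':   [9, 8, 10, 10, 8, 6, 5, 4, 9],
    'rank':  [1, 2, 3, 3, 4, 4, 1, 3, 4],
    'skill': list('AGFMWOQNY'),
})

flat = (
    df.set_index('skill')            # idxmin will return skill labels
      .groupby('id')
      .agg(sex=('sex', 'first'),     # constant within each group
           age=('age', 'min'),
           rank=('rank', 'min'),
           skill=('rank', 'idxmin'))  # label of the row with the minimum rank
      .reset_index()
)
print(flat)
```

Each keyword names an output column directly, so no MultiIndex renaming is needed afterwards.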

3 Comments

Thanks for explaining how to integrate idxmin. I also corrected the rank value to min(); that was my confusion. One additional question: is it simple to rename the idxmin output to the original field name? I am likely to have many columns that need to be set this way in the real-world example.
@Pylander Since the function is called on the rank column, the name will be associated with that. It is really a pain to rename MultiIndexes, though. I'd construct a flat columns list from scratch myself. Maybe Wen's approach would be more suitable?
Yes, I ended up seeing the limitations of this approach unfortunately. Very clean format though which I like. Wen's answer reformatted by kbball will work best in the end.

+1 for Wen.

Mine has a few more steps but it's the same idea and perhaps easier to read if you're not following:

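import pandas as pd

# Aggregation to apply to each column
func = {'sex': 'min', 'age': 'min', 'rank': 'min'}

# Group by id, then turn id back into a regular column for the merge
df_agg = df.groupby('id').agg(func)
df_agg = df_agg.reset_index()

# Drop the original age (only the minimum in df_agg is needed), then merge
df = df.drop(columns='age')
df = pd.merge(df_agg, df, on=['id', 'sex', 'rank'])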

Set the aggregations you want to apply to each column. Then group by id, using agg. You need to reset the index at this point or else you won't be able to perform the merge in the next step, as id will be treated as the index.

df still stores your original DataFrame. Drop age from df, as you'll only need the minimized age, stored in df_agg. Then perform the merge on the columns you'd expect to match: id, sex and rank. You are merging on rank to pull the correct skill along for the ride.
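
One thing to watch with merging on rank: if two rows in a group share the minimum rank, the merge returns both. The sketch below uses made-up data (tied is hypothetical, not from the question) to show this, along with one possible way to keep a single row per id. Which row to keep is a judgment call; keeping the first is only an assumption here.

```python
import pandas as pd

# Hypothetical data: id 1 has two rows sharing the minimum rank
tied = pd.DataFrame({
    'id':    [1, 1, 2],
    'sex':   ['M', 'M', 'F'],
    'age':   [9, 8, 6],
    'rank':  [1, 1, 3],
    'skill': ['A', 'G', 'M'],
})

agg = tied.groupby('id', as_index=False).agg({'sex': 'min', 'age': 'min', 'rank': 'min'})
merged = pd.merge(agg, tied.drop(columns='age'), on=['id', 'sex', 'rank'])

# id 1 appears twice in merged because both of its rows match rank == 1
deduped = merged.drop_duplicates(subset='id', keep='first')
print(merged)
print(deduped)
```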

1 Comment

I have to give the credit to Wen, but I am using your adapted solution in the end. Thanks!
