Remove duplicates from dataframe, based on two columns A,B, keeping row with max value in another column C

Question

I have a pandas DataFrame that contains duplicate values in two columns (A and B):

I want to remove duplicates keeping the row with max value in column C. This would lead to:

I cannot figure out how to do that. Should I use drop_duplicates() or something else?

wpercy · Accepted Answer · 2019-01-30 16:33:40Z

137

You can do it using groupby:

c_maxes = df.groupby(['A', 'B']).C.transform('max')
df = df.loc[df.C == c_maxes]

c_maxes is a Series holding the maximum value of C in each group, but with the same length and the same index as df. If you haven't used .transform before, printing c_maxes is a good way to see how it works.
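To make the shape of c_maxes concrete, here is a minimal sketch on made-up data (the sample frame below is an assumption, not the table from the question). It prints the broadcast maxima and the index labels of the rows that survive the filter.

```python
import pandas as pd

# Hypothetical sample data, not the asker's original table.
df = pd.DataFrame({'A': [1, 1, 2, 3, 3],
                   'B': [2, 2, 7, 4, 4],
                   'C': [1, 4, 1, 0, 8]})

# Per-group maximum of C, broadcast back to one value per original row.
c_maxes = df.groupby(['A', 'B'])['C'].transform('max')
print(c_maxes.tolist())       # [4, 4, 1, 8, 8]

# Keep only the rows whose C equals the maximum of their (A, B) group.
result = df.loc[df['C'] == c_maxes]
print(result.index.tolist())  # [1, 2, 4]
```

Note that if several rows in a group share the maximum C, this filter keeps all of them, so exact duplicates of the maximum would still need a drop_duplicates call afterwards.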

Another approach using drop_duplicates would be

df.sort('C').drop_duplicates(subset=['A', 'B'], take_last=True)

I'm not sure which is more efficient, but I would guess the first approach, as it doesn't involve sorting.

EDIT: From pandas 0.18 onward, the second solution would be

df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')

or, alternatively,

df.sort_values('C', ascending=False).drop_duplicates(subset=['A', 'B'])
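As a quick check that the two sort_values variants select the same rows, here is a small sketch on made-up data (the frame and the names keep_last and keep_first are assumptions for illustration).

```python
import pandas as pd

# Hypothetical sample data, not the asker's original table.
df = pd.DataFrame({'A': [1, 1, 2, 3, 3],
                   'B': [2, 2, 7, 4, 4],
                   'C': [1, 4, 1, 0, 8]})

# Ascending sort: the largest C of each (A, B) pair comes last, so keep 'last'.
keep_last = df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')

# Descending sort: the largest C comes first, so the default keep='first' works.
keep_first = df.sort_values('C', ascending=False).drop_duplicates(subset=['A', 'B'])

# The same rows survive either way; only the row order differs.
print(sorted(keep_last.index))   # [1, 2, 4]
print(sorted(keep_first.index))  # [1, 2, 4]
```

Unlike the groupby filter, both variants return exactly one row per (A, B) pair, even when the maximum C is tied within a group.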

In any case, the groupby solution seems to be significantly faster:

%timeit -n 10 df.loc[df.groupby(['A', 'B']).C.transform('max') == df.C]
10 loops, best of 3: 25.7 ms per loop

%timeit -n 10 df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')
10 loops, best of 3: 101 ms per loop

edited Jan 30, 2019 at 16:33

wpercy


answered Aug 19, 2015 at 11:34

JoeCondron


6 Comments

PV8 Over a year ago

don't forget to assign the new dataframe (in this case to df): df.sort_values('C', ascending=False).drop_duplicates(subset=['A', 'B'], inplace=True)

รยקคгรђשค Over a year ago

Adding to @PV8's comment: don't use inplace at all in that expression, as it will not give the expected result. Assignment is still needed because no in-place work is done on df. It's better to do operations explicitly to avoid surprises.

PV8 Over a year ago

don't know what you are talking about, but the inplace command works in that case, check the answer to drop_duplicates stackoverflow.com/questions/23667369/…

display-name unset Over a year ago

take_last=True is not working, use keep='last' instead as per docs: pandas.pydata.org/docs/reference/api/…

JoeCondron Over a year ago

Please see the comments in the EDIT section.
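On the inplace question raised in the comments above, the behaviour is easy to check directly. The sketch below uses made-up data (the frame and the names returned and untouched_len are assumptions) and shows what the chained call returns and whether df changes.

```python
import pandas as pd

# Hypothetical sample data, not the asker's original table.
df = pd.DataFrame({'A': [1, 1, 2, 3, 3],
                   'B': [2, 2, 7, 4, 4],
                   'C': [1, 4, 1, 0, 8]})

# sort_values returns a new DataFrame; drop_duplicates(inplace=True) then
# modifies that temporary object and returns None.
returned = df.sort_values('C', ascending=False).drop_duplicates(
    subset=['A', 'B'], inplace=True)
untouched_len = len(df)
print(returned)       # None
print(untouched_len)  # 5, df itself still has every row

# Plain assignment keeps the deduplicated frame.
df = df.sort_values('C', ascending=False).drop_duplicates(subset=['A', 'B'])
print(len(df))        # 3
```

inplace=True does work when drop_duplicates is called directly on df, but not at the end of a chain that starts with sort_values.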


b10n · Accepted Answer · 2015-08-19 11:39:42Z

16

I think groupby should work.

df.groupby(['A', 'B']).max()['C']

If you need a dataframe back you can chain the reset index call.

df.groupby(['A', 'B']).max()['C'].reset_index()

edited Aug 19, 2015 at 11:39

answered Aug 19, 2015 at 11:17

b10n


1 Comment

JoeCondron Over a year ago

This will just return a Series of the max value of C in each group, indexed by 'A' and 'B'.
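If full rows are needed rather than just the maxima, one possible variation (not part of this answer) is to swap max for idxmax and use the resulting labels to select rows. The sketch below uses made-up data and assumes df has a unique index and no missing values in C.

```python
import pandas as pd

# Hypothetical sample data, not the asker's original table.
df = pd.DataFrame({'A': [1, 1, 2, 3, 3],
                   'B': [2, 2, 7, 4, 4],
                   'C': [1, 4, 1, 0, 8]})

# Index label of the row holding the largest C within each (A, B) group.
idx = df.groupby(['A', 'B'])['C'].idxmax()

# Select those rows from the original frame; one row per group.
result = df.loc[idx]
print(result.index.tolist())  # [1, 2, 4]
print(result['C'].tolist())   # [4, 1, 8]
```

Because idxmax returns a single label per group, ties in C produce one row per group here, not all tied rows.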

AlexT · Accepted Answer · 2017-12-05 13:47:41Z

9

You can do it with drop_duplicates as you wanted

# initialisation
import pandas as pd
d = pd.DataFrame({'A': [1, 1, 2, 3, 3], 'B': [2, 2, 7, 4, 4], 'C': [1, 4, 1, 0, 8]})

d = d.sort_values("C", ascending=False)
d = d.drop_duplicates(["A","B"])

If it's important to keep the original row order:

d = d.sort_index()
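The effect of the final sort_index can be seen by printing the index before and after. This is a small sketch that reuses the sample frame from this answer; the names before and after are assumptions for illustration.

```python
import pandas as pd

d = pd.DataFrame({'A': [1, 1, 2, 3, 3],
                  'B': [2, 2, 7, 4, 4],
                  'C': [1, 4, 1, 0, 8]})

# Sort by C descending, then keep the first row of each (A, B) pair.
d = d.sort_values('C', ascending=False).drop_duplicates(['A', 'B'])
before = d.index.tolist()
print(before)  # [4, 1, 2], rows are ordered by descending C

# Sorting by the original index labels restores the original relative order.
d = d.sort_index()
after = d.index.tolist()
print(after)   # [1, 2, 4]
```

This relies on the index still carrying the original row positions, so it would not work as written if the index had been reset before deduplicating.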

answered Dec 5, 2017 at 13:47

AlexT

