111

I have a pandas DataFrame that contains duplicate values in two columns (A and B):

A B C
1 2 1
1 2 4
2 7 1
3 4 0
3 4 8

I want to remove the duplicates, keeping the row with the maximum value in column C. This would lead to:

A B C
1 2 4
2 7 1
3 4 8

I cannot figure out how to do that. Should I use drop_duplicates() or something else?

3 Answers

137

You can do it using groupby:

c_maxes = df.groupby(['A', 'B']).C.transform('max')
df = df.loc[df.C == c_maxes]

c_maxes is a Series holding the maximum value of C in each group, but with the same length and the same index as df, so every row carries the maximum of its own group. If you haven't used .transform before, printing c_maxes is a good way to see how it works.
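To make this concrete, here is a minimal sketch on the sample data from the question. The frame is rebuilt by hand, and the aggregation is passed as the string 'max', which newer pandas versions prefer over the built-in max:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'A': [1, 1, 2, 3, 3],
                   'B': [2, 2, 7, 4, 4],
                   'C': [1, 4, 1, 0, 8]})

# One value per original row: the max of C within that row's (A, B) group
c_maxes = df.groupby(['A', 'B'])['C'].transform('max')
print(c_maxes.tolist())  # [4, 4, 1, 8, 8]

# Keep only the rows whose C equals their group's maximum
result = df.loc[df['C'] == c_maxes]
print(result)
```

The filtered frame keeps the original index labels (1, 2 and 4 here), so the surviving rows can still be traced back to the input.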

Another approach using drop_duplicates would be (this is the older pandas API; see the EDIT below for current versions)

df.sort('C').drop_duplicates(subset=['A', 'B'], take_last=True)

I'm not sure which is more efficient, but I would guess the first approach, as it doesn't involve sorting.

EDIT: From pandas 0.18 onward, the second solution would be

df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')

or, alternatively,

df.sort_values('C', ascending=False).drop_duplicates(subset=['A', 'B'])

In any case, the groupby solution appears to be significantly faster:

%timeit -n 10 df.loc[df.groupby(['A', 'B']).C.transform('max') == df.C]
10 loops, best of 3: 25.7 ms per loop

%timeit -n 10 df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')
10 loops, best of 3: 101 ms per loop
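
One behavioural difference is worth checking before choosing on speed alone: if several rows in a group share the maximum C, the groupby/transform filter keeps all of them, while drop_duplicates keeps exactly one row per (A, B) pair. A small sketch with made-up data (the tie in C is an assumption added for illustration and is not part of the question's data):

```python
import pandas as pd

# Hypothetical data with a tie: both rows of group (1, 2) have C == 4
df = pd.DataFrame({'A': [1, 1, 2], 'B': [2, 2, 7], 'C': [4, 4, 1]})

# Filter on the group maximum: every row equal to the maximum survives
by_transform = df.loc[df['C'] == df.groupby(['A', 'B'])['C'].transform('max')]

# Sort, then keep the last row of each (A, B) pair: one row per pair
by_sort = df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')

print(len(by_transform))  # 3: both tied rows are kept
print(len(by_sort))       # 2: one row per (A, B) pair
```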

6 Comments

don't forget to assign the new dataframe (in this case to df): df.sort_values('C', ascending=False).drop_duplicates(subset=['A', 'B'], inplace=True)
Adding to @PV8's comment: don't use inplace at all in this expression, as it will not give the expected result. Assignment is still needed because nothing is modified in place. It's better to do operations explicitly to avoid surprises.
don't know what you are talking about, but the inplace command works in that case, check the answer to drop_duplicates stackoverflow.com/questions/23667369/…
take_last=True no longer works; use keep='last' instead, as per the docs: pandas.pydata.org/docs/reference/api/…
Please see the comments in the EDIT section.
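
The disagreement about inplace in these comments can be checked directly. In the sketch below (sample data from the question), sort_values returns a new DataFrame, so inplace=True on the chained drop_duplicates acts on that temporary object rather than on df. The variable names rows_after_inplace and rows_after_assign are only for illustration:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'A': [1, 1, 2, 3, 3],
                   'B': [2, 2, 7, 4, 4],
                   'C': [1, 4, 1, 0, 8]})

# Chained call with inplace=True: the temporary sorted copy is modified,
# not df, and the deduplicated rows are never bound to a name
df.sort_values('C', ascending=False).drop_duplicates(subset=['A', 'B'], inplace=True)
rows_after_inplace = len(df)
print(rows_after_inplace)  # 5: df still contains the duplicates

# Assigning the result is what actually updates df
df = df.sort_values('C', ascending=False).drop_duplicates(subset=['A', 'B'])
rows_after_assign = len(df)
print(rows_after_assign)   # 3
```

Calling drop_duplicates(..., inplace=True) directly on df, without the preceding sort_values, is a different case: there is no temporary copy, so df itself is modified.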
16

I think groupby should work.

df.groupby(['A', 'B']).max()['C']

If you need a DataFrame back, you can chain a reset_index call.

df.groupby(['A', 'B']).max()['C'].reset_index()
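
If the frame has more columns than A, B and C, note that max() is computed column by column, so it does not return intact rows. One possible alternative, which is not part of this answer, is to look up the index label of the row holding each group's maximum with idxmax and then select those rows. The extra column D below is hypothetical, added only to show that whole rows are preserved:

```python
import pandas as pd

# Sample data from the question plus a hypothetical extra column D
df = pd.DataFrame({'A': [1, 1, 2, 3, 3],
                   'B': [2, 2, 7, 4, 4],
                   'C': [1, 4, 1, 0, 8],
                   'D': ['p', 'q', 'r', 's', 't']})

# Index label of the row with the largest C in each (A, B) group
idx = df.groupby(['A', 'B'])['C'].idxmax()

# Select whole rows, so D stays paired with the C value it belongs to
result = df.loc[idx]
print(result)
```

When a group has several rows tied for the maximum, idxmax returns the first of them, so this also yields one row per group.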

1 Comment

This will just return a Series of the max value of C in each group, indexed by 'A' and 'B'.
9

You can do it with drop_duplicates as you wanted

import pandas as pd

# initialisation
d = pd.DataFrame({'A' : [1,1,2,3,3], 'B' : [2,2,7,4,4],  'C' : [1,4,1,0,8]})

d = d.sort_values("C", ascending=False)
d = d.drop_duplicates(["A","B"])

If it's important to keep the original row order:

d = d.sort_index()
