339

I have a dataframe with repeat values in column A. I want to drop duplicates, keeping the row with the highest value in column B.

So this:

A B
1 10
1 20
2 30
2 40
3 10

Should turn into this:

A B
1 20
2 40
3 10

I'm guessing there's probably an easy way to do this—maybe as easy as sorting the DataFrame before dropping duplicates—but I don't know groupby's internal logic well enough to figure it out. Any suggestions?

3 Comments

  • Note that the URL in the question appears EOL. Commented Jan 29, 2017 at 0:18
  • For an idiomatic and performant way, see this solution below. Commented Dec 2, 2017 at 3:59
  • Time has marched on... As of this writing, I believe this solution below is faster (at least in the case where there are lots of duplicates) and also simpler. Commented Aug 21, 2021 at 20:55

15 Answers

395

This takes the last occurrence, though not necessarily the maximum:

In [10]: df.drop_duplicates(subset='A', keep="last")
Out[10]: 
   A   B
1  1  20
3  2  40
4  3  10

You can also do something like:

In [12]: df.groupby('A', group_keys=False).apply(lambda x: x.loc[x.B.idxmax()])
Out[12]: 
   A   B
A       
1  1  20
2  2  40
3  3  10
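The same idea can be sanity-checked on the sample data from the question; this sketch uses idxmax() with .loc directly instead of apply (variable names are mine):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2, 3], 'B': [10, 20, 30, 40, 10]})

# Index label of the row with the largest B within each A-group...
winners = df.groupby('A')['B'].idxmax()

# ...then pull those full rows out
result = df.loc[winners].reset_index(drop=True)
print(result)
#    A   B
# 0  1  20
# 1  2  40
# 2  3  10
```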

9 Comments

Small note: The cols and take_last parameters are deprecated and have been replaced by the subset and keep parameters. pandas.pydata.org/pandas-docs/version/0.17.1/generated/…
as @Jezzamon says, FutureWarning: the take_last=True keyword is deprecated, use keep='last' instead
Is there a reason not to use df.sort_values(by=['B']).drop_duplicates(subset=['A'], keep='last')? I mean this sort_values seems safe to me but I have no idea if it actually is.
This answer is now obsolete. See @Ted Petrou's answer below.
If you want to use this code but with more than one column in the group_by, you can add .reset_index(drop=True): df.groupby(['A','C'], group_keys=False).apply(lambda x: x.loc[x.B.idxmax()]).reset_index(drop=True). This will reset the index, as its default value would be a MultiIndex composed from 'A' and 'C'.
171

The top answer is doing too much work and looks to be very slow for larger data sets. apply is slow and should be avoided if possible. ix is deprecated and should be avoided as well.

df.sort_values('B', ascending=False).drop_duplicates('A').sort_index()

   A   B
1  1  20
3  2  40
4  3  10

Or simply group by all the other columns and take the max of the column you need. df.groupby('A', as_index=False).max()
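Both one-liners can be checked against the question's sample data (a quick sketch; variable names are mine):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2, 3], 'B': [10, 20, 30, 40, 10]})

# Sort so the largest B in each A-group comes first, then keep that first row
deduped = df.sort_values('B', ascending=False).drop_duplicates('A').sort_index()

# Equivalent result via groupby, since B is the only other column
grouped = df.groupby('A', as_index=False).max()

# Both give A = [1, 2, 3] paired with B = [20, 40, 10]
print(deduped)
print(grouped)
```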

3 Comments

This is actually a clever approach. I was wondering if it can be generalized by using some lambda function while dropping. For example, how can I drop only values less than, say, the average of those duplicate values?
This is slower than groupby (because of the initial sort_values(), which is O(n log n) and which groupby avoids). See a 2021 answer.
I'd rather sort ascending (the default) and keep the last record, for the sake of self-explanatory code: df.sort_values('B').drop_duplicates('A', keep='last').sort_index().
61

Simplest solution:

To drop duplicates based on one column:

df = df.drop_duplicates('column_name', keep='last')

To drop duplicates based on multiple columns:

df = df.drop_duplicates(['col_name1','col_name2','col_name3'], keep='last')

5 Comments

My data frame has 10 columns, and I used this code to delete duplicates based on three columns. However, it deleted rows based on the rest of the columns too. Is there any way to delete the duplicates only for the last 4 columns?
But OP wants to keep the highest value in column B. This might work if you sorted first. But then it's basically Ted Petrou's answer.
remember to assign df back to df df = df.drop_duplicates. doing df.drop_duplicates(...) alone won't work
This answer assumes that the columns are sorted, which was not specified in the question.
This solution is wrong. You need to sort your dataframe ascending first for this solution to work reliably.
34

I would sort the dataframe first with column B descending, then drop duplicates for column A, keeping the first:

df = df.sort_values(by='B', ascending=False)
df = df.drop_duplicates(subset='A', keep="first")

without any groupby

Comments

12

Try this:

df.groupby(['A']).max()

3 Comments

D'you know the best idiom to reindex this to look like the original DataFrame? I was trying to figure that out when you ninja'd me. :^)
Neat. What if the dataframe contains more columns (e.g. C, D, E)? Max doesn't seem to work in that case, because we need to specify that B is the only column that needs to be maximized.
@DSM Check the link in the original question. There's some code to reindex the grouped dataframe.
9

I was brought here by a link from a duplicate question.

For just two columns, wouldn't it be simpler to do:

df.groupby('A')['B'].max().reset_index()

And to retain a full row (when there are more columns, which is what the "duplicate question" that brought me here was asking):

df.loc[df.groupby(...)[column].idxmax()]

For example, to retain the full row where 'C' takes its max, for each group of ['A', 'B'], we would do:

out = df.loc[df.groupby(['A', 'B'])['C'].idxmax()]
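As a concrete sketch (the data and column D are made up), note that the columns you neither group by nor maximize survive intact:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 1, 2, 2],
    'B': ['x', 'x', 'y', 'y'],
    'C': [0.2, 0.9, 0.5, 0.1],
    'D': ['p', 'q', 'r', 's'],
})

# One row per ('A', 'B') group: the one where C is largest; D comes along for free
out = df.loc[df.groupby(['A', 'B'])['C'].idxmax()]
print(out)
#    A  B    C  D
# 1  1  x  0.9  q
# 2  2  y  0.5  r
```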

When there are relatively few groups (i.e., lots of duplicates), this is faster than the drop_duplicates() solution (less sorting):

Setup:

import numpy as np
import pandas as pd

n = 1_000_000
df = pd.DataFrame({
    'A': np.random.randint(0, 20, n),
    'B': np.random.randint(0, 20, n),
    'C': np.random.uniform(size=n),
    'D': np.random.choice(list('abcdefghijklmnopqrstuvwxyz'), size=n),
})

(Adding sort_index() to ensure the solutions give identical output):

%timeit df.loc[df.groupby(['A', 'B'])['C'].idxmax()].sort_index()
# 101 ms ± 98.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df.sort_values(['C', 'A', 'B'], ascending=False).drop_duplicates(['A', 'B']).sort_index()
# 667 ms ± 784 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

2 Comments

YMMV. In a small real use case with 1% duplicates, the drop_duplicates() solution proved 13 times faster than the groupby() one.
And it's even faster if you don't sort by 'A' and 'B': df.sort_values(by='C', ascending=False).drop_duplicates(['A', 'B']).sort_index().
5

Easiest way to do this:

# First sort this DF by column A ascending and column B descending.
# Then drop the duplicate values in column A.
# Optional - reset the index to get a clean data frame again.
# I'm going to show it all in one step.

import pandas as pd

d = {'A': [1, 1, 2, 3, 1, 2, 3, 1], 'B': [30, 40, 50, 42, 38, 30, 25, 32]}
df = pd.DataFrame(data=d)
df

    A   B
0   1   30
1   1   40
2   2   50
3   3   42
4   1   38
5   2   30
6   3   25
7   1   32


df = df.sort_values(['A', 'B'], ascending=[True, False]).drop_duplicates(['A']).reset_index(drop=True)

df

    A   B
0   1   40
1   2   50
2   3   42

Comments

4

I think in your case you don't really need a groupby. I would sort your B column in descending order, then drop duplicates at column A; if you want, you can also have a nice and clean new index like that:

df.sort_values('B', ascending=False).drop_duplicates('A').sort_index().reset_index(drop=True)

1 Comment

how is this any different than other posts?
2

You can try this as well

df.drop_duplicates(subset='A', keep='last')

I referred to https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html

Comments

2

Here's a variation I had to solve that's worth sharing: for each unique string in columnA I wanted to find the most common associated string in columnB.

df.groupby('columnA').agg({'columnB': lambda x: x.mode().any()}).reset_index()

The .any() picks one if there's a tie for the mode. (Note that using .any() on a Series of ints returns a boolean rather than picking one of them.)
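A runnable sketch of the same idea on made-up data; note I use .mode().iloc[0] here, which deterministically takes the first mode on ties, as a tamer alternative to the answer's .any():

```python
import pandas as pd

df = pd.DataFrame({
    'columnA': ['a', 'a', 'a', 'b', 'b'],
    'columnB': ['x', 'x', 'y', 'z', 'z'],
})

# Most common columnB value per columnA group (first mode wins on ties)
out = (df.groupby('columnA')
         .agg({'columnB': lambda x: x.mode().iloc[0]})
         .reset_index())
print(out)
#   columnA columnB
# 0       a       x
# 1       b       z
```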

For the original question, the corresponding approach simplifies to

df.groupby('columnA').columnB.agg('max').reset_index()

Comments

0

While already-given posts answer the question, I made a small change by adding the column name on which the max() function is applied, for better code readability.

df.groupby('A', as_index=False)['B'].max()

1 Comment

Please give a little more context to your answers, explaining how they work and why they are superior or complementary to the answers already available for a question. If they do not provide added value, please refrain from posting additional answers on old questions. Finally, please format your code as a code block by indenting it.
0

A very similar method to the selected answer, but sorting the data frame by multiple columns might be an easier way to code it.

Firstly, sort the data frame by both the "A" and "B" columns; ascending=False ensures it is ranked from the highest value to the lowest:

df.sort_values(["A", "B"], ascending=False, inplace=True)

Then, drop duplicates on column "A" and keep only the first item, which is already the one with the highest value:

df.drop_duplicates(subset="A", inplace=True)

Comments

0

In case you end up here and have a dataframe where several columns are equal (and some differ), and you want to keep the original index:

df = (df.sort_values('B', ascending=False)
        .drop_duplicates(list(df.columns.difference(['B'], sort=False)))
        .sort_index())

In the drop_duplicates line you can add the columns that are allowed to differ, so for example:

drop_duplicates(list(df.columns.difference(['B', 'C'], sort=False)))

would mean B and C are not checked for duplicates.
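A small sketch of what columns.difference() contributes here: it builds the duplicate key from every column except the ones listed (the data is made up):

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 1, 2],
    'B': [10, 20, 30],
    'C': ['x', 'y', 'z'],
})

# Key columns = all columns except B and C; here that is just ['A']
key = list(df.columns.difference(['B', 'C'], sort=False))
print(key)  # ['A']

result = (df.sort_values('B', ascending=False)
            .drop_duplicates(key)
            .sort_index())
print(result)
#    A   B  C
# 1  1  20  y
# 2  2  30  z
```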

Comments

-1

This also works:

a = pd.DataFrame({'A': a.groupby('A')['B'].max().index,
                  'B': a.groupby('A')['B'].max().values})

1 Comment

While this code snippet may solve the question, including an explanation really helps to improve the quality of your post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion. Please also try not to crowd your code with explanatory comments, this reduces the readability of both the code and the explanations!
-11

I am not going to give you the whole answer (I don't think you're looking for the parsing and writing to file part anyway), but a pivotal hint should suffice: use python's set() function, and then sorted() or .sort() coupled with .reverse():

>>> a=sorted(set([10,60,30,10,50,20,60,50,60,10,30]))
>>> a
[10, 20, 30, 50, 60]
>>> a.reverse()
>>> a
[60, 50, 30, 20, 10]

2 Comments

Maybe I'm wrong on this, but recasting a pandas DataFrame as a set, then converting it back seems like a very inefficient way to solve this problem. I'm doing log analysis, so I'll be applying this to some very big data sets.
Sorry, I don't know too much about this particular scenario, so it may be that my generic answer will not turn out to be too efficient for your problem.
