339

I have a dataframe with repeat values in column A. I want to drop duplicates, keeping the row with the highest value in column B.

So this:

A B
1 10
1 20
2 30
2 40
3 10

Should turn into this:

A B
1 20
2 40
3 10

I'm guessing there's probably an easy way to do this—maybe as easy as sorting the DataFrame before dropping duplicates—but I don't know groupby's internal logic well enough to figure it out. Any suggestions?

3 Comments

  • Note that the URL in the question appears EOL. Commented Jan 29, 2017 at 0:18
  • For an idiomatic and performant way, see this solution below. Commented Dec 2, 2017 at 3:59
  • Time has marched on... As of this writing, I believe this solution below is faster (at least in the case where there are lots of duplicates) and also simpler. Commented Aug 21, 2021 at 20:55

15 Answers

395

This takes the last occurrence, though not necessarily the maximum:

In [10]: df.drop_duplicates(subset='A', keep="last")
Out[10]: 
   A   B
1  1  20
3  2  40
4  3  10

You can also do something like:

In [12]: df.groupby('A', group_keys=False).apply(lambda x: x.loc[x.B.idxmax()])
Out[12]: 
   A   B
A       
1  1  20
2  2  40
3  3  10
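The same idea can be sanity-checked on the sample data from the question; this sketch uses idxmax() with .loc directly instead of apply (variable names are mine):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2, 3], 'B': [10, 20, 30, 40, 10]})

# Index label of the row with the largest B within each A-group...
winners = df.groupby('A')['B'].idxmax()

# ...then pull those full rows out
result = df.loc[winners].reset_index(drop=True)
print(result)
#    A   B
# 0  1  20
# 1  2  40
# 2  3  10
```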

9 Comments

Small note: The cols and take_last parameters are deprecated and have been replaced by the subset and keep parameters. pandas.pydata.org/pandas-docs/version/0.17.1/generated/…
as @Jezzamon says, FutureWarning: the take_last=True keyword is deprecated, use keep='last' instead
Is there a reason not to use df.sort_values(by=['B']).drop_duplicates(subset=['A'], keep='last')? I mean this sort_values seems safe to me but I have no idea if it actually is.
This answer is now obsolete. See @Ted Petrou's answer below.
If you want to use this code but with more than one column in the group_by, you can add .reset_index(drop=True): df.groupby(['A','C'], group_keys=False).apply(lambda x: x.loc[x.B.idxmax()]).reset_index(drop=True). This will reset the index, as its default value would be a MultiIndex composed from 'A' and 'C'.
171

The top answer is doing too much work and looks to be very slow for larger data sets. apply is slow and should be avoided if possible. ix is deprecated and should be avoided as well.

df.sort_values('B', ascending=False).drop_duplicates('A').sort_index()

   A   B
1  1  20
3  2  40
4  3  10

Or simply group by all the other columns and take the max of the column you need. df.groupby('A', as_index=False).max()
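Both one-liners can be checked against the question's sample data (a quick sketch; variable names are mine):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2, 3], 'B': [10, 20, 30, 40, 10]})

# Sort so the largest B in each A-group comes first, then keep that first row
deduped = df.sort_values('B', ascending=False).drop_duplicates('A').sort_index()

# Equivalent result via groupby, since B is the only other column
grouped = df.groupby('A', as_index=False).max()

# Both give A = [1, 2, 3] paired with B = [20, 40, 10]
print(deduped)
print(grouped)
```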

3 Comments

This is actually a clever approach. I was wondering if it can be generalized by using some lambda function while dropping. For example, how can I drop only values less than, say, the average of those duplicate values?
This is slower than groupby (because of the initial sort_values(), which is O(n log n) and which groupby avoids). See a 2021 answer.
I'd rather sort ascending (the default) and keep the last record, for the sake of self-explanatory code: df.sort_values('B').drop_duplicates('A', keep='last').sort_index().
61

Simplest solution:

To drop duplicates based on one column:

df = df.drop_duplicates('column_name', keep='last')

To drop duplicates based on multiple columns:

df = df.drop_duplicates(['col_name1','col_name2','col_name3'], keep='last')

5 Comments

My data frame has 10 columns, and I used this code to delete duplicates based on three columns. However, it deleted rows based on the rest of the columns too. Is there any way to delete the duplicates only for the last 4 columns?
But OP wants to keep the highest value in column B. This might work if you sorted first. But then it's basically Ted Petrou's answer.
remember to assign df back to df df = df.drop_duplicates. doing df.drop_duplicates(...) alone won't work
This answer assumes that the columns are sorted, which was not specified in the question.
This solution is wrong. You need to sort your dataframe ascending first for this solution to work reliably.
34

I would sort the dataframe first with column B descending, then drop duplicates for column A, keeping the first:

df = df.sort_values(by='B', ascending=False)
df = df.drop_duplicates(subset='A', keep="first")

without any groupby

Comments

12

Try this:

df.groupby(['A']).max()

3 Comments

D'you know the best idiom to reindex this to look like the original DataFrame? I was trying to figure that out when you ninja'd me. :^)
Neat. What if the dataframe contains more columns (e.g. C, D, E)? Max doesn't seem to work in that case, because we need to specify that B is the only column that needs to be maximized.
@DSM Check the link in the original question. There's some code to reindex the grouped dataframe.
9

I was brought here by a link from a duplicate question.

For just two columns, wouldn't it be simpler to do:

df.groupby('A')['B'].max().reset_index()

And to retain a full row (when there are more columns, which is what the "duplicate question" that brought me here was asking):

df.loc[df.groupby(...)[column].idxmax()]

For example, to retain the full row where 'C' takes its max, for each group of ['A', 'B'], we would do:

out = df.loc[df.groupby(['A', 'B'])['C'].idxmax()]
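As a concrete sketch (the data and column D are made up), note that the columns you neither group by nor maximize survive intact:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 1, 2, 2],
    'B': ['x', 'x', 'y', 'y'],
    'C': [0.2, 0.9, 0.5, 0.1],
    'D': ['p', 'q', 'r', 's'],
})

# One row per ('A', 'B') group: the one where C is largest; D comes along for free
out = df.loc[df.groupby(['A', 'B'])['C'].idxmax()]
print(out)
#    A  B    C  D
# 1  1  x  0.9  q
# 2  2  y  0.5  r
```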

When there are relatively few groups (i.e., lots of duplicates), this is faster than the drop_duplicates() solution (less sorting):

Setup:

import numpy as np
import pandas as pd

n = 1_000_000
df = pd.DataFrame({
    'A': np.random.randint(0, 20, n),
    'B': np.random.randint(0, 20, n),
    'C': np.random.uniform(size=n),
    'D': np.random.choice(list('abcdefghijklmnopqrstuvwxyz'), size=n),
})

(Adding sort_index() to ensure the solutions give identical output):

%timeit df.loc[df.groupby(['A', 'B'])['C'].idxmax()].sort_index()
# 101 ms ± 98.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df.sort_values(['C', 'A', 'B'], ascending=False).drop_duplicates(['A', 'B']).sort_index()
# 667 ms ± 784 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

2 Comments

YMMV. In a small real use case with 1% duplicates, the drop_duplicates() solution proved 13 times faster than the groupby() one.
And it's even faster if you don't sort by 'A' and 'B': df.sort_values(by='C', ascending=False).drop_duplicates(['A', 'B']).sort_index().
5

Easiest way to do this:

# First sort this DF by column A ascending and column B descending.
# Then drop the duplicate values in column A.
# Optional - reset the index to get a clean data frame again.
# I'm going to show it all in one step.

import pandas as pd

d = {'A': [1, 1, 2, 3, 1, 2, 3, 1], 'B': [30, 40, 50, 42, 38, 30, 25, 32]}
df = pd.DataFrame(data=d)
df

    A   B
0   1   30
1   1   40
2   2   50
3   3   42
4   1   38
5   2   30
6   3   25
7   1   32


df = df.sort_values(['A', 'B'], ascending=[True, False]).drop_duplicates(['A']).reset_index(drop=True)

df

    A   B
0   1   40
1   2   50
2   3   42

Comments

4

I think in your case you don't really need a groupby. I would sort your B column in descending order, then drop duplicates at column A; if you want, you can also have a nice and clean new index like that:

df.sort_values('B', ascending=False).drop_duplicates('A').sort_index().reset_index(drop=True)

1 Comment

how is this any different than other posts?
2

You can try this as well

df.drop_duplicates(subset='A', keep='last')

I referred to https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html

Comments

2

Here's a variation I had to solve that's worth sharing: for each unique string in columnA I wanted to find the most common associated string in columnB.

df.groupby('columnA').agg({'columnB': lambda x: x.mode().any()}).reset_index()

The .any() picks one if there's a tie for the mode. (Note that using .any() on a Series of ints returns a boolean rather than picking one of them.)
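A runnable sketch of the same idea on made-up data; note I use .mode().iloc[0] here, which deterministically takes the first mode on ties, as a tamer alternative to the answer's .any():

```python
import pandas as pd

df = pd.DataFrame({
    'columnA': ['a', 'a', 'a', 'b', 'b'],
    'columnB': ['x', 'x', 'y', 'z', 'z'],
})

# Most common columnB value per columnA group (first mode wins on ties)
out = (df.groupby('columnA')
         .agg({'columnB': lambda x: x.mode().iloc[0]})
         .reset_index())
print(out)
#   columnA columnB
# 0       a       x
# 1       b       z
```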

For the original question, the corresponding approach simplifies to

df.groupby('columnA').columnB.agg('max').reset_index()

Comments

0

While already-given posts answer the question, I made a small change by adding the column name on which the max() function is applied, for better code readability.

df.groupby('A', as_index=False)['B'].max()

1 Comment

Please give a little more context to your answers, explaining how they work and why they are superior or complementary to the answers already available for a question. If they do not provide added value, please refrain from posting additional answers on old questions. Finally, please format your code as a code block by indenting it.
0

A very similar method to the selected answer, but sorting the data frame by multiple columns might be an easier way to code it.

Firstly, sort the data frame by both the "A" and "B" columns; ascending=False ensures it is ranked from the highest value to the lowest:

df.sort_values(["A", "B"], ascending=False, inplace=True)

Then, drop duplicates on column "A" and keep only the first item, which is already the one with the highest value:

df.drop_duplicates(subset="A", inplace=True)

Comments

0

In case you end up here and have a dataframe where several columns are equal (and some differ), and you want to keep the original index:

df = (df.sort_values('B', ascending=False)
        .drop_duplicates(list(df.columns.difference(['B'], sort=False)))
        .sort_index())

In the drop_duplicates line you can add the columns that are allowed to differ, so for example:

drop_duplicates(list(df.columns.difference(['B', 'C'], sort=False)))

would mean B and C are not checked for duplicates.
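A small sketch of what columns.difference() contributes here: it builds the duplicate key from every column except the ones listed (the data is made up):

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 1, 2],
    'B': [10, 20, 30],
    'C': ['x', 'y', 'z'],
})

# Key columns = all columns except B and C; here that is just ['A']
key = list(df.columns.difference(['B', 'C'], sort=False))
print(key)  # ['A']

result = (df.sort_values('B', ascending=False)
            .drop_duplicates(key)
            .sort_index())
print(result)
#    A   B  C
# 1  1  20  y
# 2  2  30  z
```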

Comments

-1

This also works:

a = pd.DataFrame({'A': a.groupby('A')['B'].max().index,
                  'B': a.groupby('A')['B'].max().values})

1 Comment

While this code snippet may solve the question, including an explanation really helps to improve the quality of your post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion. Please also try not to crowd your code with explanatory comments, this reduces the readability of both the code and the explanations!
-11

I am not going to give you the whole answer (I don't think you're looking for the parsing and writing to file part anyway), but a pivotal hint should suffice: use python's set() function, and then sorted() or .sort() coupled with .reverse():

>>> a=sorted(set([10,60,30,10,50,20,60,50,60,10,30]))
>>> a
[10, 20, 30, 50, 60]
>>> a.reverse()
>>> a
[60, 50, 30, 20, 10]

2 Comments

Maybe I'm wrong on this, but recasting a pandas DataFrame as a set, then converting it back seems like a very inefficient way to solve this problem. I'm doing log analysis, so I'll be applying this to some very big data sets.
Sorry, I don't know too much about this particular scenario, so it may be that my generic answer will not turn out to be too efficient for your problem.
