I often need to perform the following operation, and I suspect pandas has an immediate, efficient solution for it:
I have the following example pandas DataFrame with two columns, Name and Age:
import pandas as pd
data = [['Alex', 10], ['Bob', 12], ['Barbara', 25], ['Bob', 72], ['Clarke', 13], ['Clarke', 13], ['Destiny', 45]]
df = pd.DataFrame(data, columns=['Name', 'Age'], dtype=float)
print(df)
Name Age
0 Alex 10.0
1 Bob 12.0
2 Barbara 25.0
3 Bob 72.0
4 Clarke 13.0
5 Clarke 13.0
6 Destiny 45.0
I would like to remove all rows that do not have a matching value in Name, i.e. keep only the rows whose Name appears more than once. In the example df, there are two Bob rows and two Clarke rows, so the intended output would be:
Name Age
0 Bob 12.0
1 Bob 72.0
2 Clarke 13.0
3 Clarke 13.0
where I'm assuming the index has been reset.
One option would be to keep all unique values of Name in a list and then iterate through the DataFrame to check each row for duplicates, but that would be very inefficient.
Is there a built-in function for this task?
Yes: DataFrame.duplicated. Calling it with subset='Name' and keep=False returns a boolean mask that is True for every row whose Name occurs more than once, and you can use that mask to filter the DataFrame.
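A minimal sketch of that approach on your example data (keep=False is what makes duplicated flag all occurrences of a duplicate, not just the second and later ones):

import pandas as pd

data = [['Alex', 10], ['Bob', 12], ['Barbara', 25], ['Bob', 72], ['Clarke', 13], ['Clarke', 13], ['Destiny', 45]]
df = pd.DataFrame(data, columns=['Name', 'Age'], dtype=float)

# keep=False marks every occurrence of a duplicated Name as True,
# not only the second and later occurrences
mask = df.duplicated(subset='Name', keep=False)
result = df[mask].reset_index(drop=True)
print(result)

which prints:

     Name   Age
0     Bob  12.0
1     Bob  72.0
2  Clarke  13.0
3  Clarke  13.0

An equivalent groupby-based alternative is df[df.groupby('Name')['Name'].transform('size') > 1], but the duplicated mask is the most direct, fully vectorized route.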