Select rows from dataframe with same values on several columns but different value on another

Question

I have a pandas dataframe with four feature columns and one label column. The dataset has a problem: some rows have the same feature values but are labelled differently. I know how to find duplicate rows using

df[df.duplicated(keep=False)]

How do I find rows with duplicate features but conflicting labels, though?

For example, given a dataframe like this:

    a    b    c    label
0   1    1    2     y
1   1    1    2     x
2   1    1    2     x
3   2    2    2     z
4   2    2    2     z

I want output like the following:

a    b    c    label
1    1    2    y
1    1    2    x

You can provide the subset param to duplicated (and drop_duplicates).
– Alex, Commented Apr 9, 2020 at 19:18

Scott Boston · Accepted Answer · 2020-04-09 20:02:44Z

5

IIUC, try this:

df[df.groupby(['a','b','c'])['label'].transform('nunique') > 1]

Output:

   a  b  c label
0  1  1  2     y
1  1  1  2     x
2  1  1  2     x
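
For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the example dataframe from the question and applies the same mask. The variable names (features, mask, conflicts, distinct_conflicts) are just illustrative, and the final drop_duplicates() step is an assumption about what the asker wants, since their desired output lists each conflicting row only once.

```python
import pandas as pd

# Example data copied from the question.
df = pd.DataFrame({
    "a": [1, 1, 1, 2, 2],
    "b": [1, 1, 1, 2, 2],
    "c": [2, 2, 2, 2, 2],
    "label": ["y", "x", "x", "z", "z"],
})

features = ["a", "b", "c"]  # illustrative name for the feature columns

# For every row, count the distinct labels within its feature group.
# A count above 1 means the same features carry conflicting labels.
mask = df.groupby(features)["label"].transform("nunique") > 1
conflicts = df[mask]

# Assumed extra step: collapse identical (features, label) rows so each
# conflicting combination appears once, as in the question's desired output.
distinct_conflicts = conflicts.drop_duplicates()
print(distinct_conflicts)
```

The z rows are dropped because their group has only one distinct label, even though the features repeat.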

Peritract · Answer · 2020-04-09 19:18:29Z

0

You can pass a list of columns to the subset parameter of .duplicated() to only consider those columns when checking for duplicates.

In your case, you would call df.duplicated(subset=["a", "b", "c"], keep=False).
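
As a rough illustration of what that call returns on the question's example data (the names features and dup_mask are only for this sketch): duplicated() compares the feature columns alone, so it also flags the two z rows, whose labels agree.

```python
import pandas as pd

# Example data copied from the question.
df = pd.DataFrame({
    "a": [1, 1, 1, 2, 2],
    "b": [1, 1, 1, 2, 2],
    "c": [2, 2, 2, 2, 2],
    "label": ["y", "x", "x", "z", "z"],
})

features = ["a", "b", "c"]  # illustrative name for the feature columns

# Boolean mask: True for every row whose feature values occur more than once.
dup_mask = df.duplicated(subset=features, keep=False)
print(df[dup_mask])  # all five rows, including the consistently labelled z rows
```

So the mask narrows the data to repeated feature combinations, but a further check on the label column is still needed to isolate the conflicts.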

1 Comment

ddd Over a year ago

I don't need to flag the duplicates; I would like to drop them. But I still end up with lots of records, because many records have unique feature values to begin with. I want to filter those out and keep only the records that have duplicate feature values but different labels. Sort of like a groupby on the features, picking out the groups where nunique of label is greater than 1.
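
The "groupby features and pick the groups with nunique greater than 1" idea in this comment can be sketched quite literally with groupby(...).filter. This is only one possible reading of the comment, using the question's example data and illustrative names; on large frames a per-group lambda tends to be slower than a transform-based mask.

```python
import pandas as pd

# Example data copied from the question.
df = pd.DataFrame({
    "a": [1, 1, 1, 2, 2],
    "b": [1, 1, 1, 2, 2],
    "c": [2, 2, 2, 2, 2],
    "label": ["y", "x", "x", "z", "z"],
})

features = ["a", "b", "c"]  # illustrative name for the feature columns

# Keep only the groups whose label column holds more than one distinct value.
# Groups with unique features or consistent labels are dropped entirely.
conflicts = df.groupby(features).filter(lambda g: g["label"].nunique() > 1)
print(conflicts)
```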
