I have a pandas DataFrame with four feature columns and one label column. The dataset has a problem: some rows have identical feature values but are labelled differently. I know how to find rows duplicated across multiple columns using

df[df.duplicated(keep=False)]

How do I find duplicate features with conflicting labels though?

For example, given a DataFrame like this:

    a    b    c    label
0   1    1    2     y
1   1    1    2     x
2   1    1    2     x
3   2    2    2     z
4   2    2    2     z

I want output like the following:

a    b    c    label
1    1    2    y
1    1    2    x
  • df.drop_duplicates() Commented Apr 9, 2020 at 19:16
  • You can provide the subset param to duplicated (and drop_duplicates) Commented Apr 9, 2020 at 19:18

2 Answers

5

IIUC, try this:

df[df.groupby(['a','b','c'])['label'].transform('nunique') > 1]

Output:

   a  b  c label
0  1  1  2     y
1  1  1  2     x
2  1  1  2     x
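
For reference, here is a minimal self-contained sketch of the same idea, using the sample data from the question. The names feature_cols, mask, conflicts and unique_conflicts are only illustrative. transform('nunique') gives each row the number of distinct labels in its feature group, so comparing against 1 keeps only groups with conflicting labels. The final drop_duplicates() is an optional extra step, assuming you want each (features, label) combination listed once, as in the question's desired output.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "a": [1, 1, 1, 2, 2],
    "b": [1, 1, 1, 2, 2],
    "c": [2, 2, 2, 2, 2],
    "label": ["y", "x", "x", "z", "z"],
})

feature_cols = ["a", "b", "c"]  # illustrative name for the feature columns

# For every row, count the distinct labels within its feature group
mask = df.groupby(feature_cols)["label"].transform("nunique") > 1

# Rows whose feature values appear with more than one label (rows 0, 1, 2 here)
conflicts = df[mask]

# Optional: collapse exact repeats so each conflicting label appears once
unique_conflicts = conflicts.drop_duplicates()

print(conflicts)
print(unique_conflicts)
```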

0

You can pass a list of columns to the subset parameter of .duplicated() to only consider those columns when checking for duplicates.

In your case, you would call df.duplicated(subset=["a", "b", "c"], keep=False).
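
On its own, that call also flags groups such as the two z rows, where the features repeat but the labels agree. One possible way to narrow it down to conflicting labels, sketched here on the question's sample data, is to drop rows that repeat both features and label first, and only then look for repeated features. The names feature_cols, deduped and conflicts are illustrative.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "a": [1, 1, 1, 2, 2],
    "b": [1, 1, 1, 2, 2],
    "c": [2, 2, 2, 2, 2],
    "label": ["y", "x", "x", "z", "z"],
})

feature_cols = ["a", "b", "c"]  # illustrative name for the feature columns

# Step 1: keep one row per distinct (features, label) combination
deduped = df.drop_duplicates(subset=feature_cols + ["label"])

# Step 2: any feature combination that still repeats must carry
# at least two different labels
conflicts = deduped[deduped.duplicated(subset=feature_cols, keep=False)]

print(conflicts)
```

This returns one row per conflicting label rather than every original row.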

1 Comment

I don't just need to find the duplicates; I would like to drop them. But I still end up with lots of records, because many records have unique feature values to begin with. I want to filter those out and end up only with the records that have duplicate feature values but different labels. Something like grouping by the features and picking out the groups where nunique of the label is greater than 1.
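
The "group by features, keep groups with more than one distinct label" idea described in this comment can also be written quite literally with groupby(...).filter. This is only a sketch on the question's sample data; feature_cols and conflicts are illustrative names, and on large frames a Python lambda per group is likely to be slower than a transform-based mask.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "a": [1, 1, 1, 2, 2],
    "b": [1, 1, 1, 2, 2],
    "c": [2, 2, 2, 2, 2],
    "label": ["y", "x", "x", "z", "z"],
})

feature_cols = ["a", "b", "c"]  # illustrative name for the feature columns

# Keep every row belonging to a feature group with more than one distinct label
conflicts = df.groupby(feature_cols).filter(lambda g: g["label"].nunique() > 1)

print(conflicts)
```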
