How to find duplicated rows that have different values in some columns in a Python Pandas DataFrame?

Question

I have a DataFrame in Python like the one below, where some IDs appear more than once:

ID	COL1	COL2	COL3
123	XX	111	ENG
123	abc	111	ENG
444	ccc	2	o
444	ccc	2	o
67	a	89	xx

I need to select rows like those for ID = 123, where the ID is duplicated but one or more columns hold different values. As output I need something like this:

ID	COL1	COL2	COL3
123	XX	111	ENG
123	abc	111	ENG

How can I do that in Python Pandas? My real dataset has many more columns, so I need a solution that works for any number of columns, not only ID, COL1, COL2, COL3 :)

Your question is answered here: stackoverflow.com/questions/67231430/…
– Ari Lupin, Commented Oct 9, 2022 at 19:25
Does this answer your question? In Pandas how do I select rows that have a duplicate in one column but different values in another?
– René, Commented Oct 9, 2022 at 19:27
Ari Lupin, René: the linked questions do not answer my question, because I have many more columns; in those questions only one column can hold different values.
– unbik, Commented Oct 9, 2022 at 19:28

Bushmaster · Accepted Answer · 2022-10-09 20:13:54Z

1

First drop the rows that are duplicated across all columns, then flag the IDs that still appear more than once, and finally select the rows with those IDs.

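# drop rows that are identical in every column
df = df.drop_duplicates()
# flag every row whose ID still occurs more than once
mask = df.duplicated(subset=['ID'], keep=False)
df = df[mask]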
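The two steps above can be tried end to end on the sample data from the question. This is only a minimal sketch: the names deduped and result are illustrative, and it leaves the original df untouched instead of overwriting it.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": [123, 123, 444, 444, 67],
    "COL1": ["XX", "abc", "ccc", "ccc", "a"],
    "COL2": [111, 111, 2, 2, 89],
    "COL3": ["ENG", "ENG", "o", "o", "xx"],
})

# Step 1: collapse rows that are identical in every column.
deduped = df.drop_duplicates()

# Step 2: keep only the rows whose ID still occurs more than once,
# i.e. IDs whose rows differ in at least one other column.
result = deduped[deduped.duplicated(subset=["ID"], keep=False)]
print(result)
```

Because drop_duplicates() compares every column by default, this should work unchanged for wider frames, as long as the key column is named ID.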

edited Oct 9, 2022 at 20:13

answered Oct 9, 2022 at 19:42

Bushmaster


Naveed · Accepted Answer · 2022-10-09 19:32:01Z

0

Here is one way to do it:

# drop the duplicates
df.drop_duplicates(inplace=True)

# groupby ID and filter the ones where group size is greater than 1
df[df.groupby('ID')['ID'].transform('size')>1]

    ID COL1  COL2 COL3
0  123   XX   111  ENG
1  123  abc   111  ENG

Alternatively:

# preserve the original DF and create a secondary DF with non-duplicate rows
df2 = df.drop_duplicates()

# using loc, select the rows in df2 whose ID group has more than one row
df2.loc[df2.groupby('ID')['ID'].transform('size') > 1]
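The title also asks which columns differ, which the row filters above do not report directly. One possible sketch, not part of this answer, counts distinct values per column within each ID using groupby(...).nunique(); the names value_cols, differs and differing_columns are illustrative, and the sample data is copied from the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": [123, 123, 444, 444, 67],
    "COL1": ["XX", "abc", "ccc", "ccc", "a"],
    "COL2": [111, 111, 2, 2, 89],
    "COL3": ["ENG", "ENG", "o", "o", "xx"],
})

# Every column except the key is a candidate for differing values.
value_cols = [c for c in df.columns if c != "ID"]

# Count distinct values per column within each ID (NaN counts as a value).
distinct_counts = df.groupby("ID")[value_cols].nunique(dropna=False)

# True where an ID has more than one distinct value in that column.
differs = distinct_counts.gt(1)

# Map each conflicting ID to the list of columns that differ.
differing_columns = {
    id_: row.index[row.to_numpy()].tolist()
    for id_, row in differs[differs.any(axis=1)].iterrows()
}
print(differing_columns)

# The conflicting rows themselves, with exact duplicates collapsed.
conflicting_rows = df[df["ID"].isin(list(differing_columns))].drop_duplicates()
print(conflicting_rows)
```

For the sample data this reports COL1 as the only differing column for ID 123, and conflicting_rows holds the same two rows as the filters above.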

answered Oct 9, 2022 at 19:32

Naveed


Jason Baker · Accepted Answer · 2022-10-09 19:39:14Z

0

Using .query

df = df.query("ID.eq(123)").drop_duplicates().reset_index(drop=True)
print(df)

    ID COL1  COL2 COL3
0  123   XX   111  ENG
1  123  abc   111  ENG

Or, if you are not also trying to filter on a specific ID:

df = df.drop_duplicates().reset_index(drop=True)
print(df)

    ID COL1  COL2 COL3
0  123   XX   111  ENG
1  123  abc   111  ENG
2  444  ccc     2    o
3   67    a    89   xx

edited Oct 9, 2022 at 19:39

answered Oct 9, 2022 at 19:31

Jason Baker


2 Comments

Bushmaster Over a year ago

This only works when the id field is equal to 123.

Jason Baker Over a year ago

Yeah, it was hard to understand what you wanted exactly; that's why I also gave a non-filtering version.
