
My dataframe has a category column and a subcategory column, and then a column with strings that are sometimes repeated.

My question is: for each category (CAT A), which strings are repeated across subcategories (CAT B)?

CAT A  CAT B  Strings
A1     B1     String1
A1     B1     String2
A1     B1     String3
A1     B2     String4
A1     B2     String5
A1     B2     String1
A2     B1     String1
A2     B1     String2
A2     B1     String3
A2     B2     String4
A2     B2     String5
A2     B2     String6
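For reference, a minimal sketch that reproduces the sample table above as a DataFrame (column names taken directly from the table; the construction itself is an assumption, not part of the original post):

```python
import pandas as pd

# Sample data matching the table above
df = pd.DataFrame({
    'CAT A': ['A1'] * 6 + ['A2'] * 6,
    'CAT B': ['B1', 'B1', 'B1', 'B2', 'B2', 'B2'] * 2,
    'Strings': ['String1', 'String2', 'String3', 'String4', 'String5', 'String1',
                'String1', 'String2', 'String3', 'String4', 'String5', 'String6'],
})
```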

The output I am looking for

A1
Repeated strings in B1 and B2
"String1"

---

A2
Repeated strings in B1 and B2
None

I'm confused about how to group this and compare the groups.

Thanks


1 Answer


You can try duplicated() with keep=False:

m = df.duplicated(subset=['CAT A', 'Strings'], keep=False)
# OR via groupby() + transform()
# m = df.groupby('CAT A')['Strings'].transform(lambda x: x.duplicated(keep=False))

Finally:

out = df.loc[m]

output of out:

  CAT A CAT B  Strings
0    A1    B1  String1
5    A1    B2  String1

If you need a separate column:

df.loc[m, 'duplicated'] = df.loc[m, 'Strings']

output of df:

   CAT A CAT B  Strings duplicated
0     A1    B1  String1    String1
1     A1    B1  String2        NaN
2     A1    B1  String3        NaN
3     A1    B2  String4        NaN
4     A1    B2  String5        NaN
5     A1    B2  String1    String1
6     A2    B1  String1        NaN
7     A2    B1  String2        NaN
8     A2    B1  String3        NaN
9     A2    B2  String4        NaN
10    A2    B2  String5        NaN
11    A2    B2  String6        NaN
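One caveat: the mask above would also flag a string that repeats within a single subcategory. To match the asked-for output exactly (only strings shared across different CAT B values, listed per CAT A), a minimal sketch that first drops duplicates within each (CAT A, CAT B) pair; the variable names here are illustrative, not from the original answer:

```python
import pandas as pd

df = pd.DataFrame({
    'CAT A': ['A1'] * 6 + ['A2'] * 6,
    'CAT B': ['B1', 'B1', 'B1', 'B2', 'B2', 'B2'] * 2,
    'Strings': ['String1', 'String2', 'String3', 'String4', 'String5', 'String1',
                'String1', 'String2', 'String3', 'String4', 'String5', 'String6'],
})

# Drop duplicates within each (CAT A, CAT B) pair first, so a string is only
# flagged when it appears in more than one subcategory of the same category
deduped = df.drop_duplicates(subset=['CAT A', 'CAT B', 'Strings'])
mask = deduped.duplicated(subset=['CAT A', 'Strings'], keep=False)
repeats = deduped.loc[mask].groupby('CAT A')['Strings'].unique()

for cat in df['CAT A'].unique():
    strings = repeats.get(cat, [])
    print(cat, list(strings) if len(strings) else None)
```

This prints String1 for A1 and None for A2, matching the output requested in the question.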

1 Comment

Seems a bit of a long way around when you can just set the subset: -> m = df.duplicated(subset=['CAT A', 'Strings'], keep=False)
