
My dataframe has a category column and a subcategory column, and then a column with strings that are sometimes repeated.

My question is: for each category (CAT A), which strings are repeated across subcategories (CAT B)?

CAT A  CAT B  Strings
A1     B1     String1
A1     B1     String2
A1     B1     String3
A1     B2     String4
A1     B2     String5
A1     B2     String1
A2     B1     String1
A2     B1     String2
A2     B1     String3
A2     B2     String4
A2     B2     String5
A2     B2     String6
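For reference, a minimal sketch that reproduces the sample table above as a DataFrame (column names taken directly from the table; the construction itself is an assumption, not part of the original post):

```python
import pandas as pd

# Sample data matching the table above
df = pd.DataFrame({
    'CAT A': ['A1'] * 6 + ['A2'] * 6,
    'CAT B': ['B1', 'B1', 'B1', 'B2', 'B2', 'B2'] * 2,
    'Strings': ['String1', 'String2', 'String3', 'String4', 'String5', 'String1',
                'String1', 'String2', 'String3', 'String4', 'String5', 'String6'],
})
```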

The output I am looking for

A1
Repeated strings in B1 and B2
"String1"

---

A2
Repeated strings in B1 and B2
None

I'm confused about how to group this and compare the groups.

Thanks


1 Answer


You can try duplicated() with keep=False:

m = df.duplicated(subset=['CAT A', 'Strings'], keep=False)
# OR via groupby() + transform()
# m = df.groupby('CAT A')['Strings'].transform(lambda x: x.duplicated(keep=False))

Finally:

out = df.loc[m]

output of out:

  CAT A CAT B  Strings
0    A1    B1  String1
5    A1    B2  String1

If you need a separate column:

df.loc[m, 'duplicated'] = df.loc[m, 'Strings']

output of df:

   CAT A CAT B  Strings duplicated
0     A1    B1  String1    String1
1     A1    B1  String2        NaN
2     A1    B1  String3        NaN
3     A1    B2  String4        NaN
4     A1    B2  String5        NaN
5     A1    B2  String1    String1
6     A2    B1  String1        NaN
7     A2    B1  String2        NaN
8     A2    B1  String3        NaN
9     A2    B2  String4        NaN
10    A2    B2  String5        NaN
11    A2    B2  String6        NaN
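One caveat: the mask above would also flag a string that repeats within a single subcategory. To match the asked-for output exactly (only strings shared across different CAT B values, listed per CAT A), a minimal sketch that first drops duplicates within each (CAT A, CAT B) pair; the variable names here are illustrative, not from the original answer:

```python
import pandas as pd

df = pd.DataFrame({
    'CAT A': ['A1'] * 6 + ['A2'] * 6,
    'CAT B': ['B1', 'B1', 'B1', 'B2', 'B2', 'B2'] * 2,
    'Strings': ['String1', 'String2', 'String3', 'String4', 'String5', 'String1',
                'String1', 'String2', 'String3', 'String4', 'String5', 'String6'],
})

# Drop duplicates within each (CAT A, CAT B) pair first, so a string is only
# flagged when it appears in more than one subcategory of the same category
deduped = df.drop_duplicates(subset=['CAT A', 'CAT B', 'Strings'])
mask = deduped.duplicated(subset=['CAT A', 'Strings'], keep=False)
repeats = deduped.loc[mask].groupby('CAT A')['Strings'].unique()

for cat in df['CAT A'].unique():
    strings = repeats.get(cat, [])
    print(cat, list(strings) if len(strings) else None)
```

This prints String1 for A1 and None for A2, matching the output requested in the question.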

1 Comment

Seems a bit of a long way around when you can just set the subset: -> m = df.duplicated(subset=['CAT A', 'Strings'], keep=False)
