Identify duplicates based on multiple columns (which may contain multiple values) and return a Boolean flag in Python

Question

This is a fairly simple query, but I didn't find any relevant solution. I have to identify duplicates based on multiple columns. The twist is that a column can hold multiple values in a single row, and these need to be treated as separate values.

Example:

import pandas as pd

dct = {'store':('A','A','A','A','A','B','B','B','C','C','C','C'),
     'station':('aisle','aisle','aisle','window','window','aisle','aisle','aisle','aisle','window','window','window'),
     'produce':('apple','apple','cherry, apple','orange','orange','apple','apple,orange','orange','apple','apple','apple','orange')}

df = pd.DataFrame(dct)


print(df)

  store station produce
0   A   aisle   apple
1   A   aisle   apple
2   A   aisle   cherry, apple
3   A   window  orange
4   A   window  orange
5   B   aisle   apple
6   B   aisle   apple,orange
7   B   aisle   orange
8   C   aisle   apple
9   C   window  apple
10  C   window  apple
11  C   window  orange

Expected Dataframe:

store  station    produce          result
A      aisle      apple            False
A      aisle      apple            False
A      aisle      cherry,apple     False
A      window     orange           True
A      window     orange           True
B      aisle      apple            False
B      aisle      apple,orange     False
B      aisle      orange           False
C      aisle      apple            True        --> not duplicated; 'station' differs
C      window     apple            False   
C      window     apple            False
C      window     orange           True

I have been using df.duplicated(subset=['store','station','produce'], keep=False), but it misses rows that hold multiple values in a single cell. Any idea how this can be tackled?
Optional addition: something like the "keep" parameter, which determines which duplicates (if any) to mark: first/last/False.

The last part is completely optional; appreciated, but not a must :-)
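
To illustrate why the plain call falls short, here is a minimal sketch on the first three rows of the example. The variable name `plain` is only for illustration.

```python
import pandas as pd

# First three rows of the example: all three contain 'apple' for store A, aisle.
df = pd.DataFrame({
    'store': ['A', 'A', 'A'],
    'station': ['aisle', 'aisle', 'aisle'],
    'produce': ['apple', 'apple', 'cherry, apple'],
})

# duplicated() compares each cell as a whole string, so 'cherry, apple'
# is a different value from 'apple' and the third row is not flagged.
plain = df.duplicated(subset=['store', 'station', 'produce'], keep=False)
print(plain.tolist())  # [True, True, False]
```

The third row goes unflagged because the comparison never looks inside the comma-separated string.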

just to be clear, the duplicates are marked False?

– sammywemmy, Commented Sep 7, 2020 at 0:33
Yes, False is good.

– nealkaps, Commented Sep 7, 2020 at 1:36

BENY · Accepted Answer · 2020-09-07 00:50:24Z

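Split produce into lists, explode to one value per row, flag duplicates, then collapse back to the original index:

df['New'] = (
    df.assign(produce=df['produce'].str.split(r',\s*'))  # split on commas, with or without a trailing space
      .explode('produce')
      .duplicated(subset=['store', 'station', 'produce'], keep=False)
      .groupby(level=0).any()  # replaces .any(level=0), which was removed in pandas 2.0
)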

Out[160]: 
0      True
1      True
2      True
3      True
4      True
5      True
6      True
7      True
8     False
9      True
10     True
11    False
dtype: bool
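
For the optional "keep" request in the question, the same steps can be wrapped in a small helper. This is only a sketch: the function name `flag_duplicates` and its parameters are hypothetical, not part of pandas. It relies on `groupby(level=0).any()` for the fold-back step, since the `level` argument of `Series.any` was removed in pandas 2.0.

```python
import pandas as pd


def flag_duplicates(df, multi_col, subset, keep=False, sep=r',\s*'):
    """Hypothetical helper: one Boolean per original row, True when any
    value in `multi_col` is flagged as a duplicate on `subset`."""
    exploded = (
        df.assign(**{multi_col: df[multi_col].str.split(sep)})
          .explode(multi_col)
    )
    # `keep` is passed straight through to DataFrame.duplicated.
    dup = exploded.duplicated(subset=subset, keep=keep)
    # explode() repeats the original index label for each value, so
    # grouping on the index folds the flags back to one per source row.
    return dup.groupby(level=0).any()


df = pd.DataFrame({
    'store': list('AAAAABBBCCCC'),
    'station': ['aisle', 'aisle', 'aisle', 'window', 'window', 'aisle',
                'aisle', 'aisle', 'aisle', 'window', 'window', 'window'],
    'produce': ['apple', 'apple', 'cherry, apple', 'orange', 'orange',
                'apple', 'apple,orange', 'orange', 'apple', 'apple',
                'apple', 'orange'],
})

cols = ['store', 'station', 'produce']
flags_all = flag_duplicates(df, 'produce', cols, keep=False)
flags_first = flag_duplicates(df, 'produce', cols, keep='first')
```

With `keep='first'`, the first occurrence of each (store, station, produce) combination is left unflagged. If duplicates should read False, as the comments under the question suggest, negate the result with `~flags_all`. One caveat of this sketch: a cell that repeats a value within itself (for example 'apple, apple') would flag its own row.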

edited Sep 7, 2020 at 0:50

answered Sep 7, 2020 at 0:36


RichieV · Accepted Answer · 2020-09-06 23:52:53Z

You could use Series.str.split to turn each produce cell into a list of values and DataFrame.explode to give each value its own row, then check for duplicates.

# Split on commas and any space after them, so 'cherry, apple' yields 'apple', not ' apple'
df['produce'] = df['produce'].str.split(r',\s*')
df = df.explode('produce')
df['result'] = df.duplicated(
    subset=['store', 'station', 'produce'],
    keep=False)

Output

   store station produce  result
0      A   aisle   apple    True
1      A   aisle   apple    True
2      A   aisle  cherry   False
2      A   aisle   apple    True
3      A  window  orange    True
4      A  window  orange    True
5      B   aisle   apple    True
6      B   aisle   apple    True
6      B   aisle  orange    True
7      B   aisle  orange    True
8      C   aisle   apple   False
9      C  window   apple    True
10     C  window   apple    True
11     C  window  orange   False
