This seems like a fairly simple problem, but I couldn't find a relevant solution. I have to identify duplicates based on multiple columns. The twist is that one column can hold multiple comma-separated values in a single row, and each of them needs to be treated as a separate value.

Example:

import pandas as pd

dct = {'store':('A','A','A','A','A','B','B','B','C','C','C','C'),
     'station':('aisle','aisle','aisle','window','window','aisle','aisle','aisle','aisle','window','window','window'),
     'produce':('apple','apple','cherry, apple','orange','orange','apple','apple,orange','orange','apple','apple','apple','orange')}

df = pd.DataFrame(dct)


print(df)

  store station produce
0   A   aisle   apple
1   A   aisle   apple
2   A   aisle   cherry, apple
3   A   window  orange
4   A   window  orange
5   B   aisle   apple
6   B   aisle   apple,orange
7   B   aisle   orange
8   C   aisle   apple
9   C   window  apple
10  C   window  apple
11  C   window  orange

Expected Dataframe:

store  station    produce          result
A      aisle      apple            False
A      aisle      apple            False
A      aisle      cherry,apple     False
A      window     orange           True
A      window     orange           True
B      aisle      apple            False
B      aisle      apple,orange     False
B      aisle      orange           False
C      aisle      apple            True        --> not duplicated; 'station' is different
C      window     apple            False   
C      window     apple            False
C      window     orange           True
    

I have been using df.duplicated(subset=['store','station','produce'], keep=False), but it misses the rows that hold multiple values in a single cell. Any idea how this can be tackled?
Optional addition: an option similar to "keep", which determines which duplicates (if any) to mark (first/last/False).

The last part is completely optional: appreciated, but not a must :-)

  • Just to be clear, the duplicates are marked False? Commented Sep 7, 2020 at 0:33
  • Yes, False is good. Commented Sep 7, 2020 at 1:36

2 Answers

Let us do

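# Split on a comma plus optional spaces so 'cherry, apple' and 'apple,orange' both split.
# groupby(level=0).any() replaces any(level=0), which was removed in pandas 2.0.
df['New'] = df.assign(produce=df['produce'].str.split(r',\s*')).\
               explode('produce').\
               duplicated(subset=['store', 'station', 'produce'], keep=False).\
               groupby(level=0).any()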

Out[160]: 
0      True
1      True
2      True
3      True
4      True
5      True
6      True
7      True
8     False
9      True
10     True
11    False
dtype: bool
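
The question also asks for something like the `keep` option. A minimal sketch, assuming the same explode-and-regroup approach as above: `flag_duplicates` is a hypothetical helper name (not part of pandas), and it simply forwards `keep` to `DataFrame.duplicated` before folding the flags back to one per original row. The split pattern `,\s*` is an assumption chosen so that entries with and without a space after the comma are handled alike.

```python
import pandas as pd

df = pd.DataFrame({
    'store': list('AAAAABBBCCCC'),
    'station': ['aisle', 'aisle', 'aisle', 'window', 'window', 'aisle',
                'aisle', 'aisle', 'aisle', 'window', 'window', 'window'],
    'produce': ['apple', 'apple', 'cherry, apple', 'orange', 'orange', 'apple',
                'apple,orange', 'orange', 'apple', 'apple', 'apple', 'orange'],
})


def flag_duplicates(frame, keep=False):
    """Hypothetical helper: True where any produce item of a row is flagged as duplicated."""
    exploded = frame.assign(
        produce=frame['produce'].str.split(r',\s*')
    ).explode('produce')
    flags = exploded.duplicated(subset=['store', 'station', 'produce'], keep=keep)
    # explode() repeats the original index label, so grouping on it
    # yields one flag per source row.
    return flags.groupby(level=0).any()


all_marked = flag_duplicates(df)            # every member of a duplicate group
first_kept = flag_duplicates(df, 'first')   # first occurrence left unmarked
last_kept = flag_duplicates(df, 'last')     # last occurrence left unmarked
```

One design consequence to be aware of: a multi-value row is flagged as soon as any one of its items is flagged. With `keep='first'`, row 6 ('apple,orange') is marked because its apple repeats row 5, even though its orange is a first occurrence. Whether that is the desired meaning of "first" for multi-value rows is a judgment call.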

You could use Series.str.split to turn each produce cell into a list, DataFrame.explode to give each list item its own row, and then check for duplicates.

# Split on a comma plus optional spaces so ' apple' does not keep a leading space.
df['produce'] = df['produce'].str.split(r',\s*')
df = df.explode('produce')
# keep=False flags every member of a duplicated group as True.
df['result'] = df.duplicated(
    subset=['store', 'station', 'produce'],
    keep=False)

Output

   store station produce  result
0      A   aisle   apple    True
1      A   aisle   apple    True
2      A   aisle  cherry   False
2      A   aisle   apple    True
3      A  window  orange    True
4      A  window  orange    True
5      B   aisle   apple    True
6      B   aisle   apple    True
6      B   aisle  orange    True
7      B   aisle  orange    True
8      C   aisle   apple   False
9      C  window   apple    True
10     C  window   apple    True
11     C  window  orange   False
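
The code above overwrites df with the exploded frame, so rows 2 and 6 appear twice and True marks the duplicates. The question asks for one flag per original row, with False marking duplicates (as confirmed in the comments). A minimal sketch of one way to get there, assuming the exploded frame is only used as a temporary and that splitting on `,\s*` (comma plus optional spaces) is acceptable for this data:

```python
import pandas as pd

df = pd.DataFrame({
    'store': list('AAAAABBBCCCC'),
    'station': ['aisle', 'aisle', 'aisle', 'window', 'window', 'aisle',
                'aisle', 'aisle', 'aisle', 'window', 'window', 'window'],
    'produce': ['apple', 'apple', 'cherry, apple', 'orange', 'orange', 'apple',
                'apple,orange', 'orange', 'apple', 'apple', 'apple', 'orange'],
})

# Work on a temporary exploded copy so df keeps its original 12 rows.
exploded = df.assign(produce=df['produce'].str.split(r',\s*')).explode('produce')
is_dup = exploded.duplicated(subset=['store', 'station', 'produce'], keep=False)

result = df.copy()
# Fold the per-item flags back to one per source row (the index label is
# repeated by explode), then invert so that False means "duplicated".
result['result'] = ~is_dup.groupby(level=0).any()
print(result)
```

The produce column is left untouched, so multi-value cells such as 'cherry, apple' stay as single rows in `result`.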
