This is a fairly simple question, but I couldn't find a relevant solution for it. I need to identify duplicates based on multiple columns. The twist is that a column can hold multiple comma-separated values in a single row, and each of them needs to be treated as a separate value.
Example:
import pandas as pd

dct = {'store': ('A','A','A','A','A','B','B','B','C','C','C','C'),
       'station': ('aisle','aisle','aisle','window','window','aisle','aisle','aisle','aisle','window','window','window'),
       'produce': ('apple','apple','cherry, apple','orange','orange','apple','apple,orange','orange','apple','apple','apple','orange')}
df = pd.DataFrame(dct)
print(df)
   store station        produce
0      A   aisle          apple
1      A   aisle          apple
2      A   aisle  cherry, apple
3      A  window         orange
4      A  window         orange
5      B   aisle          apple
6      B   aisle   apple,orange
7      B   aisle         orange
8      C   aisle          apple
9      C  window          apple
10     C  window          apple
11     C  window         orange
Expected DataFrame:
store  station  produce        result
A      aisle    apple          False
A      aisle    apple          False
A      aisle    cherry, apple  False
A      window   orange         True
A      window   orange         True
B      aisle    apple          False
B      aisle    apple,orange   False
B      aisle    orange         False
C      aisle    apple          True   --> not duplicated; 'station' is different
C      window   apple          False
C      window   apple          False
C      window   orange         True
I have been using df.duplicated(subset=['store','station','produce'], keep=False),
but it misses the rows that hold multiple values in a single cell. Any idea how this can be tackled?
Optional addition:
Something similar to the "keep" parameter, which determines which duplicates (if any) to mark: 'first', 'last' or False.
This last part is completely optional: appreciated, but not a must :-)
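To make the gap concrete, here is a minimal sketch of one possible direction. It is not a verified solution. It assumes that True should mean "at least one value in the row repeats within the same store and station", which may not match the expected table above. The helper name flag_duplicates and its parameters are hypothetical. It relies on DataFrame.explode, available in pandas 0.25 and later.

```python
import pandas as pd

df = pd.DataFrame({
    'store': ('A','A','A','A','A','B','B','B','C','C','C','C'),
    'station': ('aisle','aisle','aisle','window','window','aisle',
                'aisle','aisle','aisle','window','window','window'),
    'produce': ('apple','apple','cherry, apple','orange','orange','apple',
                'apple,orange','orange','apple','apple','apple','orange'),
})
cols = ['store', 'station', 'produce']

# Plain duplicated() compares whole strings, so 'cherry, apple' never
# matches 'apple' and rows 2, 5, 6 and 7 are not flagged.
naive = df.duplicated(subset=cols, keep=False)


def flag_duplicates(frame, subset, multi_col, keep=False):
    """Hypothetical helper: flag a row if any of its split values repeats."""
    # Turn each multi-valued cell into a list of stripped values.
    values = frame[multi_col].apply(
        lambda s: [part.strip() for part in s.split(',')]
    )
    # One row per value; explode() repeats the original index label.
    exploded = frame.assign(**{multi_col: values}).explode(multi_col)
    # Flag duplicates on the exploded rows; keep is passed through unchanged.
    flags = exploded.duplicated(subset=subset, keep=keep)
    # Collapse back to one flag per original row.
    return flags.groupby(level=0).any()


df['result'] = flag_duplicates(df, cols, 'produce', keep=False)
print(df)
```

Passing keep='first' or keep='last' applies that rule to the exploded rows before collapsing, so a multi-valued row is flagged if any one of its values counts as a later (or earlier) repeat. Whether that is the intended meaning of "keep" here is an assumption.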