Python pandas check dataframe duplicate value as a condition in loc

Question

I have an Excel file in which I need to check certain conditions and fill in the remarks column for each row that satisfies them. I load the necessary columns into a DataFrame, and here is how it looks:

svc_no   i_status   caller_id   f_status   result      remarks
11111    WO         11111       WO         Not Match   Duplicate svc_no 
22222    WO         22222       WO         Match
11111    WO         n/a         SP         Not Match   Duplicate svc_no

The conditions would be:

The svc_no is duplicated
One of the duplicates has svc_no equal to caller_id
The other has a value of 'n/a' or 'NULL' in caller_id
Result is Not Match

I used .loc and wrote it this way:

df.loc[(df['svc_no'] != 'NULL') & (df['svc_no'] == df['caller_id']) & (df['svc_no'].duplicated()) & (df['i_status'] == 'WO') & (df['f_status'] == 'WO') & (df['result'] == 'Not Match'), 'remarks'] = 'Duplicate svc_no'

This code may be right for the row where the first duplicate appears, but it does not apply to the row where the other duplicate appears.

Question: Is there a way to compare the two duplicate rows and apply the necessary conditions using .loc, or is there a workaround?

jpp · Accepted Answer · 2018-07-03 15:56:31Z

1

It's not clear what you want as your desired output. But you can find all svc_no covered by your criteria using a sequence of Boolean masks:

import pandas as pd

df = pd.DataFrame({'svc_no': [11111, 22222, 11111],
                   'caller_id': [11111, 22222, 'n/a'],
                   'result': ['Not Match', 'Match', 'Not Match']})

counts = df['svc_no'].value_counts()

# Each condition flags every row whose svc_no passes the test in at least one row.
cond1 = df['svc_no'].isin(counts[counts > 1].index)  # svc_no occurs more than once
cond2 = df['svc_no'].isin(df.loc[df['svc_no'] == df['caller_id'], 'svc_no'])  # some row equals caller_id
# Some row has a placeholder caller_id (a plain row-wise isin; no groupby is needed).
cond3 = df['svc_no'].isin(df.loc[df['caller_id'].isin(['n/a', 'NULL']), 'svc_no'])
cond4 = df['svc_no'].isin(df.loc[df['result'] == 'Not Match', 'svc_no'])  # some row is 'Not Match'

df.loc[cond1 & cond2 & cond3 & cond4, 'remarks'] = 'Duplicate svc_no'

print(df)

  caller_id     result  svc_no           remarks
0     11111  Not Match   11111  Duplicate svc_no
1     22222      Match   22222               NaN
2       n/a  Not Match   11111  Duplicate svc_no
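
As a possible alternative (a sketch, not part of the original answer), the group-level tests can also be written with groupby(...).transform('any'), which broadcasts "at least one row in this svc_no group passes" back to every row. The variable names below are made up for illustration, and 'Not Match' is checked per row here rather than per group, which is an assumption about the intended rule:

```python
import pandas as pd

df = pd.DataFrame({'svc_no': [11111, 22222, 11111],
                   'caller_id': [11111, 22222, 'n/a'],
                   'result': ['Not Match', 'Match', 'Not Match']})

# Row-level tests
is_dup = df['svc_no'].duplicated(keep=False)            # every occurrence of a repeated svc_no
matches_caller = df['svc_no'] == df['caller_id']        # svc_no equals caller_id on this row
missing_caller = df['caller_id'].isin(['n/a', 'NULL'])  # placeholder caller_id on this row
not_match = df['result'] == 'Not Match'

# Group-level tests: True on every row of a group if any row of that group passes
group_has_match = matches_caller.groupby(df['svc_no']).transform('any')
group_has_missing = missing_caller.groupby(df['svc_no']).transform('any')

mask = is_dup & group_has_match & group_has_missing & not_match
df.loc[mask, 'remarks'] = 'Duplicate svc_no'

print(df)
```

On the sample data this marks the two rows with svc_no 11111 and leaves the 22222 row as NaN in remarks.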

edited Jul 3, 2018 at 15:56

answered Jul 3, 2018 at 15:50

8 Comments

Ricky Aguilar Over a year ago

My desired output is that, once the conditions are satisfied, the string Duplicate svc_no is written to the remarks column. My problem is that my code applies only to the first duplicate value found.

jpp Over a year ago

@RickyAguilar, OK, I've updated, not sure it's exactly what you want, but it produces your desired output.

Ricky Aguilar Over a year ago

There was a SyntaxError on df.loc[cond1 & cond2 & cond3 & cond4, 'remarks'] = 'Duplicate svc_no'

jpp Over a year ago

@RickyAguilar, Did you copy/paste my code exactly? Works fine for me. Try restarting your session.

jpp Over a year ago

@RickyAguilar, I'll try and have a look a little later!

tobsecret · Accepted Answer · 2018-07-03 15:36:29Z

0

You have to tell duplicated that you want to mark all duplicates. By default (keep='first') it marks every occurrence of a value except the first:

df['svc_no'].duplicated(keep=False)
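
A minimal sketch of the difference, using only the svc_no values from the question (the variable names are illustrative):

```python
import pandas as pd

svc_no = pd.Series([11111, 22222, 11111])

# Default keep='first': the first occurrence is NOT flagged
first_kept = svc_no.duplicated()
print(first_kept.tolist())   # [False, False, True]

# keep=False: every occurrence of a repeated value is flagged
all_marked = svc_no.duplicated(keep=False)
print(all_marked.tolist())   # [True, False, True]
```

Note that keep=False only changes the duplicate test. In the question's .loc expression, the row with caller_id 'n/a' would still be excluded by the row-level check svc_no == caller_id, so that check would need to be relaxed or applied per group as well.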
