How to remove duplicates by comparing two columns' values?

Question

columnA  columnB  columnC
a        0        a
c        1        c|f
b        2        a|b|c

For such a dataframe, I want to change columnC to:

columnA  columnB  columnC
a        0
c        1        f
b        2        a|c

For each element in columnC, I want to check whether it matches the value of columnA in the same row; if it does, drop it from columnC.

I want to write a function like:

df['columnC'] = df[['columnA', 'columnC']].apply(remove_duplicate)


def remove_duplicate(columnA, columnC):
    
    c_values = set(columnC.split('|'))

    if columnA in c_values.copy:
        c_values.remove(columnA)

    new_C = '|'.join(c_values)

    return c_values

But this complains:

TypeError: remove_duplicate() missing 1 required positional argument: 'columnC'
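
The TypeError arises because DataFrame.apply calls the function with a single argument: by default (axis=0) that argument is a whole column, so columnC never receives a value. Below is a minimal sketch of one way to repair the attempt, assuming the sample data above: pass axis=1 so that each row arrives as a Series, and read both values from it. It also returns the joined string rather than the set, and uses a list instead of a set so the original order of the values is kept.

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "columnA": ["a", "c", "b"],
    "columnB": [0, 1, 2],
    "columnC": ["a", "c|f", "a|b|c"],
})


def remove_duplicate(row):
    # With axis=1, apply passes one row at a time as a Series.
    kept = [value for value in row["columnC"].split("|")
            if value != row["columnA"]]
    # Return the joined string, not the intermediate collection.
    return "|".join(kept)


df["columnC"] = df[["columnA", "columnC"]].apply(remove_duplicate, axis=1)
print(df)
```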

Umar.H · Accepted Answer · 2021-02-16 23:28:41Z

1

We can try with explode, map and groupby.agg

s = df['columnC'].str.split('|').explode().to_frame('columnC')
s1 = s.assign(columnA=s.index.map(df['columnA']))

df['columnC'] = s1.loc[s1['columnC'].ne(s1['columnA'])].groupby(level=0)['columnC'].agg('|'.join)

  columnA  columnB columnC
0       a        0     NaN
1       c        1       f
2       b        2     a|c
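
For reference, here is a self-contained sketch of the same steps, assuming the sample data from the question and adding the fillna('') step discussed in the comments so that rows with nothing left hold an empty string instead of NaN.

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "columnA": ["a", "c", "b"],
    "columnB": [0, 1, 2],
    "columnC": ["a", "c|f", "a|b|c"],
})

# One row per element of columnC; the original index is repeated.
s = df["columnC"].str.split("|").explode().to_frame("columnC")
# Look up each element's columnA value through the repeated index.
s1 = s.assign(columnA=s.index.map(df["columnA"]))

# Keep elements that differ from columnA, then rebuild the strings.
kept = s1.loc[s1["columnC"].ne(s1["columnA"])]
df["columnC"] = kept.groupby(level=0)["columnC"].agg("|".join)

# Rows that lost every element are missing from the result, hence NaN.
df["columnC"] = df["columnC"].fillna("")
print(df)
```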

answered Feb 16, 2021 at 23:28


5 Comments

marlon Over a year ago

Can the NaN be replaced with empty string ''?

Umar.H Over a year ago

Yep: df['columnC'] = df['columnC'].fillna('') @marlon

marlon Over a year ago

I got a KeyError on the df['columnC'] line, complaining about columnA.

Umar.H Over a year ago

Hmm, it might be that your index wasn't int-based like I assumed. Can you post the full stack trace @marlon?

Umar.H Over a year ago

@marlon it looks like you missed the first two lines of code. Make sure you include all the code above, not just the last line.

wwnde · Accepted Answer · 2021-02-16 23:37:54Z

1

# Drop rows whose entire columnC value appears anywhere in columnA
df1 = df[~df['columnC'].isin(df['columnA'])]
# Split columnC into lists
df1 = df1.assign(columnC=df1['columnC'].str.split('|'))
# Remove each row's columnA value from its columnC values with a set difference
df1['columnC'] = df1.apply(lambda x: set(x['columnC']) - {x['columnA']}, axis=1)

answered Feb 16, 2021 at 23:37


4 Comments

marlon Over a year ago

For columnC, the resulting value should still be a string delimited by '|'. How can I achieve that?

wwnde Over a year ago

I thought it didn't matter. Please try df1['columnC']=df1['columnC'].map(list).str.join('|')

marlon Over a year ago

Unfortunately, I still received non-string values, which produce errors in later code.

wwnde Over a year ago

Works for me. What version are you running?
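
One way to make sure columnC always ends up as a '|'-delimited string with the set-based idea is to join inside the row function itself. This is a minimal sketch under assumptions: it uses the question's sample data, keeps every row (it skips the row-dropping step), and sorts the remaining values because sets carry no order. The helper name drop_own_key is hypothetical.

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "columnA": ["a", "c", "b"],
    "columnB": [0, 1, 2],
    "columnC": ["a", "c|f", "a|b|c"],
})


def drop_own_key(row):
    # Set difference removes the row's columnA value from its columnC values.
    remaining = set(row["columnC"].split("|")) - {row["columnA"]}
    # Sets are unordered, so sort to get a deterministic string.
    return "|".join(sorted(remaining))


df["columnC"] = df.apply(drop_own_key, axis=1)
print(df)
```

Note that sorting does not necessarily reproduce the original order of the values; if that order matters, a list-based filter is a better fit than a set.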

sammywemmy · Accepted Answer · 2021-02-16 23:57:00Z

0

A list comprehension could also work:

outcome = ["|".join([ent for ent in right if ent != left])
           for left, right in 
           zip(df.columnA, df.columnC.str.split("|"))
           ]

df.assign(columnC=outcome)

  columnA  columnB columnC
0       a        0
1       c        1       f
2       b        2     a|c

Note that df.assign does not make the result permanent. You can reassign to the original df or just do:

df['columnC'] = outcome
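
The difference between the two forms can be checked with a short sketch, assuming the sample data from the question: df.assign returns a new DataFrame and leaves df untouched, so the result has to be bound to a name. The name new_df is only illustrative.

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "columnA": ["a", "c", "b"],
    "columnB": [0, 1, 2],
    "columnC": ["a", "c|f", "a|b|c"],
})

outcome = ["|".join(ent for ent in right if ent != left)
           for left, right in zip(df["columnA"], df["columnC"].str.split("|"))]

# assign returns a modified copy; the original frame keeps its old values.
new_df = df.assign(columnC=outcome)

print(df["columnC"].tolist())      # still the original strings
print(new_df["columnC"].tolist())  # the filtered strings
```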

edited Feb 16, 2021 at 23:57

answered Feb 16, 2021 at 23:47


3 Comments

marlon Over a year ago

Did you test it? Why does my test still show duplicates exist in two columns?

sammywemmy Over a year ago

Yes, I did test it, using your sample data.

sammywemmy Over a year ago

Note that df.assign does not replace your existing df. You have to write df = df.assign(...) to make the change.

Joe Ferndz · Accepted Answer · 2021-02-17 00:13:08Z

0

You can do a split('|') and then join it back with a single apply statement.

df['columnC'] = df.apply(lambda x: '|'.join(i for i in x.columnC.split('|') if i != x.columnA), axis=1)

This should solve it

The output will be:

Before:

  columnA  columnB columnC
0       a        0       a
1       c        1     c|f
2       b        2   a|b|c

After:

  columnA  columnB columnC
0       a        0        
1       c        1       f
2       b        2     a|c

answered Feb 17, 2021 at 0:13
