columnA  columnB   columnC
a          0         a
c          1        c|f
b          2        a|b|c

For such a dataframe, I want to change the columnC to:

columnA  columnB   columnC
a          0
c          1        f
b          2        a|c

For each element in columnC, I want to check whether it exists in the corresponding columnA; if it does, drop it from columnC.

I want to write a function like:

df['columnC'] = df[['columnA', 'columnC']].apply(remove_duplicate)


def remove_duplicate(columnA, columnC):
    
    c_values = set(columnC.split('|'))

    if columnA in c_values.copy:
        c_values.remove(columnA)

    new_C = '|'.join(c_values)

    return c_values

But this complains:

TypeError: remove_duplicate() missing 1 required positional argument: 'columnC'
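
For context, the error comes from how apply is called: without axis=1, DataFrame.apply passes each column to the function as a single Series, so remove_duplicate only ever receives one argument. Below is a minimal sketch of a row-wise version, assuming columnC always holds '|'-delimited strings. The sample frame is rebuilt from the table above, and a list is used instead of a set so the original order of the pieces is kept.

```python
import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame({
    "columnA": ["a", "c", "b"],
    "columnB": [0, 1, 2],
    "columnC": ["a", "c|f", "a|b|c"],
})


def remove_duplicate(row):
    # With axis=1 the function receives one row (a Series) at a time.
    parts = row["columnC"].split("|")
    # Keep every piece that differs from this row's columnA value.
    kept = [part for part in parts if part != row["columnA"]]
    # Return the joined string, not the intermediate collection.
    return "|".join(kept)


df["columnC"] = df.apply(remove_duplicate, axis=1)
print(df)
```

Two other details in the original function would also need attention: c_values.copy is never called (and the copy is not needed), and the function returns c_values rather than the joined string new_C.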

4 Answers

1

We can try with explode, map and groupby.agg

s = df['columnC'].str.split('|').explode().to_frame('columnC')
s1 = s.assign(columnA=s.index.map(df['columnA']))

df['columnC'] = s1.loc[s1['columnC'].ne(s1['columnA'])].groupby(level=0)['columnC'].agg('|'.join)

  columnA  columnB columnC
0       a        0     NaN
1       c        1       f
2       b        2     a|c

5 Comments

Can the NaN be replaced with empty string ''?
yep df['columnC'].fillna('') @marlon
I got a KeyError on the df['columnC'] line, complaining about columnA.
Hmm, it might be that your index wasn't int-based like I assumed. Can you post the full stack trace? @marlon
@marlon it looks like you missed the first two lines of code. Make sure you include all the code above, not just the last line.
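
Putting the answer and the fillna suggestion from the comments together, a self-contained sketch might look like the following. It assumes the sample data from the question; the intermediate name kept is only illustrative.

```python
import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame({
    "columnA": ["a", "c", "b"],
    "columnB": [0, 1, 2],
    "columnC": ["a", "c|f", "a|b|c"],
})

# One row per '|'-separated piece; the original row label is repeated in the index.
s = df["columnC"].str.split("|").explode().to_frame("columnC")
# Look up each piece's columnA value through that repeated index.
s1 = s.assign(columnA=s.index.map(df["columnA"]))

# Drop pieces equal to columnA, then glue the survivors back together per row.
kept = s1.loc[s1["columnC"].ne(s1["columnA"])]
df["columnC"] = kept.groupby(level=0)["columnC"].agg("|".join)

# Rows where every piece was dropped come back as NaN; fillna returns a new
# Series, so the result has to be assigned back.
df["columnC"] = df["columnC"].fillna("")
print(df)
```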
1
df1 = df[~df['columnC'].isin(df['columnA'])]  # drop cross-column duplicates
df1 = df1.assign(columnC=df1['columnC'].str.split('|'))  # convert columnC to lists
df1['columnC'] = df1.apply(lambda x: set(x['columnC']) - set(x['columnA']), axis=1)  # use sets to eliminate values in columnA from columnC

4 Comments

For columnC, the resulting value should still be a string delimited by '|'. How can I achieve that?
I thought it didn't matter. Please try df1['columnC']=df1['columnC'].map(list).str.join('|')
Unfortunately, I still get non-string values, which produce errors in later code.
Works for me. What version are you running?
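
For what it's worth, a set-based version that returns strings could be sketched as below, assuming the sample data from the question. Three things differ from the code above, and the helper name strip_a_from_c is hypothetical: the isin filter is left out because on the sample it removes row 0 entirely instead of leaving it with an empty string; columnA is wrapped as {value} because set(value) would split a longer string into its characters; and the pieces are sorted before joining, since sets carry no order (so the result is alphabetical, not necessarily the original order).

```python
import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame({
    "columnA": ["a", "c", "b"],
    "columnB": [0, 1, 2],
    "columnC": ["a", "c|f", "a|b|c"],
})


def strip_a_from_c(row):
    # Set difference between the pieces of columnC and the single columnA value.
    remaining = set(row["columnC"].split("|")) - {row["columnA"]}
    # Sort for a deterministic result, then join back into one string.
    return "|".join(sorted(remaining))


df["columnC"] = df.apply(strip_a_from_c, axis=1)
print(df)
```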
0

A list comprehension could also work:

outcome = ["|".join([ent for ent in right if ent != left])
           for left, right in 
           zip(df.columnA, df.columnC.str.split("|"))
           ]

df.assign(columnC=outcome)

  columnA  columnB columnC
0       a        0        
1       c        1       f
2       b        2     a|c

Note that df.assign returns a new DataFrame and leaves df unchanged. You can reassign the result to df, or just do:

df['columnC'] = outcome

3 Comments

Did you test it? Why does my test still show duplicates exist in two columns?
yes i did test it. using your sample data
Note that df.assign does not replace your existing df. You have to write df = df.assign(...) to make the change.
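
The confusion in these comments can be reproduced with a short sketch, assuming the sample data from the question. The name new_df is only illustrative.

```python
import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame({
    "columnA": ["a", "c", "b"],
    "columnB": [0, 1, 2],
    "columnC": ["a", "c|f", "a|b|c"],
})

# For each row, keep the pieces of columnC that differ from columnA.
outcome = ["|".join([ent for ent in right if ent != left])
           for left, right in zip(df.columnA, df.columnC.str.split("|"))]

# assign returns a new DataFrame; df itself still holds the old values.
new_df = df.assign(columnC=outcome)
print(df["columnC"].tolist())      # unchanged original column
print(new_df["columnC"].tolist())  # cleaned column

# Either rebind the name or write the column directly to keep the change.
df = new_df
```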
0

You can do a split('|') and then join the pieces back together in a single apply statement.

df['columnC'] = df.apply(lambda x: '|'.join(i for i in x.columnC.split('|') if i != x.columnA), axis=1)

This should solve it

The output will be:

Before:

  columnA  columnB columnC
0       a        0       a
1       c        1     c|f
2       b        2   a|b|c

After:

  columnA  columnB columnC
0       a        0        
1       c        1       f
2       b        2     a|c
