Remove a word within a string based on another column's value

Question

I have two columns of strings, each holding either a single word or a comma-separated list of words. col1 always contains exactly one word. In this example I use Dog as the word in col1, but it will differ in the real data, so please do not propose a solution that hard-codes a regex on Dog.

import pandas as pd

df = pd.DataFrame({"col1": ["Dog", "Dog", "Dog", "Dog"],
                   "col2": ["Cat, Mouse", "Dog", "Cat", "Dog, Mouse"]})

I want to check if the word in col1 appears in the string in col2, and if it does, I want to remove that word from col2. But keep in mind that I want to keep the rest of the string if there are more words left. So it will go from this:

    col1    col2    
0   Dog     Cat, Mouse
1   Dog     Dog
2   Dog     Cat
3   Dog     Dog, Mouse

To this:

    col1    col2
0   Dog     Cat, Mouse
1   Dog 
2   Dog     Cat
3   Dog     Mouse

IMHO, just go through data iteratively then use string.replace i.e. x.replace('Dog', '')
– ElSheikh, Commented May 28, 2020 at 9:42
Did you read the question? That solution is not reproducible
– torkestativ, Commented May 28, 2020 at 9:45

halfer · Accepted Answer · 2020-07-24 14:24:21Z

3

Try this:

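import re

# For each row, remove the col1 word (plus any trailing commas/whitespace) from col2.
# re.escape keeps regex metacharacters in col1 from being interpreted as a pattern.
df['col2'] = [re.sub(fr"{re.escape(word)}[\s,]*", "", sentence)
              for word, sentence in zip(df.col1, df.col2)]
df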

    col1    col2
0   Dog     Cat, Mouse
1   Dog 
2   Dog     Cat
3   Dog     Mouse

Another DataFrame, with Dog in the middle:

df = pd.DataFrame({"col1": ["Dog", "Dog", "Dog", "Dog", "Dog"],
                   "col2": ["Cat, Mouse", "Dog", "Cat", "Dog, Mouse", "Cat, Dog, Mouse"]})

df


   col1     col2
0   Dog     Cat, Mouse
1   Dog     Dog
2   Dog     Cat
3   Dog     Dog, Mouse
4   Dog     Cat, Dog, Mouse

Applying the code above gives:

   col1     col2
0   Dog     Cat, Mouse
1   Dog 
2   Dog     Cat
3   Dog     Mouse
4   Dog     Cat, Mouse
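
The same per-row idea can be sketched on plain strings with word boundaries, so that the word is also removed cleanly when it is the last item in the list. This is only an illustration: `remove_word` is a hypothetical helper, not part of the answer above, and it assumes the col1 word consists of ordinary word characters so that `\b` behaves as expected.

```python
import re

def remove_word(word, sentence):
    """Remove `word` from a comma-separated string, together with one adjacent comma."""
    w = re.escape(word)
    # Try "word, " first, then ", word", then the bare word on its own.
    pattern = rf"\b{w}\b,\s*|,\s*\b{w}\b|\b{w}\b"
    return re.sub(pattern, "", sentence)

print(remove_word("Dog", "Dog, Mouse"))       # Mouse
print(remove_word("Dog", "Cat, Dog, Mouse"))  # Cat, Mouse
print(remove_word("Dog", "Cat, Dog"))         # Cat
```

The word boundaries also keep longer words such as Dogma untouched. On the DataFrame, the helper could be applied with the same zip-based list comprehension shown above.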

edited Jul 24, 2020 at 14:24 by halfer
answered May 28, 2020 at 9:49 by sammywemmy


sushanth · Accepted Answer · 2020-05-28 10:45:21Z

2

(^,|,$) handles a leading or trailing comma.
(,\s|,) removes the commas left behind by the replace operation.
{1,} matches one or more repetitions, so a run of leftover commas collapses into a single comma.

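# regex=True is passed explicitly: newer pandas versions treat the pattern as a literal by default
df['col2'] = df['col2'] \
    .str.replace("|".join(df['col1'].unique()), "", regex=True).str.strip() \
    .str.replace(r"(?:^,|,$)", "", regex=True) \
    .str.replace(r"(?:,\s|,){1,}", ",", regex=True)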

  col1          col2
0  Dog     Cat,Mouse
1  Dog              
2  Dog           Cat
3  Dog   Mouse,Mouse
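
To see what the cleanup patterns do in isolation, they can be tried on plain strings with the standard `re` module. This is a rough sketch: `tidy_commas` is a hypothetical helper, and the final `strip()` is an extra step assumed here to drop a leftover leading space; it is not part of the chain above.

```python
import re

def tidy_commas(text):
    """Apply the comma-cleanup patterns to one string left over after removing a word."""
    text = text.strip()
    text = re.sub(r"^,|,$", "", text)           # drop a leading or trailing comma
    text = re.sub(r"(?:,\s|,){1,}", ",", text)  # collapse each run of commas into one
    return text.strip()                         # assumption: also trim leftover spaces

print(tidy_commas(", Mouse"))       # Mouse
print(tidy_commas("Cat, , Mouse"))  # Cat,Mouse
print(tidy_commas("Cat, "))         # Cat
```

Note that the last pattern also rewrites an ordinary ", " separator as ",", which is why the output above shows Cat,Mouse without a space.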

edited May 28, 2020 at 10:45
answered May 28, 2020 at 9:54 by sushanth

3 Comments

torkestativ Over a year ago

What if Dog appears in the middle of the string or in the end of the string? This will leave me with some excess commas. Sorry I did not specify this in the question

sushanth Over a year ago

@torkestativ, is col1 going to have single value or multiple values ?

torkestativ Over a year ago

it will have a single value, @Sushanth
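
One way to avoid leftover commas regardless of where the word sits is to split on commas, filter, and re-join, rather than patching the string with regexes. The sketch below is an assumption-laden illustration, not something proposed in the thread: `drop_word` is a hypothetical name, and the rows are plain (col1, col2) tuples standing in for the two DataFrame columns.

```python
def drop_word(word, sentence):
    """Split on commas, drop items equal to `word`, and re-join with ', '."""
    parts = [part.strip() for part in sentence.split(",")]
    return ", ".join(part for part in parts if part and part != word)

# (col1, col2) pairs, including the word in the middle and at the end
rows = [
    ("Dog", "Cat, Mouse"),
    ("Dog", "Dog"),
    ("Dog", "Dog, Mouse"),
    ("Dog", "Cat, Dog, Mouse"),
    ("Dog", "Cat, Dog"),
]
cleaned = [drop_word(word, sentence) for word, sentence in rows]
print(cleaned)  # ['Cat, Mouse', '', 'Mouse', 'Cat, Mouse', 'Cat']
```

Because items are compared whole, a longer word that merely contains the col1 word is left alone, and the separator is normalised to ", " on output.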

wwnde · Accepted Answer · 2020-05-28 11:37:40Z

1

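l = df.col1.tolist()  # list of the words in col1

Wrap each col2 value in a one-element tuple, convert it to a set, and subtract the set of col1 words inside a lambda. The comparison is on the whole cell value, so a cell that exactly equals a col1 word becomes empty (NaN after .str[0]).

df['col2'] = list(zip(df.col2))  # each value becomes a 1-tuple
df['col2'] = df.col2.apply(lambda x: [*{*x} - {*l}]).str[0]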
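
The set-difference idea can also be applied to the individual words of a cell rather than to the whole cell. A minimal sketch, assuming the order of the remaining words does not matter (sets are unordered, so the result is sorted here for determinism); `set_difference_words` is a hypothetical helper name.

```python
def set_difference_words(sentence, words_to_remove):
    """Split a comma-separated string into a set of words and subtract the unwanted ones."""
    tokens = {token.strip() for token in sentence.split(",") if token.strip()}
    remaining = tokens - set(words_to_remove)
    # Sets have no order, so sort to get a reproducible string.
    return ", ".join(sorted(remaining))

print(set_difference_words("Dog, Mouse", ["Dog"]))       # Mouse
print(set_difference_words("Cat, Dog, Mouse", ["Dog"]))  # Cat, Mouse
print(set_difference_words("Dog", ["Dog"]))              # (empty string)
```

If the original word order has to be preserved, a list comprehension that only uses the set for membership checks would be the safer choice.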

edited May 28, 2020 at 11:37
answered May 28, 2020 at 10:44 by wwnde
