I have two columns that contain a combination of comma-separated words and single words, in string format. col1 will always contain only one word. In this example I use the word Dog in col1, but the word will differ in the real data, so please do not propose a solution that uses a regex on Dog specifically.

import pandas as pd

df = pd.DataFrame({"col1": ["Dog", "Dog", "Dog", "Dog"],
                     "col2": ["Cat, Mouse", "Dog", "Cat", "Dog, Mouse"]})

I want to check if the word in col1 appears in the string in col2, and if it does, I want to remove that word from col2. But keep in mind that I want to keep the rest of the string if there are more words left. So it will go from this:

    col1    col2    
0   Dog     Cat, Mouse
1   Dog     Dog
2   Dog     Cat
3   Dog     Dog, Mouse

To this:

    col1    col2
0   Dog     Cat, Mouse
1   Dog 
2   Dog     Cat
3   Dog     Mouse
  • IMHO, just go through the data iteratively, then use string.replace, i.e. x.replace('Dog', ''). Commented May 28, 2020 at 9:42
  • Did you read the question? That solution is not reproducible. Commented May 28, 2020 at 9:45

3 Answers

Try this:

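import re

# Delete each row's col1 word from col2, along with the spaces and commas that follow it
df['col2'] = [re.sub(fr"{re.escape(word)}[\s,]*", "", sentence)
              for word, sentence in zip(df.col1, df.col2)]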
df

    col1    col2
0   Dog     Cat, Mouse
1   Dog 
2   Dog     Cat
3   Dog     Mouse

Another DataFrame, with Dog in the middle:

df = pd.DataFrame({"col1": ["Dog", "Dog", "Dog", "Dog","Dog"],
                     "col2": ["Cat, Mouse", "Dog", "Cat", "Dog, Mouse", "Cat, Dog, Mouse"]})

df


   col1     col2
0   Dog     Cat, Mouse
1   Dog     Dog
2   Dog     Cat
3   Dog     Dog, Mouse
4   Dog     Cat, Dog, Mouse

Applying the code above gives:

   col1     col2
0   Dog     Cat, Mouse
1   Dog 
2   Dog     Cat
3   Dog     Mouse
4   Dog     Cat, Mouse
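
The pattern above deletes the word together with the commas and spaces that follow it, so a word in the last position leaves a trailing ", " behind (for example "Cat, Dog" becomes "Cat, "), and the word is also matched inside longer words. A possible hardening is sketched below, with plain lists standing in for df.col1 and df.col2. remove_word is a hypothetical helper name, not part of pandas or re:

```python
import re


def remove_word(word, sentence):
    """Remove `word` from a comma-separated `sentence` (hypothetical helper)."""
    # re.escape guards against regex metacharacters in the word;
    # \b keeps "Dog" from matching inside a longer word such as "Dogs"
    pattern = rf"\b{re.escape(word)}\b[\s,]*"
    # rstrip drops the separator left behind when the word was the last item
    return re.sub(pattern, "", sentence).rstrip(", ")


# Plain lists stand in for df.col1 and df.col2
col1 = ["Dog"] * 6
col2 = ["Cat, Mouse", "Dog", "Cat", "Dog, Mouse", "Cat, Dog, Mouse", "Cat, Dog"]

result = [remove_word(w, s) for w, s in zip(col1, col2)]
print(result)
# ['Cat, Mouse', '', 'Cat', 'Mouse', 'Cat, Mouse', 'Cat']
```

The same comprehension should work on the DataFrame: df['col2'] = [remove_word(w, s) for w, s in zip(df.col1, df.col2)].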

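(^,|,$) handles a leading or trailing comma.
(,\s|,) matches the commas (with or without a following space) that are left behind by the replace operation.
{1,} matches one or more of them in a row, so a run of leftover commas collapses into a single comma.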

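df['col2'] = (
    df['col2'].str.replace("|".join(df['col1'].unique()), "", regex=True)  # drop every col1 word
    .str.strip()
    .str.replace(r"(?:^,|,$)", "", regex=True)       # drop a leading or trailing comma
    .str.replace(r"(?:,\s|,){1,}", ",", regex=True)  # collapse leftover commas into one
    .str.strip()                                     # drop the space left after a removed leading comma
)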

  col1       col2
0  Dog  Cat,Mouse
1  Dog           
2  Dog        Cat
3  Dog      Mouse

3 Comments

What if Dog appears in the middle or at the end of the string? This will leave me with some excess commas. Sorry, I did not specify this in the question.
@torkestativ, is col1 going to have a single value or multiple values?
It will have a single value, @Sushanth.

l = df.col1.tolist()  # list of col1

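Split each col2 string into a list of words, then keep only the words that are not in l by applying a lambda function.

df['col2'] = df.col2.str.split(', ')  # split each string into a list of words
df['col2'] = df.col2.apply(lambda x: ', '.join(w for w in x if w not in l))  # drop the col1 words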
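
Because l holds every word in col1, the code above compares each row against all of them. If each row should lose only its own col1 word, a row-wise version may be closer to what the question asks. This is a minimal sketch, assuming the items are separated by exactly ", ". drop_word and sep are hypothetical names, and plain lists stand in for the DataFrame columns:

```python
def drop_word(word, sentence, sep=", "):
    """Return `sentence` without the items equal to `word` (hypothetical helper)."""
    # Split into items, keep those that differ from the row's own word, re-join
    return sep.join(item for item in sentence.split(sep) if item != word)


# Plain lists stand in for df.col1 and df.col2; the col1 word differs per row here
col1 = ["Dog", "Cat", "Dog", "Mouse"]
col2 = ["Cat, Mouse", "Dog, Cat", "Dog", "Dog, Mouse"]

result = [drop_word(w, s) for w, s in zip(col1, col2)]
print(result)
# ['Cat, Mouse', 'Dog', '', 'Dog']
```

Comparing whole items rather than substrings means a word such as "Dog" is never removed from inside a longer item, and the original order of the remaining items is kept.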
