Remove multi-word substring from string if substring in list in data frame column

Question

This is a follow-up to my earlier question: Remove substring from string if substring in list in data frame column

I have the following data frame df1

       string             lists
0      I HAVE A PET DOG   ['fox', 'pet dog', 'cat']
1      there is a cat     ['dog', 'house', 'car']
2      hello EVERYONE     ['hi', 'hello', 'everyone']
3      hi my name is Joe  ['name', 'was', 'is Joe']

I'm trying to return a data frame df2 that looks like this

       string             lists                         new_string
0      I HAVE A PET DOG   ['fox', 'pet dog', 'cat']     I HAVE A
1      there is a cat     ['dog', 'house', 'car']       there is a cat
2      hello EVERYONE     ['hi', 'hello', 'everyone']   
3      hi my name is Joe  ['name', 'was', 'is Joe']     hi my

The solution I was using does not work when a substring spans multiple words, such as 'pet dog' or 'is Joe':

df['new_string'] = df['string'].apply(lambda x: ' '.join([word for word in x.split() if word.lower() not in df['lists'][df['string'] == x].values[0]]))

Erfan · Accepted Answer · 2022-09-19 20:58:45Z


This question is similar to the earlier one, but the solution is quite different.

In this case we use re.sub over the row axis (axis=1):

import re
df["new_string"] = df.apply(lambda row: re.sub("|".join(row["lists"]), "", row["string"], flags=re.I), axis=1)

              string                  lists      new_string
0   I HAVE A PET DOG    [fox, pet dog, cat]       I HAVE A 
1     there is a cat      [dog, house, car]  there is a cat
2     hello EVERYONE  [hi, hello, everyone]                
3  hi my name is Joe    [name, was, is Joe]         hi my

To break it down:

df.apply with axis=1 applies a function to each row.
re.sub is the regex counterpart of str.replace.
"|".join builds a single "|"-separated pattern. "|" acts as the OR operator in regex, so any substring from the list that occurs in the string is removed.
flags=re.I makes the match case-insensitive.

Note: since we use apply over the row axis, this is basically a loop in the background and thus not very optimized.
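
The steps above can be put together as a self-contained sketch. The sample frame is rebuilt from the question. The helper name remove_listed_substrings and the final whitespace cleanup are assumptions added here for illustration and are not part of the original answer.

```python
import re

import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "string": [
        "I HAVE A PET DOG",
        "there is a cat",
        "hello EVERYONE",
        "hi my name is Joe",
    ],
    "lists": [
        ["fox", "pet dog", "cat"],
        ["dog", "house", "car"],
        ["hi", "hello", "everyone"],
        ["name", "was", "is Joe"],
    ],
})


def remove_listed_substrings(row):
    """Hypothetical helper: strip every substring in row['lists'] from row['string']."""
    # "a|b|c" matches a OR b OR c.
    pattern = "|".join(row["lists"])
    # re.I makes the match case-insensitive ("pet dog" removes "PET DOG").
    cleaned = re.sub(pattern, "", row["string"], flags=re.I)
    # Assumption: collapse leftover whitespace so the result matches the
    # desired output in the question; drop this line to keep the raw result.
    return " ".join(cleaned.split())


# axis=1 passes each row to the helper.
df["new_string"] = df.apply(remove_listed_substrings, axis=1)
print(df)
```

Without the whitespace cleanup, the removed substrings leave their surrounding spaces behind, which is why the output above shows a trailing space after "I HAVE A".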


2 Comments

Jasmijn Over a year ago

If any of the strings in row["lists"] might contain special characters (for example because it is sourced from user input) you need to escape them, like so: "|".join(re.escape(item) for item in row["lists"]). Otherwise the regular expression will not work as expected.
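
A small sketch of why the escaping matters, using a made-up list item that contains a regex metacharacter. The helper name build_pattern and the sample values are hypothetical and chosen only to show the difference.

```python
import re


def build_pattern(items):
    """Hypothetical helper: join list items into one pattern, escaping each item."""
    return "|".join(re.escape(item) for item in items)


items = ["a.b"]          # "." is a regex metacharacter (matches any character)
text = "a.b and axb"

# Unescaped: "a.b" also matches "axb", so both are removed.
unescaped = re.sub("|".join(items), "", text, flags=re.I)

# Escaped: only the literal text "a.b" is removed.
escaped = re.sub(build_pattern(items), "", text, flags=re.I)

print(repr(unescaped))
print(repr(escaped))
```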

mjp Over a year ago

This worked perfectly! I'm working with a pretty large dataset, so I might need to think of a different way to format the df to make this more efficient.
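
One possible variation for larger frames, not taken from the answer: iterate over the two columns with zip in a list comprehension instead of calling df.apply with axis=1. This is still a Python-level loop, but it avoids building a Series for every row. Whether it is noticeably faster depends on the data, so treat it as a sketch to benchmark rather than a guaranteed improvement.

```python
import re

import pandas as pd

# Two sample rows from the question, enough to show the idea.
df = pd.DataFrame({
    "string": ["I HAVE A PET DOG", "there is a cat"],
    "lists": [["fox", "pet dog", "cat"], ["dog", "house", "car"]],
})

# Walk both columns in parallel; each item is escaped as suggested in the
# comment above, then joined into one alternation pattern per row.
df["new_string"] = [
    re.sub("|".join(map(re.escape, items)), "", text, flags=re.I)
    for text, items in zip(df["string"], df["lists"])
]
print(df)
```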
