This is a follow-up to my earlier question: Remove substring from string if substring in list in data frame column

I have the following data frame df1:

       string             lists
0      I HAVE A PET DOG   ['fox', 'pet dog', 'cat']
1      there is a cat     ['dog', 'house', 'car']
2      hello EVERYONE     ['hi', 'hello', 'everyone']
3      hi my name is Joe  ['name', 'was', 'is Joe']

I'm trying to return a data frame df2 that looks like this:

       string             lists                         new_string
0      I HAVE A PET DOG   ['fox', 'pet dog', 'cat']     I HAVE A
1      there is a cat     ['dog', 'house', 'car']       there is a cat
2      hello EVERYONE     ['hi', 'hello', 'everyone']   
3      hi my name is Joe  ['name', 'was', 'is Joe']     hi my

The solution I was using does not work when a substring spans multiple words, such as 'pet dog' or 'is Joe', because it compares one whitespace-separated word at a time:

df['new_string'] = df['string'].apply(lambda x: ' '.join([word for word in x.split() if word.lower() not in df['lists'][df['string'] == x].values[0]]))

1 Answer

This question is similar to the earlier one, but it needs a different approach.

In this case we apply re.sub to each row (axis=1) and store the result in a new column:

import re
df["new_string"] = df.apply(lambda row: re.sub("|".join(row["lists"]), "", row["string"], flags=re.I), axis=1)
              string                  lists      new_string
0   I HAVE A PET DOG    [fox, pet dog, cat]       I HAVE A 
1     there is a cat      [dog, house, car]  there is a cat
2     hello EVERYONE  [hi, hello, everyone]                
3  hi my name is Joe    [name, was, is Joe]         hi my 

To break it down:

  1. df.apply with axis=1 applies a function to each row
  2. re.sub is the regex variant of str.replace
  3. We use "|".join to build a "|"-separated string; "|" acts as the OR operator in regex, so any of the listed substrings is matched and removed.
  4. flags=re.I makes the match case-insensitive.
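
The steps above can be put together as a small self-contained script. This is only a sketch: the frame is rebuilt from the sample data in the question, and the helper name remove_substrings is made up for illustration.

```python
import re
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "string": ["I HAVE A PET DOG", "there is a cat", "hello EVERYONE", "hi my name is Joe"],
    "lists": [
        ["fox", "pet dog", "cat"],
        ["dog", "house", "car"],
        ["hi", "hello", "everyone"],
        ["name", "was", "is Joe"],
    ],
})

def remove_substrings(row):
    # Hypothetical helper: join the row's substrings into one alternation pattern.
    pattern = "|".join(row["lists"])
    # Replace every case-insensitive match with an empty string.
    return re.sub(pattern, "", row["string"], flags=re.I)

# axis=1 passes each row to the helper as a Series.
df["new_string"] = df.apply(remove_substrings, axis=1)
```

Note that the whitespace around a removed substring is left in place, so some results keep a trailing space.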

Note: since we use apply over the row axis, this is essentially a Python-level loop under the hood, so it is not well optimized.
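
If the per-row overhead matters, one possible alternative is to loop over the two columns directly instead of going through apply. The sketch below is not part of the original answer. The function name strip_listed is hypothetical, and it makes two assumptions: that the substrings should be matched literally (hence re.escape), and that leftover whitespace should be collapsed so the result matches the df2 shown in the question. Whether it is actually faster is worth timing on your own data.

```python
import re

def strip_listed(strings, lists):
    """Remove each row's listed substrings from the matching string, ignoring case."""
    results = []
    for text, substrings in zip(strings, lists):
        # Assumption: substrings are literal text, so escape regex metacharacters.
        pattern = "|".join(re.escape(s) for s in substrings)
        # An empty list gives an empty pattern; leave the text unchanged then.
        cleaned = re.sub(pattern, "", text, flags=re.I) if pattern else text
        # Assumption: collapse leftover whitespace, e.g. "I HAVE A " -> "I HAVE A".
        results.append(" ".join(cleaned.split()))
    return results

strings = ["I HAVE A PET DOG", "there is a cat", "hello EVERYONE", "hi my name is Joe"]
lists = [
    ["fox", "pet dog", "cat"],
    ["dog", "house", "car"],
    ["hi", "hello", "everyone"],
    ["name", "was", "is Joe"],
]
new_strings = strip_listed(strings, lists)
```

With the frame from the question this could be used as df["new_string"] = strip_listed(df["string"], df["lists"]), since zip iterates over the Series values in order.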

2 Comments

If any of the strings in row["lists"] might contain regex special characters (for example, because they come from user input), you need to escape them, like so: "|".join(re.escape(item) for item in row["lists"]). Otherwise the regular expression will not work as expected.
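
A small illustration of the difference escaping makes. This is a sketch only: the function remove_listed and its escape parameter are hypothetical names, and the sample text is made up.

```python
import re

def remove_listed(text, substrings, escape=True):
    """Remove every listed substring from text, ignoring case."""
    # Optionally escape each item so metacharacters such as "." match literally.
    parts = (re.escape(s) if escape else s for s in substrings)
    return re.sub("|".join(parts), "", text, flags=re.I)

# Unescaped, "." is a wildcard, so the pattern "a.b" also removes "aXb".
unescaped = remove_listed("aXb a.b", ["a.b"], escape=False)

# Escaped, only the literal text "a.b" is removed.
escaped = remove_listed("aXb a.b", ["a.b"])
```

Here unescaped ends up as a single space, while escaped keeps "aXb " intact.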
This worked perfectly! I'm working with a pretty large dataset, so I might need to think of a different way to format the df to make this more efficient.
