I want to filter certain words out of a pandas DataFrame column and store the filtered text in a new column. I attempted the solution from here, but I think the problem is that Python is calling str.replace() rather than df.replace(). I'm not sure how to specify the latter when I'm calling it inside a function.

df:

id     old_text 
0      my favorite color is blue
1      you have a dog
2      we built the house ourselves
3      i will visit you

def removeWords(txt):
     words = ['i', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself']
     txt = txt.replace('|'.join(words), '', regex=True)
     return txt

df['new_text'] = df['old_text'].apply(removeWords)

error:

TypeError: replace() takes no keyword arguments

desired output:

id     old_text                         new_text
0      my favorite color is blue        favorite color is blue
1      you have a dog                   have a dog
2      we built the house ourselves     built the house 
3      i will visit you                 will visit

other things tried:

def removeWords(txt):
     words = ['i', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself']
     txt = [word for word in txt.split() if word not in words]
     return txt

df['new_text'] = df['old_text'].apply(removeWords)

This returns a list of words for each row rather than a single string:

id     old_text                         new_text
0      my favorite color is blue        [favorite, color, is, blue]
1      you have a dog                   [have, a, dog]
2      we built the house ourselves     [built, the, house]
3      i will visit you                 [will, visit]
  • Instead of using a function and apply, just use the built-in method Series.str.replace, as in df['new_text'] = df['old_text'].str.replace(args) Commented Oct 23, 2020 at 15:06
  • I certainly could. However, I'm trying to follow a convention, and I'd like to understand this issue better. Commented Oct 23, 2020 at 15:08

1 Answer

From this line:

txt = txt.replace('|'.join(words), '', regex=True)

This call matches the signature of pd.Series.replace, so your function expects a Series as input. On the other hand,

df['old_text'].apply(removeWords)

applies the function to each cell of df['old_text']. That means txt is just a Python string, and str.replace does not accept keyword arguments such as regex=True, hence the TypeError.
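
To make the distinction concrete, here is a minimal sketch, assuming pandas is installed. The sample Series and the variable names (s, cell_types, str_error, replaced) are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical two-row sample, shortened from the question's data.
s = pd.Series(["my favorite color is blue", "you have a dog"])

# Series.apply hands each cell to the function as a plain Python str.
cell_types = s.apply(lambda cell: type(cell).__name__).tolist()

# str.replace has no regex parameter, so the keyword raises TypeError.
try:
    "you have a dog".replace("you", "", regex=True)
    str_error = None
except TypeError as exc:
    str_error = str(exc)

# Series.replace does accept regex=True and processes the whole column.
replaced = s.replace(r"\b(my|you)\b", "", regex=True).tolist()

print(cell_types)
print(str_error)
print(replaced)
```

The same method name resolves to two different functions depending on whether txt is a str or a Series, which is why the call only works when the function receives the column itself.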

TL;DR: pass the whole column to the function instead of using apply:

df['new_text'] = removeWords(df['old_text'])

Output:

   id                      old_text                new_text
0   0     my favorite color is blue    favorte color s blue
1   1                you have a dog              have a dog
2   2  we built the house ourselves   bult the house selves
3   3              i will visit you                wll vst 

But as you can see, the pattern also matches inside other words (for example, the i in favorite). You may want to modify the pattern so that it only replaces whole words, using the word-boundary marker \b:

def removeWords(txt):
    words = ['i', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself']
    
    # note the `\b` here
    return txt.replace(rf"\b({'|'.join(words)})\b", '', regex=True)

Output:

   id                      old_text                 new_text
0   0     my favorite color is blue   favorite color is blue
1   1                you have a dog               have a dog
2   2  we built the house ourselves         built the house 
3   3              i will visit you              will visit 
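
If you would rather keep the apply convention mentioned in the comments, one possible sketch is to use re.sub on each string instead of Series.replace. The names remove_words, WORDS and PATTERN are hypothetical, and collapsing the leftover whitespace is an extra step that the original function does not perform.

```python
import re
import pandas as pd

WORDS = ['i', 'my', 'myself', 'we', 'our', 'ours', 'ourselves',
         'you', 'your', 'yours', 'yourself']

# Compile once; re.escape guards against words containing regex metacharacters.
PATTERN = re.compile(r"\b(?:" + "|".join(map(re.escape, WORDS)) + r")\b")

def remove_words(txt: str) -> str:
    """Remove whole-word matches from a single string (one cell)."""
    stripped = PATTERN.sub("", txt)
    # Collapse the whitespace left behind by the removed words.
    return " ".join(stripped.split())

df = pd.DataFrame({"old_text": [
    "my favorite color is blue",
    "you have a dog",
    "we built the house ourselves",
    "i will visit you",
]})

# apply passes each cell as a str, which is what remove_words expects.
df["new_text"] = df["old_text"].apply(remove_words)
print(df)
```

This version runs the regex once per row in Python, so on large columns the vectorized call shown above is likely to be faster.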