
I have a pandas DataFrame with a column named "content" that contains text. I want to remove certain words from each text in this column. I tried replacing each word with an empty string, but when I print the result of my function, the words have not been removed. My code is below:

def replace_words(t):
  words = ['Livre', 'Chapitre', 'Titre', 'Chapter', 'Article' ]
  for i in t:
    if i in words:
      t.replace (i, '')
    else:
      continue
  print(t)


st = 'this is Livre and Chapitre and Titre and Chapter and Article'

replace_words(st)

An example of desired result is: 'this is and and and and '

With the code below I want to apply the function above to each text in the column "content":

df['content'].apply(lambda x: replace_words(x))

Can someone help me write a function that removes all the listed words, and then apply it to every text in my df column?

2 Answers


You can use str.replace.
Input:

import re

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': np.arange(4),
    'words': ['this is Livre and Chapitre and Titre and Chapter and Article',
              'this is car and Chapitre and bus and Chapter and Article',
              'this is Livre and Chapitre',
              'nothing to replace']
})
words = ['Livre', 'Chapitre', 'Titre', 'Chapter', 'Article']
pat = '|'.join(map(re.escape, words))
print(pat)
# Livre|Chapitre|Titre|Chapter|Article
df['words'] = df['words'].str.replace(pat, '', regex=True)
print(df)
#    ID                               words
# 0   0        this is  and  and  and  and 
# 1   1  this is car and  and bus and  and 
# 2   2                       this is  and 
# 3   3                  nothing to replace

2 Comments

much better: df['words'] = df['words'].str.replace(pat, '', regex=True)
If your word list includes single letters such as a, e, i, o, u, or short syllables, str.replace also removes those letters or syllables inside other words. I had that problem and solved it by filtering whole tokens instead: df['words'] = df['words'].apply(lambda x: ' '.join([word for word in x.split() if word not in words]))
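
As an alternative to splitting and re-joining, the substring problem described in this comment can also be avoided by wrapping the pattern in word boundaries. The sketch below uses plain `re.sub` on a made-up sample string; the names `pat_plain`, `pat_whole`, and `text` are illustrative and not part of the original answer.

```python
import re

words = ['Livre', 'Chapitre', 'Titre', 'Chapter', 'Article']

# Plain alternation: matches the words anywhere, even inside longer words.
pat_plain = '|'.join(map(re.escape, words))

# Same alternation wrapped in \b...\b: matches whole words only.
pat_whole = r'\b(?:' + pat_plain + r')\b'

text = 'Titres and Titre'  # hypothetical sample input

plain_result = re.sub(pat_plain, '', text)  # 's and '  ("Titres" is damaged)
whole_result = re.sub(pat_whole, '', text)  # 'Titres and '
```

Assuming the DataFrame from the answer, the same pattern should work as `df['words'].str.replace(pat_whole, '', regex=True)`.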
1

Two problems:

  1. Iterating with for i in t: yields one character at a time, not one word.
  2. t.replace does not modify the string in place. It returns a new string, so the result has to be assigned back to t.
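
Both points can be checked in a few lines. This is a minimal illustration on a shortened sample string; the variable names are made up for the example.

```python
t = 'this is Livre'

# Problem 1: iterating over a string yields single characters.
chars = [c for c in t][:4]   # ['t', 'h', 'i', 's']
tokens = t.split()           # ['this', 'is', 'Livre']

# Problem 2: str.replace returns a new string and leaves t untouched.
t.replace('Livre', '')       # result is discarded
unchanged = t                # still 'this is Livre'

t = t.replace('Livre', '')   # assigning back keeps the change
```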

Use this:

def replace_words(t):
    words = ['Livre', 'Chapitre', 'Titre', 'Chapter', 'Article']
    for i in t.split(' '):
        # print(i)  # uncomment to see that each i is now a word (problem 1)
        if i in words:
            # str.replace returns a new string, so assign it back (problem 2)
            t = t.replace(i, '')
    return t

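Edit: You can pass the function directly, without a lambda: df['col'].apply(replace_words). Note that apply returns a new Series, so assign the result back to the column if you want to keep it.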

3 Comments

OK, the function works perfectly, but after applying it to the column with df['col'].apply(replace_words) I don't see the words replaced in the column's texts.
Did you try returning t at the end of the function?
Exactly, you have to return t and not just print it.
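
To make the point of this thread concrete, here is a small sketch that assumes the function from the answer and a made-up two-row DataFrame. Besides returning t, the result of apply has to be assigned back, because apply returns a new Series and does not change the column by itself.

```python
import pandas as pd

def replace_words(t):
    words = ['Livre', 'Chapitre', 'Titre', 'Chapter', 'Article']
    for i in t.split(' '):
        if i in words:
            t = t.replace(i, '')
    return t  # returning (not printing) is what apply needs

# Hypothetical sample data for illustration.
df = pd.DataFrame({'content': ['this is Livre and Chapitre',
                               'nothing to replace']})

df['content'].apply(replace_words)   # result discarded, df is unchanged
before = df['content'].tolist()

df['content'] = df['content'].apply(replace_words)  # assign back
after = df['content'].tolist()
```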
