I have a dataset like this:

import pandas as pd
df = pd.DataFrame({'word': ['abs e learning ', 'abs e-learning', 'abs e&learning', 'abs elearning']})

I want to get

      word
0   abs elearning
1   abs elearning
2   abs elearning
3   abs elearning

I tried the following:

re_map = {r'\be learning\b': 'elearning', r'\be-learning\b': 'elearning', r'\be&learning\b': 'elearning'}
import re
for r, map in re_map.items():
            df['word'] = re.sub(r, map, df['word'])

and I get this error:

TypeError                                 Traceback (most recent call last)
<ipython-input-42-fbf00d9a0cba> in <module>()
      3 s = df['word']
      4 for r, map in re_map.items():
----> 5             df['word'] = re.sub(r, map, df['word'])

C:\Users\Edward\Anaconda3\lib\re.py in sub(pattern, repl, string, count, flags)
    180     a callable, it's passed the match object and must return
    181     a replacement string to be used."""
--> 182     return _compile(pattern, flags).sub(repl, string, count)
    183 
    184 def subn(pattern, repl, string, count=0, flags=0):

TypeError: expected string or bytes-like object

I can wrap the column in str() like this:

for r, map in re_map.items():
            df['word'] = re.sub(r, map, str(df['word']))

This raises no error, but I don't get the DataFrame I want:

    word
0   0 0 0 abs elearning \n1 abs elearning\...\n1 0 0 abs elearning \n1 abs elearning\...\n2 0 0 abs elearning \n1 abs ele...
1   0 0 0 abs elearning \n1 abs elearning\...\n1 0 0 abs elearning \n1 abs elearning\...\n2 0 0 abs elearning \n1 abs ele...
2   0 0 0 abs elearning \n1 abs elearning\...\n1 0 0 abs elearning \n1 abs elearning\...\n2 0 0 abs elearning \n1 abs ele...
3   0 0 0 abs elearning \n1 abs elearning\...\n1 0 0 abs elearning \n1 abs elearning\...\n2 0 0 abs elearning \n1 abs ele...

How can I fix this?

1 Answer

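df['word'] is a pandas Series (a list-like column), not a string, which is why re.sub raises the TypeError. Calling str() on it does not convert each element: it returns the printed representation of the whole column as a single string. Assigning that one string back to df['word'] then copies it into every row, which is the output you are seeing.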
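As a quick check of that diagnosis, here is a minimal sketch that inspects what re.sub is actually handed. The two-row DataFrame is a trimmed copy of the one in the question, and the names col, as_text and first are only illustrative.

```python
import re
import pandas as pd

# Trimmed copy of the question's data (two rows are enough for the check)
df = pd.DataFrame({'word': ['abs e learning ', 'abs e-learning']})

col = df['word']
print(type(col).__name__)   # the column is a Series, not a str

# str() renders the whole column as one multi-line display string
as_text = str(col)
print(repr(as_text))

# re.sub is fine once it receives a single cell, which is a plain str
first = re.sub(r'\be learning\b', 'elearning', col.iloc[0])
print(repr(first))
```

Note that the trailing space in the first sample survives the substitution, because the pattern only covers the "e learning" part.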

You need to apply regex on each member:

for r, repl in re_map.items():
    df['word'] = [re.sub(r, repl, e) for e in df['word']]

A more traditional alternative, without the list comprehension:

for r, repl in re_map.items():
    # write each cell back through df.loc to avoid chained assignment
    for i, e in df['word'].items():
        df.loc[i, 'word'] = re.sub(r, repl, e)

By the way, you could collapse the three patterns into one by using a character class for the separator:

re_map = {r'\be[\-& ]learning\b': 'elearning'}
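A small standard-library check, run on a plain list rather than the DataFrame, suggests the combined pattern covers all three variants and leaves an already-correct value alone. The names pattern, samples and cleaned are hypothetical.

```python
import re

# Same combined pattern as above: "e", then one of "-", "&" or space, then "learning"
pattern = re.compile(r'\be[\-& ]learning\b')

# The four sample strings from the question
samples = ['abs e learning ', 'abs e-learning', 'abs e&learning', 'abs elearning']

cleaned = [pattern.sub('elearning', s) for s in samples]
print(cleaned)
```

The first result keeps its trailing space, since that space is outside the matched text.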

With a single regex, the whole thing becomes a one-liner:

df['word'] = [re.sub(r'\be[\-& ]learning\b', 'elearning', e) for e in df['word']]

It may be slightly faster to pre-compile the regex once and reuse it for all substitutions:

theregex = re.compile(r'\be[\-& ]learning\b')
df['word'] = [theregex.sub('elearning', e) for e in df['word']]
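Another option, not used in the snippets above, is to let pandas apply the regex itself through Series.str.replace. This is a sketch under the assumption that your pandas version accepts the regex keyword; it is passed explicitly here because its default has differed between versions. The final str.strip() call is an optional extra to drop the trailing space in the first sample.

```python
import pandas as pd

df = pd.DataFrame({'word': ['abs e learning ', 'abs e-learning',
                            'abs e&learning', 'abs elearning']})

# Vectorised replacement over the whole column; regex=True is set explicitly
df['word'] = df['word'].str.replace(r'\be[\-& ]learning\b', 'elearning', regex=True)

# Optional: remove leading/trailing whitespace left over in the data
df['word'] = df['word'].str.strip()

print(df)
```

This keeps the result as a Series throughout, so there is no need to rebuild the column from a list.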