TypeError: expected string or bytes-like object Regular expression removing special characters

Question

I have a train dataframe with a content column. Each row of the content column holds a list of words.

content
[sure, tune, …, watch, donald, trump, “,”, late, ’ , night]
[abc, xyz, “,”,late, ’, night]

Code to remove special characters

import re
train['content'] = train['content'].map(lambda x: re.sub(r'\W+', '', x))

Error

TypeError: expected string or bytes-like object

Expected output

content
[sure, tune,  watch, donald, trump, late,   night]
[abc, xyz,late, night]

Notice all the special characters like ..., “, ” and ’ are gone and we are left only with words.

First off: each row of content must be a string itself, not an actual list; otherwise you'd have syntax errors before you even started. You should probably clarify that.
– elPastor, Commented Jun 26, 2020 at 13:08
Are all of the items in the list defined variables, then? Either the list itself has to be a string, the items have to be strings or numbers, or they are variables that were previously defined.
– elPastor, Commented Jun 26, 2020 at 13:11

ztepler · Accepted Answer · 2020-06-26 13:06:08Z

1

You are trying to apply a regular expression to a list object, but re.sub expects a string (or bytes-like object) as its third argument.
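A minimal sketch that reproduces the error outside pandas, assuming a row really is a Python list of strings (the `row` value here is made up for illustration):

```python
import re

# Hypothetical row: a list of string tokens, as the question describes.
row = ['abc', 'xyz', '’', 'night']

try:
    # re.sub receives the whole list instead of a single string.
    re.sub(r'\W+', '', row)
except TypeError as exc:
    message = str(exc)
    print(message)
```

Newer Python versions may append extra detail to the message, but it begins with the same text as the error in the question.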

If your goal is to use this regex on every item of the list, you can apply re.sub for each item in list:

import re
def replace_func(item):
    return re.sub(r'\W+', '', item)

train['content'] = train['content'].map(lambda x: [replace_func(item) for item in x])
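Note that substituting inside each item leaves empty strings behind where a token was made up entirely of special characters, whereas the expected output in the question drops those tokens. A possible variant, sketched here with a hypothetical helper name `clean_tokens` and assuming every cell is a real list of strings:

```python
import re

NON_WORD = re.compile(r'\W+')

def clean_tokens(tokens):
    """Strip non-word characters from each token and drop tokens left empty."""
    cleaned = (NON_WORD.sub('', token) for token in tokens)
    return [token for token in cleaned if token]

# Sample row taken from the question.
row = ['sure', 'tune', '…', 'watch', 'donald', 'trump', '“,”', 'late', '’', 'night']
result = clean_tokens(row)
print(result)
```

On a dataframe this would presumably be applied as train['content'] = train['content'].map(clean_tokens).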


2 Comments

noob Over a year ago

This does not work. It separates all the individual letters.

ztepler Over a year ago

It would separate individual letters if there are strings in your content field. I assumed that every element in the content field is a list. What does train.content.map(type).value_counts() show?
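The letter-splitting symptom can be illustrated without pandas: iterating over a string yields one character at a time. The sketch below assumes the cells might be either lists or plain strings; `clean_cell` is a hypothetical helper, and it extracts runs of word characters with re.findall rather than using re.sub, which is a swapped-in technique and not the code from the answer.

```python
import re

# If a cell is a string, the list comprehension walks it character by character.
split_letters = [re.sub(r'\W+', '', ch) for ch in "ab, c"]
print(split_letters)

def clean_cell(cell):
    """Return a list of word tokens whether the cell is a list or a single string."""
    if isinstance(cell, str):
        # Pull out runs of word characters directly from the string.
        return re.findall(r'\w+', cell)
    # Otherwise treat the cell as an iterable of string tokens.
    return [word for token in cell for word in re.findall(r'\w+', token)]

from_string = clean_cell("[abc, xyz, “,”,late, ’, night]")
from_list = clean_cell(['abc', 'xyz', '“,”', 'late', '’', 'night'])
print(from_string, from_list)
```

One difference to keep in mind: a token such as don't becomes two words with re.findall, while re.sub would join it into dont.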

grim · Accepted Answer · 2020-06-26 13:04:45Z

0

Just do:

import re

# Sample row from the question; each item is a string
content = ['sure', 'tune', '…', 'watch', 'donald', 'trump', '“,”', 'late', '’', 'night']
content = list(map(lambda x: re.sub(r'\W+', '', x), content))


2 Comments

noob Over a year ago

content is a column in the dataset. There are 30,000 rows in the content column; these 2 rows are just an example.

grim Over a year ago

The main idea is to parse strings with a regex, isn't it? Do content=train['content'].tolist() and assign the result back, or loop over the dataframe column; you need an additional conversion anyway, in my opinion.
