
I'm trying to tokenize some documents, but I get this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 6: ordinal not in range(128)

import nltk
import pandas as pd

df = pd.DataFrame(pd.read_csv('status2.csv'))
documents = df['status']

result = [nltk.word_tokenize(sent) for sent in documents]

I think it's a Unicode problem, so I added:

documents = unicode(documents, 'utf-8')

but that raises another error:

TypeError: coercing to Unicode: need string or buffer, Series found

For reference, here is what documents contains:

print documents

1      Brandon Cachia ,All I know is that,you're so n...
2      Melissa Zejtunija:HAM AND CHEESE BIEX INI??? *...
3                         .........Where is my mind?????
4      Having a philosophical discussion with Trudy D...

1 Answer

unicode() operates on a single string or buffer, but documents is a pandas Series, which is why you get the TypeError. You need to decode each element individually.

Try decoding each sentence as you tokenize:

result = [nltk.word_tokenize(unicode(sent, 'utf-8')) for sent in documents]
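If it helps, here is a minimal end-to-end sketch under the same assumptions (Python 2, where unicode() and str.decode exist; a UTF-8 encoded status2.csv with a status column; and NLTK's punkt data downloaded). The dropna() call is my addition to guard against missing rows, which would otherwise break the decode:

import nltk
import pandas as pd

# read_csv already returns a DataFrame; no need to wrap it in pd.DataFrame
df = pd.read_csv('status2.csv')
documents = df['status'].dropna()

# Decode each byte string to unicode before tokenizing
result = [nltk.word_tokenize(sent.decode('utf-8')) for sent in documents]

Alternatively, passing encoding='utf-8' to pd.read_csv should give you unicode values up front, so no per-row decode is needed.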