
I'm trying to tokenize some documents, but I get this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 6: ordinal not in range(128)

import nltk
import pandas as pd

df = pd.DataFrame(pd.read_csv('status2.csv'))
documents = df['status']

result = [nltk.word_tokenize(sent) for sent in documents]

I think it's a Unicode problem, so I added:

documents = unicode(documents, 'utf-8')

but that raises another error:

TypeError: coercing to Unicode: need string or buffer, Series found

For reference, here is what documents contains:

print documents

1      Brandon Cachia ,All I know is that,you're so n...
2      Melissa Zejtunija:HAM AND CHEESE BIEX INI??? *...
3                         .........Where is my mind?????
4      Having a philosophical discussion with Trudy D...

1 Answer

unicode() operates on a single string or buffer, but documents is a pandas Series, which is why you get the TypeError. You need to decode each element individually.

Try decoding each sentence as you tokenize:

result = [nltk.word_tokenize(unicode(sent, 'utf-8')) for sent in documents]
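If it helps, here is a minimal end-to-end sketch under the same assumptions (Python 2, where unicode() and str.decode exist; a UTF-8 encoded status2.csv with a status column; and NLTK's punkt data downloaded). The dropna() call is my addition to guard against missing rows, which would otherwise break the decode:

import nltk
import pandas as pd

# read_csv already returns a DataFrame; no need to wrap it in pd.DataFrame
df = pd.read_csv('status2.csv')
documents = df['status'].dropna()

# Decode each byte string to unicode before tokenizing
result = [nltk.word_tokenize(sent.decode('utf-8')) for sent in documents]

Alternatively, passing encoding='utf-8' to pd.read_csv should give you unicode values up front, so no per-row decode is needed.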