UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80

Question

There are a lot of similar questions, and I have tried every solution I could find, but I can't get this to work. I am working on Named Entity Recognition with the Stanford tagger, and this is my code.

from nltk.tag import StanfordNERTagger
# Raw strings keep the Windows path backslashes from being read as escape sequences
st = StanfordNERTagger(r'stanford-ner\classifiers\english.all.3class.distsim.crf.ser.gz',
                   r'stanford-ner\stanford-ner.jar', encoding='utf-8')
tuple_list = st.tag("Please pay €94 million.".split())
print(tuple_list)

This is the error I get.

Traceback (most recent call last):
File "C:/Users/Dell/PycharmProjects/CSSOP/ner2.py", line 4, in <module>
tuple_list = st.tag("He was the subject of the most expensive association football transfer when he moved from Manchester United to Real Madrid in 2009 in a transfer worth €94 million ($132 million).".split())
File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\tag\stanford.py", line 71, in tag
return sum(self.tag_sents([tokens]), []) 
File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\tag\stanford.py", line 95, in tag_sents
stanpos_output = stanpos_output.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 247: invalid start byte

Edit: This is not a file-opening encoding issue, as was suggested in other similar questions.

Are you certain that the encoding is 'utf-8', and not (e.g.) 'Windows-1252'?
– PM 2Ring, Commented Apr 30, 2017 at 8:44
The encoding is cp1252. 0x80 is the Euro character in that encoding.
– alexis, Commented Apr 30, 2017 at 9:44

alexis · Accepted Answer · 2017-04-30 09:43:37Z

You are getting a decoding error when nltk's Stanford wrapper tries to read back the output of the Stanford recognizer (which is a Java program). Clearly the recognizer has produced output that is not valid UTF-8. Evidently it does not check the data you pass it before writing it out, so the problem is only discovered when Python tries to read the result back in.
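To see which byte the decode step rejects, the failure can be reproduced outside the tagger. This is only a minimal sketch: `find_bad_byte` and the `sample` bytes are hypothetical stand-ins for illustration, not part of nltk and not real recognizer output.

```python
def find_bad_byte(raw, encoding="utf-8"):
    """Return (position, offending bytes) for the first decode failure, or None."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as err:
        # err.start and err.end delimit the slice the codec rejected
        return err.start, raw[err.start:err.end]
    return None


# Hypothetical stand-in for the recognizer's output: ASCII text plus a stray 0x80
sample = b"worth/O \x8094/O million/O"
print(find_bad_byte(sample))
```

The reported position plays the same role as the "position 247" in the traceback: it is the offset of the rejected byte within the data being decoded.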

Now, a Windows-1252 code page table shows that 0x80 is how that code page encodes the Euro symbol. The implication is clear: your Python source uses the Windows-1252 encoding, so that's what your string literal contains. The right solution here would be to switch your editor to UTF-8 and fix the encoding of your program.
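The mismatch between the two encodings can be checked with the standard library alone. The snippet below is a small illustration, not a fix for the tagger: it encodes the Euro sign both ways and then tries to decode the cp1252 byte as UTF-8.

```python
euro = "\u20ac"  # the Euro sign, written as an escape so the file's own encoding does not matter

cp1252_bytes = euro.encode("cp1252")  # a single byte in Windows-1252
utf8_bytes = euro.encode("utf-8")     # three bytes in UTF-8
print(cp1252_bytes)
print(utf8_bytes)

try:
    # A lone 0x80 is a continuation byte in UTF-8, so it cannot start a character
    cp1252_bytes.decode("utf-8")
except UnicodeDecodeError as err:
    print(err)
```

The exception message has the same form as the one in the question, which is consistent with a cp1252-encoded Euro sign ending up in data that is later decoded as UTF-8.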

This behavior would make sense if you're using Python 2; but your snippet seems to be Python 3 (function form of print), so please clarify before I venture an alternative fix.

2 Comments

user6446052 Over a year ago

Yes, I am using Python 3. Can you suggest how to fix the editor's encoding? I am using PyCharm Professional as my Python editor.

alexis Over a year ago

Never been near Pycharm, sorry.
