
There are a lot of similar questions, and I have tried every solution I could find, but I can't work it out. I am working on Named Entity Recognition using the Stanford NER tagger through NLTK. This is my code:

from nltk.tag import StanfordNERTagger
st = StanfordNERTagger(r'stanford-ner\classifiers\english.all.3class.distsim.crf.ser.gz',
                   r'stanford-ner\stanford-ner.jar', encoding='utf-8')
tuple_list = st.tag("Please pay €94 million.".split())
print(tuple_list)

This is the error I get.

Traceback (most recent call last):
  File "C:/Users/Dell/PycharmProjects/CSSOP/ner2.py", line 4, in <module>
    tuple_list = st.tag("He was the subject of the most expensive association football transfer when he moved from Manchester United to Real Madrid in 2009 in a transfer worth €94 million ($132 million).".split())
  File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\tag\stanford.py", line 71, in tag
    return sum(self.tag_sents([tokens]), [])
  File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\tag\stanford.py", line 95, in tag_sents
    stanpos_output = stanpos_output.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 247: invalid start byte

Edit: This is not a file-opening encoding issue, as was suggested in another similar question.

  • Possible duplicate of 'utf-8' codec can't decode byte 0x80 Commented Apr 30, 2017 at 7:04
  • Are you certain that the encoding is 'utf-8', and not (eg) 'Windows-1252'? Commented Apr 30, 2017 at 8:44
  • The encoding is cp1252. 0x80 is the Euro character in that encoding. Commented Apr 30, 2017 at 9:44

1 Answer

You are getting a decoding error when NLTK's Stanford wrapper tries to read back the output of the Stanford recognizer (which is a Java program). Clearly the recognizer has produced output that is not valid UTF-8. Evidently, it does not check the data you pass it before writing it out, so the problem is only discovered when Python tries to read it back in.

Now, at the very top of this table you'll see that 0x80 is how the Windows-1252 code page encodes the Euro symbol. The implication is clear: your Python source file is saved in the Windows-1252 encoding, so that's what your string literal contains. The right solution here would be to switch your editor to UTF-8 and fix the encoding of your program.
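To see the mismatch in isolation, here is a minimal sketch that uses only Python's built-in codecs and does not involve NLTK or the Stanford tools. It decodes the single byte 0x80 first as Windows-1252 and then as UTF-8. The variable names are made up for illustration.

```python
# The byte named in the traceback.
raw = b"\x80"

# Windows-1252 maps 0x80 to the Euro sign (U+20AC).
as_cp1252 = raw.decode("cp1252")
print(ascii(as_cp1252))

# UTF-8 rejects 0x80 as the first byte of a character.
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc
    print(exc)
```

The second decode fails with the same "invalid start byte" message as in the question, which is consistent with cp1252-encoded bytes being read as UTF-8 somewhere along the way.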

This behavior would make sense if you're using Python 2; but your snippet seems to be Python 3 (function form of print), so please clarify before I venture an alternative fix.
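To illustrate why the Python version matters, here is a small sketch, again using only built-in codecs. In Python 3 a string literal is Unicode text, so encoding the Euro sign as UTF-8 yields three bytes, none of which is 0x80. Only encoding it as cp1252 yields that byte. The Euro sign is written as an escape so the result does not depend on how this snippet itself is saved.

```python
# The Euro sign as a Python 3 str (Unicode text, not bytes).
euro = "\u20ac"

# Encoding the same character with the two codecs discussed here.
utf8_bytes = euro.encode("utf-8")
cp1252_bytes = euro.encode("cp1252")

print(utf8_bytes)    # b'\xe2\x82\xac'
print(cp1252_bytes)  # b'\x80'
```

So under Python 3, a lone 0x80 byte suggests that some step encoded the text as cp1252 instead of UTF-8. Which step that is cannot be determined from the traceback alone.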


2 Comments

Yes, I am using Python 3. Can you suggest how to fix the encoding in the editor? I am using PyCharm Professional as my Python editor.
Never been near PyCharm, sorry.
