Understanding unicode in Python and UnicodeDecodeError

Question

In Python 2.7 on a Mac I'm printing file names retrieved with nltk's PlaintextCorpusReader:

infobasecorpus = PlaintextCorpusReader(corpus_root, '.*\.txt')
for fileid in infobasecorpus.fileids():
    print fileid

and get UnicodeDecodeError: 'ascii', '100316-N1-The \xc2\xa3250bn cost of developing.txt', 14, 15, 'ordinal not in range(128)' because of the £ symbol in a filename.

As I understand things, fileid is a unicode string which I need to encode to the default encoding before I can print it, and the default encoding is ASCII.

If I use print fileid.encode('ascii', 'ignore'), I get the same error.

If I change the default encoding by setting encoding = "utf-8" in site.py (per this advice), it works.

Can anyone tell me: (a) why encode fails, (b) why changing the default encoding works, and (c) what I should do if I'm doing something wrong here? (For example, this describes setting the default encoding as 'an ugly hack' that leads to the misuse of strings and the creation of buggy code.)

(Disclaimer: new to Python, very grateful for your patience if this is obvious)

=========================================== Update to respond to Rob:

Rob, here is the full text of the test code:

import sys
import os
from nltk.corpus import PlaintextCorpusReader

corpus_root = '/Users/richlyon/Documents/Filing/Infobase/'
infobasecorpus = PlaintextCorpusReader(corpus_root, '.*\.txt')

for fileid in infobasecorpus.fileids():
    print type(fileid)             # result <type 'str'>
    fileid = fileid.decode('utf8')
    print type(fileid)             # result <type 'unicode'>
    print fileid.encode('ascii')

I've set default encoding back to ascii and run it.

print fileid.encode('ascii') still fails on £ in a filename.

=========================================== Last update in case this is of help to anyone else.

I needed to write:

fileid = fileid.decode('utf8')
print fileid.encode('ascii', 'ignore')

but text = nltk.Text(infobasecorpus.words(fileid)) chokes if it is fed <type 'unicode'> strings, which seems to contradict the recommendation to immediately convert everything into unicode before further processing.

But now it works. Thanks all, and Rob in particular.

Since UTF-8 is the default encoding for 3.x, I don't find this hack ugly. It'll help you if you have to port your code to 3.x someday.
– Evpok, Commented May 23, 2011 at 8:56
I do not believe that "print fileid.encode('ascii', 'ignore')" won't work.
– user2665694, Commented May 23, 2011 at 8:58

Rob Cowie · Accepted Answer · 2011-05-23 09:11:35Z

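Check the type of the fileid object. I suspect it is not a unicode object as you suggest. The UnicodeDecodeError is being raised because of an implicit decode that happens before Python encodes the string for output (by print). In Python 2, calling encode on a byte string (str) first decodes it with the default ASCII codec, and that step fails on the non-ASCII bytes of the £ sign.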
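To make the implicit decode visible, here is a minimal sketch using the filename bytes from the question. The explicit raw.decode('ascii') call stands in for the hidden step Python 2 performs when encode is called on a byte string, so the snippet behaves the same on Python 2 and 3. The variable names are illustrative only.

```python
# The filename from the question as raw UTF-8 bytes (a Python 2 `str`).
raw = b'100316-N1-The \xc2\xa3250bn cost of developing.txt'

# Python 2 performs this ASCII decode implicitly when encode() is called
# on a byte string; here it is written out explicitly.
try:
    raw.decode('ascii')
    failure = None
except UnicodeDecodeError as exc:
    failure = exc

# The failure points at byte 14, the first byte (0xc2) of the UTF-8 pound sign.
print(failure.start)

# Decoding with the codec the bytes were actually written in succeeds.
name = raw.decode('utf8')
print(name == u'100316-N1-The \xa3250bn cost of developing.txt')
```

The position reported here matches the "14, 15" in the error message quoted in the question.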

Once the string is successfully decoded (to unicode), you can then print it by explicitly encoding it with a codec supported by your terminal. If your terminal supports the display of unicode, you may not need to encode it before output.

infobasecorpus = PlaintextCorpusReader(corpus_root, '.*\.txt')
for fileid in infobasecorpus.fileids():
    fileid = fileid.decode('utf8')  # fileid is now a unicode object
    print fileid.encode('utf8')

In the decode call, replace utf8 with whatever encoding your filesystem uses (possibly latin1 on Windows; I'm not sure).
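If the terminal can only display ASCII, the error handler passed to encode decides what happens to the £ sign. The sketch below compares the standard handlers on a shortened, made-up filename; it is an illustration of the codec behaviour, not part of the original answer.

```python
# A unicode filename containing the pound sign (U+00A3).
name = u'The \xa3250bn cost of developing.txt'

# 'strict' (the default) refuses characters that ASCII cannot represent.
try:
    name.encode('ascii')
    strict_error = None
except UnicodeEncodeError as exc:
    strict_error = exc

# The other handlers trade fidelity for never raising.
dropped = name.encode('ascii', 'ignore')            # pound sign removed
marked = name.encode('ascii', 'replace')            # pound sign becomes '?'
escaped = name.encode('ascii', 'backslashreplace')  # pound sign becomes the text \xa3

# UTF-8 can represent every character, so nothing is lost.
full = name.encode('utf8')
```

Note that the strict case raises UnicodeEncodeError, not UnicodeDecodeError: once the string really is unicode, the only remaining failure is on the encode side.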

EDIT: Overriding the site-wide default encoding is considered a hack because a) it can hide programming issues, which may mean your code is not portable across Python installs, and b) it can affect other code running from the same Python installation. Further, being explicit about encoding and decoding your strings makes life easier when you return to your code later: you don't have to remember that you modified site.py.
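One way to stay explicit without touching site.py is to keep the encoding decision in a small function of your own. The following is only a sketch: encode_for_output is a hypothetical helper name, and falling back to UTF-8 when the terminal encoding is unknown is an assumption, not something the answer prescribes.

```python
import sys


def encode_for_output(text, encoding=None):
    """Hypothetical helper: encode unicode text for the current terminal."""
    # sys.stdout.encoding can be None (for example when output is piped),
    # so fall back to UTF-8. That fallback is an assumption.
    target = encoding or getattr(sys.stdout, 'encoding', None) or 'utf-8'
    # 'replace' substitutes '?' for anything the target codec cannot represent.
    return text.encode(target, 'replace')


pound_ascii = encode_for_output(u'\xa3250bn', 'ascii')
pound_utf8 = encode_for_output(u'\xa3250bn', 'utf-8')
```

On Python 2 the returned byte string can be passed straight to print; the interpreter-wide default stays ASCII, so other code is unaffected.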

edited May 23, 2011 at 9:11

answered May 23, 2011 at 9:04


Richard Over a year ago

Thanks Rob - it needed print fileid.encode('ascii', 'ignore') to work and I finally get the encode/decode thing, thanks to your explanation. Really appreciate your time.
