In Python 2.7 on a Mac I'm printing file names retrieved with nltk's PlaintextCorpusReader:
infobasecorpus = PlaintextCorpusReader(corpus_root, '.*\.txt')
for fileid in infobasecorpus.fileids():
    print fileid
and get UnicodeDecodeError: 'ascii', '100316-N1-The \xc2\xa3250bn cost of developing.txt', 14, 15, 'ordinal not in range(128)' because of the £ symbol in a filename.
As I understand things, fileid is a unicode string which I need to encode to the default encoding before I can print it, and the default encoding is ASCII.
If I use print fileid.encode('ascii', 'ignore'), I get the same error.
If I change the default encoding by setting encoding = "utf-8" in site.py (per this advice), it works.
Can anyone tell me:
(a) why encode() has failed,
(b) why changing the default encoding works, and
(c) what I should do if I'm doing something wrong here? (For example, this describes setting default encoding as 'an ugly hack' that leads to the misuse of strings and creation of buggy code.)
(Disclaimer: new to Python, very grateful for your patience if this is obvious)
=========================================== Update to respond to Rob:
Rob, here is the full text of the test code:
import sys
import os
from nltk.corpus import PlaintextCorpusReader
corpus_root = '/Users/richlyon/Documents/Filing/Infobase/'
infobasecorpus = PlaintextCorpusReader(corpus_root, '.*\.txt')
for fileid in infobasecorpus.fileids():
    print type(fileid) # result <type 'str'>
    fileid = fileid.decode('utf8')
    print type(fileid) # result <type 'unicode'>
    print fileid.encode('ascii')
I've set default encoding back to ascii and run it.
print fileid.encode('ascii') still fails on £ in a filename.
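For anyone following along, here is a minimal sketch of how I understand the error-handler argument of encode() to behave once the name is already a unicode string. The sample string and the variable names are made up for illustration; they are not from my corpus. It is written so that it should run on Python 2.7 and on Python 3.

```python
from __future__ import print_function

# Hypothetical sample: a unicode string containing the pound sign (u'\xa3').
name = u'The \xa3250bn cost'

# Default ("strict") handling: ASCII cannot represent the pound sign,
# so encode() raises UnicodeEncodeError.
try:
    name.encode('ascii')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# 'ignore' silently drops characters that cannot be encoded.
dropped = name.encode('ascii', 'ignore')

# 'replace' substitutes a question mark for each such character.
marked = name.encode('ascii', 'replace')

print(strict_ok)
print(dropped)
print(marked)
```

Note that 'ignore' loses information, so it seems suitable only for display, not for anything that has to find the file again.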
=========================================== Last update in case this is of help to anyone else.
I needed to write:
fileid = fileid.decode('utf8')
print fileid.encode('ascii', 'ignore')
but text = nltk.Text(infobasecorpus.words(fileid)) chokes if it is fed <type 'unicode'> strings, which seems to contradict the recommendation to immediately convert everything into unicode before further processing.
But now it works. Thanks all, and Rob in particular.
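In case it helps, this is a small sketch of what I now believe was going on, assuming the file names come back as UTF-8 byte strings (which is what type(fileid) reported above). In Python 2, calling encode() on a byte string first decodes it with the default codec (ascii), and that hidden step is where the UnicodeDecodeError comes from. The sketch performs that step explicitly so that it also runs on Python 3; the variable names are mine.

```python
from __future__ import print_function

# The file name as a byte string: the pound sign is the two UTF-8 bytes \xc2\xa3.
raw = b'100316-N1-The \xc2\xa3250bn cost of developing.txt'

# Python 2's str.encode() implicitly decodes with the ascii codec first.
# Doing that decode explicitly reproduces the reported error position.
try:
    raw.decode('ascii')
    failure = None
except UnicodeDecodeError as exc:
    failure = (exc.start, exc.end)

# Decoding with the codec the bytes are actually in, then encoding, works.
text = raw.decode('utf-8')
ascii_name = text.encode('ascii', 'ignore')

print(failure)      # position of the first non-ASCII byte
print(ascii_name)
```

The failure position matches the 14, 15 in the original traceback, which suggests the error was raised by the implicit decode and not by the encode itself.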