I have an XML file with this structure:

<doc>
 <content>
  <one>Title</one>
  <two>bla bla bla bla</two>
 </content>
 <content>
  <one>Title</one>
  <two>bla bla bla bla</two>
 </content>
 ...
</doc>

I read the file in Python through the nltk package and parse the tree with ElementTree like this:

import nltk
from xml.etree.ElementTree import ElementTree
wow = nltk.data.find('/path/file.xml')
tree = ElementTree().parse(wow)

Then I try to print something from 'two' elements like this:

for i, content in enumerate(tree.findall('content')):
    for two in content.findall('two'):
        if 'keyword' in str(two.text):
            print("%s" % (two.text))

And I get the infamous error:

Traceback (most recent call last):
   File "<stdin>", line 3, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 21: ordinal not in range(128)

I know this comes from a mismatch between the ASCII and UTF-8 encodings; the XML file itself is encoded in UTF-8. I tried several solutions found here on Stack Overflow (mainly adding .encode('UTF-8') or .decode('UTF-8') in various places, and passing encoding='utf-8' to data.find), but the examples I found were quite different from my case and I could not adapt them. As you can imagine, I am new to Python.

How can I avoid the error and print the content I need? Thanks.

  • try if u'keyword' in unicode(two.text): -- When you call str on an object, you coerce that object into string format, which uses the ascii codec. If your object contains non-ascii elements, that will throw an error. Commented Jan 31, 2015 at 17:16
  • No need to call unicode either -- two.text should already be unicode (if it isn't, the decoding needs an explicit codec anyway, and should happen earlier). Commented Jan 31, 2015 at 17:24
  • Thanks duhaime and Alex, the str was exactly what was causing the problem! I didn't know str forced the text in ascii. Commented Jan 31, 2015 at 17:59
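The failure described in these comments can be reproduced in isolation. In Python 2, calling str() on a unicode object implicitly encodes it with the ascii codec; the sketch below makes that encoding step explicit so it behaves the same way on Python 2 and Python 3. The sample string is an assumption, chosen only because it contains the u'\xe0' character named in the traceback.

```python
# Hypothetical sample text containing a non-ASCII character (U+00E0).
text = u"voil\xe0"

error = None
try:
    # This is the step Python 2's str(text) performs implicitly.
    text.encode("ascii")
except UnicodeEncodeError as exc:
    error = exc

# The exception records which codec failed and where in the string.
failed_codec = error.encoding
failed_at = error.start
```

Checking membership with `u'keyword' in text` never triggers this encoding step, which is why dropping str() removes the error.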

1 Answer

So two.text should be a Unicode string and you want to print it -- why not just check

if u'keyword' in two.text:

and then if appropriate

print(two.text)

without the laborious stringification? If your terminal is properly set, it will tell Python which encoding to use to send it bytes properly representing that string for display purposes.
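As a minimal sketch of that approach, the loop can test and collect two.text directly, with no str() call and no % formatting. It assumes a small in-memory sample document in place of the real file; the names XML_BYTES and matching_texts are hypothetical and do not come from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the real file: UTF-8 bytes with a non-ASCII character.
XML_BYTES = u"""<doc>
 <content>
  <one>Title</one>
  <two>keyword: voil\u00e0</two>
 </content>
 <content>
  <one>Title</one>
  <two>bla bla bla bla</two>
 </content>
</doc>""".encode("utf-8")


def matching_texts(xml_bytes, keyword):
    """Return the text of every <two> element that contains keyword."""
    root = ET.fromstring(xml_bytes)  # the parser decodes the UTF-8 bytes
    matches = []
    for content in root.findall("content"):
        for two in content.findall("two"):
            text = two.text or u""  # an empty element has text None
            if keyword in text:     # Unicode-to-Unicode comparison, no str()
                matches.append(text)
    return matches


matches = matching_texts(XML_BYTES, u"keyword")
```

On a terminal whose encoding is set correctly, printing each entry of matches with a plain print(m) should display the accented text unchanged.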

It's usually best to work uniformly in Unicode (that's why str has become unicode in Python 3:-) and only decode on input, encode on output -- and often the I/O systems will handle the decoding and encoding for you quite transparently.
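One way to picture "decode on input, encode on output" is the sketch below. It assumes the input arrives as UTF-8 bytes and uses an in-memory buffer in place of a real file or terminal; the sample bytes and variable names are illustrative only.

```python
import io

# Input boundary: raw UTF-8 bytes, as they might arrive from a file.
raw = b"voil\xc3\xa0 keyword"
text = raw.decode("utf-8")  # decode once, as early as possible

# Middle: all processing happens on Unicode text.
found = u"keyword" in text

# Output boundary: let the I/O layer encode on the way out.
buffer = io.BytesIO()
writer = io.TextIOWrapper(buffer, encoding="utf-8")
writer.write(text)
writer.flush()
encoded = buffer.getvalue()
```

Here the io.TextIOWrapper plays the role that a properly configured terminal or an io.open(..., encoding='utf-8') file handle would play in a real program.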

Depending on your version of Python (which you don't tell us), you may need to do some explicit encoding -- as soon as possible, not late in the day. E.g., if you're stuck with Python 2, and wow is a Unicode string (depends on your version of nltk, I think), then

tree = ElementTree().parse(wow.encode('utf8'))

might work better; if wow is already a utf8-encoded byte string as it comes from nltk, then obviously you won't need to encode it again:-).

To remove such doubts, print(repr(wow[:30])) or thereabouts will tell you more. And print(sys.version) (after import sys) will tell you what version of Python you have, so you can in turn tell us, as so few people appear to do even though it's most often absolutely crucial info!-)
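Those two checks can be wrapped in a small helper, sketched here under the assumption that the value is a text or byte string that supports slicing. The name describe and the sample values are hypothetical.

```python
import sys


def describe(value, limit=30):
    """Return the type name plus the repr of the first `limit` items."""
    return "%s: %r" % (type(value).__name__, value[:limit])


# The interpreter version determines what str and bytes mean.
version_line = sys.version.splitlines()[0]

# repr() is ASCII-safe on Python 2, so these are safe to print there too.
text_report = describe(u"/path/file.xml")
bytes_report = describe(b"<doc>")

print(version_line)
print(text_report)
print(bytes_report)
```

The type name in each report shows at a glance whether a value is text or bytes, which is the information needed to decide whether an explicit encode is required.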

3 Comments

The error seems to be pointing to line 3 though, so wouldn't that imply the str() call is causing the problem?
@duhaime sure, the str(...) happens before the equivalent % formatting in the print -- both try to encode two.text as ascii. And neither is necessary! See my answer: no str call, no % in print either.
Sorry, I forgot to say: my Python version is 2.7, and cutting str worked perfectly! The problem is that I was adapting code taken from the NLTK book and thought str was crucial, but it wasn't. Thanks!
