I have an XML file with this structure:
<doc>
    <content>
        <one>Title</one>
        <two>bla bla bla bla</two>
    </content>
    <content>
        <one>Title</one>
        <two>bla bla bla bla</two>
    </content>
    ...
</doc>
I read the file in Python through the nltk package and parse the tree with ElementTree like this:
import nltk
from xml.etree.ElementTree import ElementTree
# Locate the file via nltk, then parse it; parse() returns the root <doc> element
wow = nltk.data.find('/path/file.xml')
tree = ElementTree().parse(wow)
Then I try to print something from 'two' elements like this:
for i, content in enumerate(tree.findall('content')):
    for two in content.findall('two'):
        if 'keyword' in str(two.text):
            print("%s" % (two.text))
And I get the infamous error:
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 21: ordinal not in range(128)
I know this is caused by a mismatch between the ASCII and UTF-8 encodings. The XML file is encoded in UTF-8. I tried several solutions found here on Stack Overflow (mainly adding .encode('UTF-8') or .decode('UTF-8') here and there, and also passing encoding='utf-8' to data.find), but the examples I found were quite different from mine, so I couldn't adapt those answers to my case. As you can imagine, I am new to Python.
How can I avoid the error and print the content I need? Thanks.
if u'keyword' in unicode(two.text):
When you call str on an object, you coerce that object into a byte string, which uses the ascii codec. If your object contains non-ASCII characters, that will throw an error.
You don't need unicode either: two.text should already be unicode (if it isn't, the decoding needs an explicit codec anyway, and should happen earlier).
str was exactly what was causing the problem! I didn't know str forced the text into ASCII.
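Putting the advice above together, here is a minimal sketch of the loop without the str call. The sample XML, the constant SAMPLE_XML, and the helper name texts_containing are made up for illustration and are not from the question; the document is parsed from an in-memory UTF-8 byte string rather than through nltk so the example is self-contained. It should run unchanged on Python 2 and Python 3.

```python
from __future__ import unicode_literals

import xml.etree.ElementTree as ET

# Hypothetical sample data mirroring the structure in the question.
# \u00e0 is the character reported in the traceback.
SAMPLE_XML = (
    "<doc>"
    "<content><one>Title</one><two>keyword voil\u00e0</two></content>"
    "<content><one>Title</one><two>bla bla bla bla</two></content>"
    "<content><one>Title</one><two/></content>"
    "</doc>"
).encode("utf-8")  # bytes, as they would be stored in a UTF-8 file


def texts_containing(root, keyword):
    """Return the text of every <two> element whose text contains keyword."""
    matches = []
    for content in root.findall("content"):
        for two in content.findall("two"):
            text = two.text or ""  # .text is None for an empty element
            if keyword in text:    # no str() call, so no implicit ASCII encode
                matches.append(text)
    return matches


root = ET.fromstring(SAMPLE_XML)
matches = texts_containing(root, "keyword")
```

Each entry in matches can then be passed to print directly. On Python 2 that still depends on the terminal's encoding being able to represent the characters, so if print itself raises UnicodeEncodeError, the fix is to encode explicitly at output time (for example text.encode('utf-8')) rather than to call str on the text.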