I am reading a ginormous (multi-gigabyte) XML file using the iterparse() function from Python's xml.etree.ElementTree module. The problem is that some of the file's text occasionally triggers Unicode errors (or at least what Python 3 treats as Unicode errors). My loop is set up like this:
    import xml.etree.ElementTree as etree

    def foo():
        # ...
        f = open(filename, encoding='utf-8')
        xmlit = iter(etree.iterparse(f, events=('start', 'end')))
        (event, root) = next(xmlit)
        for (event, elem) in xmlit:  # line 26
            if event != 'end':
                continue
            if elem.tag == 'foo':
                do_something()
                root.clear()
            elif elem.tag == 'bar':
                do_something_else()
                root.clear()
            # ...
When the element with the Unicode error is encountered, I get an error with the following traceback:
    Traceback (most recent call last):
      File "<path to above file>", line 26, in foo
        for (event, elem) in xmlit:
      File "C:\Python32\lib\xml\etree\ElementTree.py", line 1314, in __next__
        self._parser.feed(data)
      File "C:\Python32\lib\xml\etree\ElementTree.py", line 1668, in feed
        self._parser.Parse(data, 0)
    UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 16383: surrogates not allowed
Because the error is raised while the iterator advances, in between for loop iterations, the only place I can put a try block is around the whole for loop, which means I cannot continue to the next XML element.
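One detail in the traceback may matter: position 16383 is the last index of a 16384-character chunk, which appears to be the amount iterparse() reads at a time. On a Python build that stores non-BMP characters as surrogate pairs, a text-mode read could plausibly split a pair at the chunk boundary and leave a lone '\ud800' for the parser to encode. That is an assumption, not something the traceback confirms. A minimal sketch of a variant that sidesteps the issue by handing iterparse() a binary-mode stream, so the parser decodes the bytes itself (iter_end_elements and the sample data are hypothetical, for illustration only):

```python
import io
import xml.etree.ElementTree as etree


def iter_end_elements(source, tags):
    """Yield (tag, text) for each closed element whose tag is in `tags`.

    `source` is a filename or a file object opened in *binary* mode, so the
    parser receives raw bytes rather than pre-decoded text chunks.
    """
    xmlit = iter(etree.iterparse(source, events=('start', 'end')))
    event, root = next(xmlit)  # the first event is the start of the root element
    for event, elem in xmlit:
        if event == 'end' and elem.tag in tags:
            yield elem.tag, elem.text
            root.clear()  # drop processed children to keep memory bounded


# Hypothetical sample data containing a character outside the BMP (U+10000).
sample = '<root><foo>a\U00010000b</foo><bar>c</bar></root>'.encode('utf-8')
results = list(iter_end_elements(io.BytesIO(sample), {'foo', 'bar'}))
```

With a real file, the io.BytesIO object would be replaced by open(filename, 'rb').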
My priorities for a solution, in order of preference, are as follows:
- Receive a not-necessarily-valid Unicode string as the element's text, instead of having an exception raised.
- Receive a valid Unicode string with the invalid character replaced or removed.
- Skip the element with the invalid character and move on to the next one.
How can I implement any of these solutions without modifying the ElementTree code myself?
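For the second priority, one idea is to scrub the text before the parser ever sees it. The sketch below assumes Python 3.3 or later (where strings hold full code points, so any surrogate in a str is a lone one) and relies on iterparse() accepting any object with a read() method. SurrogateScrubber is a hypothetical helper, not part of ElementTree, and the sample input is made up:

```python
import io
import re
import xml.etree.ElementTree as etree

# Surrogate code points (U+D800..U+DFFF) cannot be encoded as UTF-8.
_SURROGATES = re.compile('[\ud800-\udfff]')


class SurrogateScrubber:
    """Hypothetical file-like wrapper that replaces surrogates in text it reads."""

    def __init__(self, fileobj, replacement='\ufffd'):
        self._fileobj = fileobj          # underlying text-mode file object
        self._replacement = replacement  # U+FFFD REPLACEMENT CHARACTER by default

    def read(self, size=-1):
        # Scrub each chunk as it is read, so the whole file is never held in memory.
        return _SURROGATES.sub(self._replacement, self._fileobj.read(size))


# Made-up input containing a lone surrogate inside an element's text.
sample = io.StringIO('<root><foo>a\ud800b</foo></root>')
texts = [elem.text
         for event, elem in etree.iterparse(SurrogateScrubber(sample), events=('end',))
         if elem.tag == 'foo']
```

Passing replacement='' would remove the offending characters instead of substituting them. Since the wrapper only touches read(), the loop over iterparse() itself stays unchanged.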