How to get around Unicode errors in xml.etree.ElementTree.iterparse()?

Question

I am reading a ginormous (multi-gigabyte) XML file using Python's xml.etree.ElementTree module's iterparse() method. The problem is there are occasional Unicode errors (or at least what Python 3 thinks are Unicode errors) in some of the XML file's text. My loop is set up like this:

import xml.etree.ElementTree as etree

def foo():
    # ...
    f = open(filename, encoding='utf-8')
    xmlit = iter(etree.iterparse(f, events=('start', 'end')))
    (event, root) = next(xmlit)
    for (event, elem) in xmlit: # line 26
        if event != 'end':
            continue
        if elem.tag == 'foo':
            do_something()
            root.clear()
        elif elem.tag == 'bar':
            do_something_else()
            root.clear()
    # ...

When the element with the Unicode error is encountered, I get an error with the following traceback:

Traceback (most recent call last):
  File "<path to above file>", line 26, in foo
    for (event, elem) in xmlit:
  File "C:\Python32\lib\xml\etree\ElementTree.py", line 1314, in __next__
    self._parser.feed(data)
  File "C:\Python32\lib\xml\etree\ElementTree.py", line 1668, in feed
    self._parser.Parse(data, 0)
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 16383: surrogates not allowed

Since the error occurs in between for loop iterations, the only place I can wrap a try block is outside the for loop, which would mean I cannot continue to the next XML element.

My priorities for a solution are as follows:

1. Receive a not-necessarily-valid Unicode string as the element's text, instead of having an exception raised.
2. Receive a valid Unicode string with the invalid character replaced or removed.
3. Skip the element with the invalid character and move on to the next one.

How can I implement any of these solutions, without going and modifying the ElementTree code myself?

abarnert · Accepted Answer · 2013-01-04 20:16:55Z

First, all the stuff about ElementTree is probably irrelevant here. Try just enumerating the file returned by f = open(filename, encoding='utf-8'), and you will probably get the same error.

If so, the solution is to override the default encoding error handler, as explained in the docs:

errors is an optional string that specifies how encoding and decoding errors are to be handled–this cannot be used in binary mode. Pass 'strict' to raise a ValueError exception if there is an encoding error (the default of None has the same effect), or pass 'ignore' to ignore errors. (Note that ignoring encoding errors can lead to data loss.) 'replace' causes a replacement marker (such as '?') to be inserted where there is malformed data. When writing, 'xmlcharrefreplace' (replace with the appropriate XML character reference) or 'backslashreplace' (replace with backslashed escape sequences) can be used. Any other error handling name that has been registered with codecs.register_error() is also valid.

So, you can do this:

f = open(filename, encoding='utf-8', errors='replace')

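This fits your second priority: when decoding, each invalid byte sequence is replaced by U+FFFD, the Unicode replacement character (the '?' marker is what you get when encoding).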
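To make the effect of the error handler concrete, here is a minimal sketch that decodes a deliberately broken byte string. The handler name 'mark_invalid' and the bracketed marker format are assumptions made up for this illustration; codecs.register_error() is the hook the quoted documentation mentions for supplying your own replacement text.

```python
import codecs

def mark_invalid(exc):
    """Hypothetical handler: replace each undecodable byte with a visible marker."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        # Square brackets are used so the marker cannot be mistaken for XML markup.
        marker = ''.join('[0x{:02X}]'.format(b) for b in bad)
        return (marker, exc.end)  # replacement text, position to resume decoding
    raise exc

# Register the handler under an illustrative name.
codecs.register_error('mark_invalid', mark_invalid)

raw = b'caf\xff!'  # 0xFF can never appear in valid UTF-8

replaced = raw.decode('utf-8', errors='replace')
marked = raw.decode('utf-8', errors='mark_invalid')
```

Once registered, the same name can be passed as the errors argument to open(), so an encoding error in the file can be told apart from a literal character in the data.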

There is no way to fit your first priority, because there's no way to represent a "not-necessarily-valid Unicode string". A Unicode string is, by definition, a sequence of Unicode code points, and that's how Python treats the str type. If you have invalid UTF-8 and want to turn that into a string, you need to specify how it should be turned into a string, and that's what errors is for.

You could, alternatively, open the file in binary mode, and leave the UTF-8 alone as a bytes object instead of trying to turn it into a Unicode str object, but then you can only use APIs that work with bytes objects. (I believe the lxml implementation of ElementTree can actually do this, but the built-in one can't, but don't quote me on that.) But even if you did that, it wouldn't get you very far, because the XML code itself is going to try to interpret the invalid UTF-8, and then it needs to know what you want to do with errors, and that's usually going to be harder to specify because it's farther down.
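For reference, the standard-library iterparse() does accept a file object opened in binary mode, in which case the parser decodes the bytes itself. The sketch below uses an in-memory io.BytesIO in place of a real file and assumes the data is well-formed UTF-8; the element names are taken from the question, and the sample character U+1D11E is only an example of a code point outside the Basic Multilingual Plane. It does not show what happens with genuinely invalid bytes.

```python
import io
import xml.etree.ElementTree as etree

# A tiny document containing a character outside the Basic Multilingual Plane.
xml_bytes = '<root><foo>\U0001D11E</foo></root>'.encode('utf-8')

texts = []
# BytesIO stands in for open(filename, 'rb'); the parser handles the decoding.
for event, elem in etree.iterparse(io.BytesIO(xml_bytes), events=('end',)):
    if elem.tag == 'foo':
        texts.append(elem.text)
```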

One last point:

Since the error occurs in between for loop iterations, the only place I can wrap a try block is outside the for loop, which would mean I cannot continue to the next XML element.

Well, you don't actually have to use a for loop; you can transform it into a while loop with explicit next calls. Any time you need to do this, it's usually a sign that you're doing something wrong—but sometimes it's a sign that you're dealing with a broken library, and it's the only workaround available.

This:

for (event, elem) in xmlit: # line 26
    doStuffWith(event, elem)

Is effectively equivalent to:

while True:
    try:
        event, elem = next(xmlit)
    except StopIteration:
        break
    doStuffWith(event, elem)

And now, there is an obvious place to add a try—although you don't even really need to; you can just attach another except to the existing try.

The problem is, what are you going to do here? There is no guarantee that the iterator will be able to continue after it throws an exception. In fact, all of the simplest ways to create iterators will not be able to do so. You can test for yourself whether that's true in this case.
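One way to run that test on a simple case: the sketch below uses a throwaway generator (the name numbers is made up for this example) that raises partway through. Once the exception has propagated out of it, the generator is finished, and the next call raises StopIteration rather than resuming.

```python
def numbers():
    """Toy generator that fails after its first item."""
    yield 1
    raise ValueError('boom')
    yield 2  # never reached

it = numbers()
results = []

results.append(next(it))  # first item arrives normally

try:
    next(it)
except ValueError:
    results.append('error')  # the exception escapes the generator

try:
    next(it)
except StopIteration:
    results.append('finished')  # the generator cannot be resumed
```

Whether iterparse() behaves like this, or can keep going after an error, has to be checked the same way against the real parser.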

In the rare cases when you need to do this, and it actually helps, you'd probably want to wrap it up. Something like this:

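import logging

def skip_exceptions(it):
    """Yield items from it, logging and skipping any exception raised by next()."""
    while True:
        try:
            yield next(it)
        except StopIteration:
            # Return instead of re-raising: since PEP 479 (Python 3.7+), a
            # StopIteration escaping a generator becomes a RuntimeError.
            return
        except Exception as e:
            logging.info('Skipping iteration because of exception {}'.format(e))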

Then you just do:

for (event, elem) in skip_exceptions(xmlit):
    doStuffWith(event, elem)

edited Jan 4, 2013 at 20:16

answered Jan 4, 2013 at 20:08

abarnert


6 Comments

Francis Avila Over a year ago

iterparse() can accept a binary stream. In fact it may be better to feed it binary and let the XMLParser object figure out encoding according to XML semantics. The OP's problem is probably that this isn't actually utf-8 but something else (best case) or that it's mixed-encoding and thus invalid XML (worst case; he needs to use an error handler like you suggest, but those characters are lost forever barring heroic efforts to reconstruct them.)

abarnert Over a year ago

@FrancisAvila: The OP says that there are "occasional Unicode errors"—if it were something other than UTF-8, there would usually be frequent errors, and even more frequent incorrect strings. At any rate, letting etree decode the bytes isn't going to solve the problem, it's going to require the exact same solution, but make it harder to implement (which I already explained in the answer). Also, if you look at what he's asking for, an error handler seems to be exactly what he wants here.

Matt Over a year ago

@abarnert Thank you very much! Is there any way to specify what to replace the error with? I don't want to confuse an encoding error with a literal '?'.

Matt Over a year ago

@abarnert Nevermind, the docs you linked to cover that subject well enough. Thanks!

Matt Over a year ago

@abarnert Actually, I just realized that this does not work. I investigated the source of the error, and it turns out the UTF-8 is perfectly valid. Python can read, decode, and encode that line of the XML file perfectly fine. It is only when I try to read it with ElementTree that there is a problem. The error occurs when reading a character outside the Basic Multilingual Plane, which is therefore represented as a surrogate pair. Surrogate pairs work perfectly when together, but cause this sort of error when separated. So the problem is somewhere within ElementTree. Is there any way around this?
