Forcing encoding on bad XML files with ElementTree

Question

A large set of XML files declares the wrong encoding. The declaration says UTF-8, but the content has latin-1 characters all over the place. What's the best way to parse this content?

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

Edit: this is happening with Adobe InDesign IDML files. It seems the "Content" text is latin-1 while the rest could be UTF-8. I'm leaning toward parsing normally as UTF-8, then re-encoding the Unicode text chunks in Content to UTF-8 and re-parsing them as latin-1. What a mess. ಠ_ಠ

Katriel · Accepted Answer · 2011-03-11 16:13:45Z

2

You can override the encoding specified in the XML when you parse it:

class xml.etree.ElementTree.XMLParser(html=0, target=None, encoding=None)
Element structure builder for XML source data, based on the expat parser. html are predefined HTML entities. This flag is not supported by the current implementation. target is the target object. If omitted, the builder uses an instance of the standard TreeBuilder class. encoding is optional. If given, the value overrides the encoding specified in the XML file.

(Quoted from the Python documentation for xml.etree.ElementTree.)
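A minimal sketch of that override, assuming Python 3 and a file that is entirely latin-1 despite its UTF-8 declaration. The sample bytes are made up for illustration, and the encoding is passed by keyword because the positional signature quoted above differs between Python versions:

```python
import xml.etree.ElementTree as ET

# Illustrative input: declared as UTF-8, but "\xe9" is a latin-1 byte.
raw = b'<?xml version="1.0" encoding="UTF-8"?><Content>caf\xe9</Content>'

# The encoding given here takes precedence over the XML declaration.
parser = ET.XMLParser(encoding="ISO-8859-1")
root = ET.fromstring(raw, parser=parser)

print(root.text)
```

For files on disk, the same parser object can be passed as `ET.parse(path, parser=parser)`. Note that this applies one encoding to the whole document, so it does not help if only some parts are latin-1.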

1 Comment

alecco Over a year ago

Ah, I tried this but got an error. It seems the encoding argument is new in Python 2.7. Thanks.

Ekkehard.Horner · Accepted Answer · 2011-03-11 16:13:22Z

1

Don't try to deal with encoding problems during parse, but pre-process the offending file(s).
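One way to pre-process, sketched under the assumption that each file is latin-1 throughout: decode the raw bytes with the real encoding, re-encode them as UTF-8 so the content matches the declaration, and only then parse. The function name `parse_mislabelled` and the sample bytes are hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_mislabelled(raw: bytes, actual_encoding: str = "latin-1") -> ET.Element:
    """Transcode bytes from their real encoding to UTF-8, then parse."""
    text = raw.decode(actual_encoding)
    # The declaration already says UTF-8, so it is now truthful.
    return ET.fromstring(text.encode("utf-8"))


# Illustrative input: declared as UTF-8, but "\xe9" is a latin-1 byte.
sample = b'<?xml version="1.0" encoding="UTF-8"?><Content>caf\xe9</Content>'
fixed_root = parse_mislabelled(sample)
print(fixed_root.text)
```

The transcoded bytes could also be written back to disk once, so that every later tool sees a correctly declared file.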

1 Comment

alecco Over a year ago

It might be more complicated than I thought, and there might be some real UTF-8 content in the files. I will have to encode the Unicode text back to UTF-8 bytes, and then force re-decoding as latin-1 in the specific places where it might happen.
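For files that mix genuine UTF-8 with stray latin-1 bytes, a different and simpler heuristic than the one described in this comment is to decode each line as UTF-8 and fall back to latin-1 only when that fails. The sketch below is an assumption, not something taken from the thread; `decode_mixed` and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET


def decode_mixed(raw: bytes) -> str:
    """Decode line by line: strict UTF-8 first, latin-1 as the fallback."""
    pieces = []
    for line in raw.splitlines(keepends=True):
        try:
            pieces.append(line.decode("utf-8"))
        except UnicodeDecodeError:
            # latin-1 maps every byte to a character, so this cannot fail.
            pieces.append(line.decode("latin-1"))
    return "".join(pieces)


# Illustrative input: one line is valid UTF-8, the other holds a latin-1 byte.
mixed = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<Story>\n'
    b'<Content>caf\xc3\xa9</Content>\n'
    b'<Content>caf\xe9</Content>\n'
    b'</Story>\n'
)

mixed_root = ET.fromstring(decode_mixed(mixed).encode("utf-8"))
texts = [c.text for c in mixed_root.findall("Content")]
print(texts)
```

This has limits: a single line containing both encodings is decoded entirely as latin-1, and a latin-1 byte sequence that happens to be valid UTF-8 is silently read as UTF-8, so the results are worth spot-checking.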
