
A large set of XML files declares the wrong encoding. The declaration says UTF-8, but the content has Latin-1 characters all over the place. What's the best way to parse this content?

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

Edit: this is happening with Adobe InDesign IDML files. It seems the "Content" text is Latin-1, while the rest could be UTF-8. I'm leaning toward parsing normally as UTF-8, then re-encoding the Unicode text chunks in Content back to UTF-8 and re-parsing them as Latin-1. What a mess. ಠ_ಠ

2 Answers


You can override the encoding specified in the XML when you parse it:

class xml.etree.ElementTree.XMLParser(html=0, target=None, encoding=None)

Element structure builder for XML source data, based on the expat parser. html are predefined HTML entities. This flag is not supported by the current implementation. target is the target object. If omitted, the builder uses an instance of the standard TreeBuilder class. encoding is optional. If given, the value overrides the encoding specified in the XML file.

docs
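As a minimal sketch of that override, assuming Python 3 and a file whose bytes are entirely Latin-1 despite the UTF-8 declaration (the sample document below is made up for illustration):

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: declared as UTF-8, but "é" is stored as the single Latin-1 byte 0xE9.
data = b'<?xml version="1.0" encoding="UTF-8"?><Content>caf\xe9</Content>'

# The encoding argument takes precedence over the encoding named in the XML declaration.
parser = ET.XMLParser(encoding="iso-8859-1")
root = ET.fromstring(data, parser=parser)

print(root.text)
```

For files on disk, the same parser object can be passed as `ET.parse(path, parser=parser)`. Note that this treats every byte of the file as Latin-1, so any genuine UTF-8 sequences in the same file would be misread.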


1 Comment

Ah, I tried this but got an error. Seems it's new in Python 2.7. Thanks.

Don't try to deal with encoding problems during parsing; pre-process the offending file(s) instead.
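One way such a pre-processing step could look, as a rough sketch assuming Python 3. The helper name `normalize_to_utf8` and the sample bytes are hypothetical, and the whole-file Latin-1 fallback is an assumption about the data, not something the files guarantee:

```python
import xml.etree.ElementTree as ET

def normalize_to_utf8(raw: bytes, fallback: str = "latin-1") -> bytes:
    """Return raw unchanged if it is valid UTF-8, else transcode it from the fallback encoding."""
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        # Assumption: a file that is not valid UTF-8 is entirely in the fallback encoding.
        return raw.decode(fallback).encode("utf-8")

# Hypothetical sample: declared as UTF-8, but contains the Latin-1 byte 0xE9.
broken = b'<?xml version="1.0" encoding="UTF-8"?><Content>caf\xe9</Content>'
fixed = normalize_to_utf8(broken)

# After normalization the declaration is truthful, so the default parser works.
root = ET.fromstring(fixed)
print(root.text)
```

The cleaned bytes can be written back to disk once, so every later tool sees a file whose declaration matches its content.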

1 Comment

It might be more complicated than I thought: there might be some real UTF-8 content in the files. I will have to encode the Unicode text back to UTF-8 and then force re-parsing as Latin-1 in the specific places where it happens.
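For files that mix both encodings, a different technique from the re-encode approach described in this comment is a byte-level fallback: decode as UTF-8 and interpret only the undecodable bytes as Latin-1. The sketch below assumes Python 3. The handler name `latin1_fallback`, the helper `decode_mixed`, and the sample document are all made up for illustration. It is a heuristic: Latin-1 byte pairs that happen to form valid UTF-8 will still be read as UTF-8.

```python
import codecs
import xml.etree.ElementTree as ET

def _latin1_fallback(exc):
    # Invoked by the decoder for each run of bytes that is not valid UTF-8.
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Interpret the offending bytes as Latin-1 and resume decoding right after them.
    return bad.decode("latin-1"), exc.end

# Register the handler under a hypothetical name so decode() can refer to it.
codecs.register_error("latin1_fallback", _latin1_fallback)

def decode_mixed(raw: bytes) -> str:
    """Decode as UTF-8, falling back to Latin-1 only for invalid byte runs."""
    return raw.decode("utf-8", errors="latin1_fallback")

# Hypothetical sample: the first Content is Latin-1 (0xE9), the second is real UTF-8 (0xC3 0xAF).
mixed = (b'<?xml version="1.0" encoding="UTF-8"?>'
         b'<Story><Content>caf\xe9</Content><Content>na\xc3\xafve</Content></Story>')

# Re-encode the repaired text as UTF-8 so it matches the declaration before parsing.
root = ET.fromstring(decode_mixed(mixed).encode("utf-8"))
texts = [c.text for c in root.iter("Content")]
print(texts)
```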
