
I receive many XML files, and some of them declare the wrong encoding (e.g. the XML header says ISO-8859-1, but all the strings are actually UTF-8, and so on).

I parse them with xml.etree.ElementTree, which reads the encoding from the XML header (and that encoding is sometimes wrong):

import xml.etree.ElementTree
input_element = xml.etree.ElementTree.parse("input.xml").getroot()

I would like to force a different encoding and ignore the one in the header.

Is there a simple way to do this?

  • As with all of these things: it's better to fix the source of the broken XML than to build a consumer that accommodates the producer's bugs. The XML declaration cannot disagree with the file encoding unless something is horribly wrong at the producing end. This should be addressed. Commented Mar 22, 2020 at 21:10
  • @Tomalak, yes, that is what I would prefer, but it's out of my hands. Commented Mar 23, 2020 at 7:52

1 Answer


If you are sure of the encoding, you can use open() to read the file into a string, and then use ElementTree.fromstring() to convert that string into an XML document.

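from xml.etree import ElementTree

with open("input.xml", encoding="Windows-1252") as fp:
    xml_string = fp.read()  # decoded with the encoding you chose, not the declared one
root = ElementTree.fromstring(xml_string)  # fromstring() returns the root Element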

This ignores the encoding named in the XML declaration, because the file has already been decoded (manually) by the time the parser sees it. For normal, compliant XML documents this approach is not recommended; use ElementTree.parse('filename') instead.
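A different route, not the one used in the answer above, is to hand the parser an explicit encoding: the standard library's XMLParser accepts an encoding argument that overrides the one declared in the file. The following is only a minimal sketch under assumptions: the helper name parse_with_encoding and the sample document are made up for illustration, and UTF-8 is assumed to be the real encoding, as in the question's example.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical sample: the header claims ISO-8859-1, but the bytes are UTF-8.
text = "caf\u00e9"
broken_xml = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><root>' + text + "</root>"
).encode("utf-8")


def parse_with_encoding(path, encoding):
    """Parse the file at `path` as `encoding`, ignoring its XML declaration.

    Illustrative helper, not part of ElementTree itself.
    """
    parser = ET.XMLParser(encoding=encoding)  # overrides the declared encoding
    return ET.parse(path, parser=parser).getroot()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "input.xml")
    with open(path, "wb") as fp:
        fp.write(broken_xml)

    declared = ET.parse(path).getroot()          # trusts the (wrong) header
    forced = parse_with_encoding(path, "utf-8")  # forces the real encoding

print(repr(declared.text))  # mis-decoded text
print(repr(forced.text))    # the intended text
```

Compared with reading the file into a string first, this keeps the decoding inside the parser and works directly on the file path, but it still only helps when you know the real encoding in advance.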
