5

I have a number of XML files I'd like to process with a script, converting them from whatever encoding they're in to UTF-8.

Using the code given in this great answer I can do the conversion, but how can I read the encoding given in the XML header?

For example, I have many files which are already in UTF-8, which should be left alone:

<?xml version="1.0" encoding="utf-8"?>

However, I have a lot of files which do need to be converted:

<?xml version="1.0" encoding="windows-1255"?>

How can I detect the encoding specified in the XML declaration of these files in Python? Better still, after I detect the encoding and re-encode the files, how can I change the XML declaration to read "utf-8" so the files are skipped in future runs?

3 Answers

5

Use lxml to do the parsing; you can then access the original encoding with:

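from lxml import etree

def is_already_utf8(filename):
    # Open in binary mode so lxml decodes the bytes using the XML declaration.
    with open(filename, 'rb') as xmlfile:
        tree = etree.parse(xmlfile)
    # docinfo.encoding keeps the declared spelling (e.g. 'UTF-8'), so compare
    # case-insensitively; a missing encoding is treated as UTF-8 here.
    return (tree.docinfo.encoding or 'utf-8').lower() == 'utf-8'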

You can then use lxml to write the file out again in UTF-8.
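
As a minimal sketch of that last step, assuming lxml is installed: the function name `convert_to_utf8` and both path parameters are made up for illustration. `tree.write` with `encoding` and `xml_declaration=True` re-serializes the document and emits a fresh declaration, so the header ends up naming UTF-8.

```python
from lxml import etree


def convert_to_utf8(src_path, dst_path):
    """Parse src_path and write it to dst_path as UTF-8.

    Returns the encoding that the source document declared.
    (Hypothetical helper; the name and parameters are illustrative.)
    """
    # lxml reads the declared encoding from the XML declaration and decodes accordingly.
    tree = etree.parse(src_path)
    # Re-serialize as UTF-8 and emit a new declaration naming that encoding.
    tree.write(dst_path, encoding="utf-8", xml_declaration=True)
    return tree.docinfo.encoding
```

Note that lxml re-serializes the whole document, so details such as quoting style in the declaration may differ from the original file.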


4 Comments

Is there a way to do this with Python's standard library?
@PiotrDobrogost: You could try to use the expat parser, but it'll be a lot more work if you need to re-code whole documents.
I was interested only in getting the encoding declaration itself. However, according to the docs, "Expat doesn’t support as many encodings as Python does, and its repertoire of encodings can’t be extended; it supports UTF-8, UTF-16, ISO-8859-1 (Latin1), and ASCII", and I need to process documents with ISO-8859-2 encoding. The truth is that Python still does not have adequate tools for XML in the standard library.
Thinking more about this: the fact that Expat does not support parsing an ISO-8859-2 encoded XML file does not necessarily mean it can't report the encoding declaration alone, since this is a SAX-style parser and the XML declaration comes first in the file. As it turns out, it is possible; see my answer.
2

How can I detect the XML encoding specified in the headers of these files in Python?

A solution by Rob Wolfe using only the standard library:

from xml.parsers import expat

s = """<?xml version='1.0' encoding='iso-8859-1'?>
       <book>
           <title>Title</title>
           <chapter>Chapter 1</chapter>
       </book>"""

class MyParser(object):
    def XmlDecl(self, version, encoding, standalone):
        print("XmlDecl", version, encoding, standalone)

    def Parse(self, data):
        Parser = expat.ParserCreate()
        Parser.XmlDeclHandler = self.XmlDecl
        Parser.Parse(data, 1)

parser = MyParser()
parser.Parse(s)
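
The same handler can be pointed at a file on disk. The following is a rough sketch rather than a tested recipe: the function name `declared_encoding` and the `head_size` parameter are invented here, and it assumes the XML declaration fits in the first kilobyte of the file. Only the head of the file is fed to the parser, and any error raised after the declaration has been seen is ignored, because the goal is only to read the declaration, not to decode the body.

```python
from xml.parsers import expat


def declared_encoding(path, head_size=1024):
    """Return the encoding named in the XML declaration of `path`, or None.

    Hypothetical helper: assumes the declaration fits in `head_size` bytes.
    """
    found = []

    def on_xml_decl(version, encoding, standalone):
        # expat calls this once when it reaches the <?xml ...?> declaration.
        found.append(encoding)

    parser = expat.ParserCreate()
    parser.XmlDeclHandler = on_xml_decl

    with open(path, "rb") as f:
        head = f.read(head_size)

    try:
        # Second argument False: the data is deliberately incomplete.
        parser.Parse(head, False)
    except (expat.ExpatError, LookupError, ValueError):
        # The parser may reject an encoding it cannot handle; by then the
        # declaration handler has already run, which is all we need.
        pass

    return found[0] if found else None
```

A file with no declaration, or with a declaration that omits `encoding`, gives `None`, which a caller would normally treat as UTF-8.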

2 Comments

But, as stated in your comments on my answer, expat doesn't actually support decoding codecs other than UTF-8, UTF-16, ISO-8859-1 and, by extension, ASCII. So don't try to use this to recode documents in other codecs, because you'll get mojibake instead.
@MartijnPieters Right, this is only for reading the encoding declaration.
0

I want to expand on @PiotrDobrogost's answer and actually write a class that retrieves an XML document's encoding:

from xml.parsers import expat

class XmlParser(object):
    '''Retrieves the encoding declared in an XML document.'''

    def get_encoding(self, xml):
        # Reset first so a document without an XML declaration yields None
        self.encoding = None
        self.__parse(xml)
        return self.encoding

    def __xml_decl_handler(self, version, encoding, standalone):
        # Called by expat when it reaches the <?xml ...?> declaration
        self.encoding = encoding

    def __parse(self, xml):
        parser = expat.ParserCreate()
        parser.XmlDeclHandler = self.__xml_decl_handler
        parser.Parse(xml)

Here is an example of its usage:

xml = """<?xml version='1.0' encoding='iso-8859-1'?>
    <book>
        <title>Title</title>
        <chapter>Chapter 1</chapter>
    </book>"""
parser = XmlParser()
encoding = parser.get_encoding(xml)
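
Once the declared encoding is known, the re-encoding and header rewrite asked about in the question could look roughly like this. It is a sketch under stated assumptions: `recode_to_utf8` is a made-up name, the declared XML encoding name is assumed to also be a codec name Python recognizes (true for common ones such as `windows-1255` and `iso-8859-1`, but not guaranteed), and the declaration is edited with a regular expression rather than a parser.

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration at the very
# start of the text (optionally preceded by a byte order mark).
_DECL_ENCODING = re.compile(
    r"""\A(\ufeff?<\?xml[^>]*?\bencoding\s*=\s*)(["'])[^"']*\2"""
)


def recode_to_utf8(data, encoding):
    """Decode `data` (bytes) using `encoding`; return UTF-8 bytes whose
    declaration names utf-8. Hypothetical helper for illustration."""
    # A missing declared encoding is treated as UTF-8.
    text = data.decode(encoding or "utf-8")
    # Keep the original quote character and replace only the encoding name.
    text = _DECL_ENCODING.sub(r"\g<1>\g<2>utf-8\g<2>", text, count=1)
    return text.encode("utf-8")
```

Everything outside the declaration is left untouched, so files that already declare utf-8 pass through unchanged and can be skipped on later runs by checking the declared encoding first.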
