Reading XML header encoding

Question

I have a number of XML files I'd like to process with a script, converting them from whatever encoding that they're in to UTF-8.

Using the code given in this great answer I can do the conversion, but how can I read the encoding given in the XML header?

For example, I have many files which are already in UTF-8, which should be left alone:

<?xml version="1.0" encoding="utf-8"?>

However, I have a lot of files which do need to be converted:

<?xml version="1.0" encoding="windows-1255"?>

How can I detect the XML encoding specified in the headers of these files in Python? Better, after I detect and reencode the files, how then can I change this XML header to read "utf-8" to avoid processing it in the future?

Martijn Pieters · Accepted Answer · 2014-09-11 21:01:57Z

5

Use lxml to do the parsing; you can then access the original encoding with:

from lxml import etree

with open(filename, 'r') as xmlfile:
    tree = etree.parse(xmlfile)
    if tree.docinfo.encoding == 'utf-8':
        # already in correct encoding, abort
        return

You can then use lxml to write the file out again in UTF-8.

edited Sep 11, 2014 at 21:01

answered Sep 11, 2014 at 20:28

Martijn Pieters

1.1m326 gold badges4.2k silver badges3.4k bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

Piotr Dobrogost Over a year ago

Is there a way to do this with Python's standard library?

Martijn Pieters Over a year ago

@PiotrDobrogost: You could try to use the expat parser, but it'll be a lot more work if you need to re-code whole documents.

Piotr Dobrogost Over a year ago

I was interested only in getting encoding declaration itself. However according to docs Expat doesn’t support as many encodings as Python does, and its repertoire of encodings can’t be extended; it supports UTF-8, UTF-16, ISO-8859-1 (Latin1), and ASCII. and I need to process documents with ISO-8859-2 encoding. The truth is Python still does not have adequate tools for XML in the standard library.

Piotr Dobrogost Over a year ago

Thinking more about this the fact that Expat does not support parsing of ISO-8859-2 encoded XML file does not necessarily mean it can't handle getting encoding declaration alone as this is SAX parser and the xml declaration comes first in the file. As it turns out it is possible – see my answer.

Linkid · Accepted Answer · 2016-03-10 13:54:29Z

2

How can I detect the XML encoding specified in the headers of these files in Python?

A solution by Rob Wolfe using only the standard library:

from xml.parsers import expat

s = """<?xml version='1.0' encoding='iso-8859-1'?>
       <book>
           <title>Title</title>
           <chapter>Chapter 1</chapter>
       </book>"""

class MyParser(object):
    def XmlDecl(self, version, encoding, standalone):
        print "XmlDecl", version, encoding, standalone

    def Parse(self, data):
        Parser = expat.ParserCreate()
        Parser.XmlDeclHandler = self.XmlDecl
        Parser.Parse(data, 1)

parser = MyParser()
parser.Parse(s)

edited Mar 10, 2016 at 13:54

Linkid

5576 silver badges17 bronze badges

answered Feb 19, 2016 at 15:27

Piotr Dobrogost

42.7k47 gold badges243 silver badges379 bronze badges

2 Comments

Martijn Pieters Over a year ago

But, as stated in your comments to my answers, expat doesn't actually support decoding codecs other than UTF-8, UTF-16, ISO-8859-1, and by extension, ASCII. So don't try to use this to recode documents in other codecs, because you'll get a Mojibake instead.

Piotr Dobrogost Over a year ago

@MartijnPieters Right, this is only for reading encoding declaration.

Mykhailo Seniutovych · Accepted Answer · 2018-10-26 12:51:48Z

I want to expand on @PiotrDobrogost's answer and actually write a class that retrieves XML document encoding:

from xml.parsers import expat

class XmlParser(object):
    '''class used to retrive xml documents encoding
    '''

    def get_encoding(self, xml):
        self.__parse(xml)
        return self.encoding

    def __xml_decl_handler(self, version, encoding, standalone):
        self.encoding = encoding

    def __parse(self, xml):
        parser = expat.ParserCreate()
        parser.XmlDeclHandler = self.__xml_decl_handler
        parser.Parse(xml)

And here is the example of its usage:

xml = """<?xml version='1.0' encoding='iso-8859-1'?>
    <book>
        <title>Title</title>
        <chapter>Chapter 1</chapter>
    </book>"""
parser = XmlParser()
encoding = parser.get_encoding(xml)

Collectives™ on Stack Overflow

Reading XML header encoding

3 Answers 3

4 Comments

2 Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

4 Comments

2 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related