How to read XML header in Python

Question

How can I read the header of an XML document in Python 3?

Ideally, I would use the defusedxml module as the documentation states that it's safer, but at this point (after hours of trying to figure this out), I'd settle for any parser.

For example, I have a document (this is actually from an exercise) that looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple Computer//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0"> <!-- this is root -->
    <!-- CONTENTS -->
</plist>

I'm wondering how to access everything before the root node.

This seems like such a general question that I thought I would easily find an answer online, but I guess I was wrong. The closest thing I found was this question on Stack Overflow, which didn't really help (I looked into xml.sax, but couldn't find anything relevant).

qwermike · Accepted Answer · 2018-02-23 19:19:40Z

5

I tried minidom, which is vulnerable to billion laughs and quadratic blowup attacks according to the link you provided. Here is my code:

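from xml.dom.minidom import parse

dom = parse('file.xml')

# The XML declaration is not a node; rebuild it from the document attributes.
print('<?xml version="{}" encoding="{}"?>'.format(dom.version, dom.encoding))

# Three ways to reach the DOCTYPE in this document:
print(dom.doctype.toxml())
# or: the node just before the root element
print(dom.getElementsByTagName('plist')[0].previousSibling.toxml())
# or: the first child of the document
print(dom.childNodes[0].toxml())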

Output:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist  PUBLIC '-//Apple Computer//DTD PLIST 1.0//EN'  'http://www.apple.com/DTDs/PropertyList-1.0.dtd'>
<!DOCTYPE plist  PUBLIC '-//Apple Computer//DTD PLIST 1.0//EN'  'http://www.apple.com/DTDs/PropertyList-1.0.dtd'>
<!DOCTYPE plist  PUBLIC '-//Apple Computer//DTD PLIST 1.0//EN'  'http://www.apple.com/DTDs/PropertyList-1.0.dtd'>

You can use minidom from defusedxml. I downloaded that package, replaced the import with from defusedxml.minidom import parse, and the code worked with the same output.
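The three expressions above all return the DOCTYPE for this particular file. A more general sketch, assuming the goal is every document-level node before the root (comments and processing instructions as well as the DOCTYPE), could walk dom.childNodes until it reaches dom.documentElement. The helper name prolog_nodes and the inline sample document are made up for illustration. The stdlib minidom is used here; as noted above, swapping the import for defusedxml.minidom should behave the same way.

```python
from xml.dom import minidom

# Sample input (an assumption for illustration): the document from the
# question, plus a comment placed before the DOCTYPE.
XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<!-- a comment before the doctype -->
<!DOCTYPE plist PUBLIC "-//Apple Computer//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
</plist>
"""


def prolog_nodes(dom):
    """Return the document-level nodes that come before the root element."""
    nodes = []
    for node in dom.childNodes:
        if node is dom.documentElement:
            break
        nodes.append(node)
    return nodes


dom = minidom.parseString(XML)

# The XML declaration is exposed as attributes, not as a child node.
print('<?xml version="{}" encoding="{}"?>'.format(dom.version, dom.encoding))

header = [node.toxml() for node in prolog_nodes(dom)]
for line in header:
    print(line)
```

Stopping at documentElement rather than indexing childNodes[0] means the loop does not depend on how many nodes precede the root.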

edited Feb 23, 2018 at 19:19

answered Feb 23, 2018 at 15:50

qwermike


2 Comments

Ratler Over a year ago

Brilliant! That's exactly what I was looking for. The third option (childNodes[0]) seems to be the most generic for getting all headers.

qwermike Over a year ago

I'm glad I helped :-)

mzjn · Answer · 2018-02-23 16:46:08Z

5

With the lxml library, you can access document properties via a DocInfo object.

from lxml import etree

tree = etree.parse('input.xml')
info = tree.docinfo
v, e, d = info.xml_version, info.encoding, info.doctype

print('<?xml version="{}" encoding="{}"?>'.format(v, e))
print(d)

Output:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple Computer//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">

answered Feb 23, 2018 at 16:46

mzjn


1 Comment

Ratler Over a year ago

Thanks! This works perfectly well, but I've accepted @mike-kaskun's answer because (a) of defusedxml and (b) minidom seems to be a default package (at least on my system) vs lxml which I had to install.

Usman · Answer · 2018-02-23 05:35:51Z

0

Try this code. It assumes the XML is held in the string variable s.

The MyParser class has two methods. XmlDecl is a handler that prints the fields of the XML declaration. Parse creates an expat parser with the ParserCreate() function from xml.parsers.expat, registers XmlDecl as its XmlDeclHandler, and parses the data.

Then create a MyParser instance named parser and call its Parse method with the XML string.

from xml.parsers import expat

s = """<?xml version='1.0' encoding='iso-8859-1'?>
       <book>
           <title>Title</title>
           <chapter>Chapter 1</chapter>
       </book>"""

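class MyParser(object):
    # Handler called by expat when it reads the <?xml ...?> declaration.
    def XmlDecl(self, version, encoding, standalone):
        print("XmlDecl", version, encoding, standalone)

    def Parse(self, data):
        # Create an expat parser and register the declaration handler.
        xml_parser = expat.ParserCreate()
        xml_parser.XmlDeclHandler = self.XmlDecl
        # True marks this as the final chunk of input.
        xml_parser.Parse(data, True)

parser = MyParser()
parser.Parse(s)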
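The code above reports only the XML declaration. pyexpat also has a StartDoctypeDeclHandler, so a sketch along the following lines could collect the DOCTYPE as well. The names read_prolog and found are hypothetical, and the sample input is the document from the question.

```python
from xml.parsers import expat

# Sample input: the document from the question.
XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple Computer//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
</plist>
"""


def read_prolog(data):
    """Collect the XML declaration and DOCTYPE fields reported by expat."""
    found = {}

    def on_xml_decl(version, encoding, standalone):
        # standalone is -1 when the declaration omits it.
        found["xml_decl"] = (version, encoding, standalone)

    def on_doctype(name, system_id, public_id, has_internal_subset):
        # expat passes the system id before the public id.
        found["doctype"] = (name, public_id, system_id)

    parser = expat.ParserCreate()
    parser.XmlDeclHandler = on_xml_decl
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(data, True)  # True marks the final chunk of input
    return found


found = read_prolog(XML)
print(found["xml_decl"])
print(found["doctype"])
```

Comments and processing instructions before the root would need their own handlers (CommentHandler, ProcessingInstructionHandler), which this sketch leaves out.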

edited Feb 23, 2018 at 5:35

answered Feb 23, 2018 at 4:26

Usman


3 Comments

Ratler Over a year ago

Thanks, but see clarification in question. Also, I find it difficult to follow your code; maybe some comments or simplifications would help.

Usman Over a year ago

Yes, sure! I'll update the description above in a while, @Ratler.

Ratler Over a year ago

That doesn't really help, actually. And it's still not getting the full headers before the root node.
