47

I need to load an XML file and convert the contents into an object-oriented Python structure. I want to take this:

<main>
    <object1 attr="name">content</object>
</main>

And turn it into something like this:

main
main.object1 = "content"
main.object1.attr = "name"

The XML data will have a more complicated structure than that and I can't hard code the element names. The attribute names need to be collected when parsing and used as the object properties.

How can I convert XML data into a Python object?

7 Answers 7

65

It's worth looking at lxml.objectify.

xml = """<main>
<object1 attr="name">content</object1>
<object1 attr="foo">contenbar</object1>
<test>me</test>
</main>"""

from lxml import objectify

main = objectify.fromstring(xml)
main.object1[0]             # content
main.object1[1]             # contenbar
main.object1[0].get("attr") # name
main.test                   # me

Or the other way around to build xml structures:

item = objectify.Element("item")
item.title = "Best of python"
item.price = 17.98
item.price.set("currency", "EUR")

order = objectify.Element("order")
order.append(item)
order.item.quantity = 3
order.price = sum(item.price * item.quantity for item in order.item)

import lxml.etree
print(lxml.etree.tostring(order, pretty_print=True))

Output:

<order>
  <item>
    <title>Best of python</title>
    <price currency="EUR">17.98</price>
    <quantity>3</quantity>
  </item>
  <price>53.94</price>
</order>
Sign up to request clarification or add additional context in comments.

4 Comments

When I run your generation example using lxml version 2.2 beta1, my XML is full of type annotations ("<title py:pytype="str">..."). Is there a way to supress that?
you can use lxml.etree.cleanup_namespaces(order)
You actually want to use both lxml.objectify.deannotate(order) and lxml.etree.cleanup_namespaces(order).
Beware of some vulnerabilities, in particular with objectif: bandit.readthedocs.io/en/1.7.9/blacklists/…
9

I've been recommending this more than once today, but try Beautiful Soup (easy_install BeautifulSoup).

from BeautifulSoup import BeautifulSoup

xml = """
<main>
    <object attr="name">content</object>
</main>
"""

soup = BeautifulSoup(xml)
# look in the main node for object's with attr=name, optionally look up attrs with regex
my_objects = soup.main.findAll("object", attrs={'attr':'name'})
for my_object in my_objects:
    # this will print a list of the contents of the tag
    print my_object.contents
    # if only text is inside the tag you can use this
    # print tag.string

4 Comments

main.findAll need to be soup.findAll, but that helped a bit. Still not exactly what I wanted--but I think I may have an idea of how to get it to work. It's going to be used in external py files that will be interpretted by the app, so I can probably just remap them before execution.
I fixed the bugs in the code and updated the xml. I simply copied the original code giving the in the question.
BeautifulSoup (BeutifulStoneSoup) breaks with empty tags <element />, e.g. <icon data="/ig/images/weather/partly_cloudy.gif"/> - and those are aplenty in xml :(
This should be updated to use BeautifulSoup4. The old version is no longer maintained, and is not compatible with Python 3.
4

David Mertz's gnosis.xml.objectify would seem to do this for you. Documentation's a bit hard to come by, but there are a few IBM articles on it, including this one (text only version).

from gnosis.xml import objectify

xml = "<root><nodes><node>node 1</node><node>node 2</node></nodes></root>"
root = objectify.make_instance(xml)

print root.nodes.node[0].PCDATA # node 1
print root.nodes.node[1].PCDATA # node 2

Creating xml from objects in this way is a different matter, though.

Comments

1

How about this

http://evanjones.ca/software/simplexmlparse.html

Comments

1
#@Stephen: 
#"can't hardcode the element names, so I need to collect them 
#at parse and use them somehow as the object names."

#I don't think thats possible. Instead you can do this. 
#this will help you getting any object with a required name.

import BeautifulSoup


class Coll(object):
    """A class which can hold your Foo clas objects 
    and retrieve them easily when you want
    abstracting the storage and retrieval logic
    """
    def __init__(self):
        self.foos={}        

    def add(self, fooobj):
        self.foos[fooobj.name]=fooobj

    def get(self, name):
        return self.foos[name]

class Foo(object):
    """The required class
    """
    def __init__(self, name, attr1=None, attr2=None):
        self.name=name
        self.attr1=attr1
        self.attr2=attr2

s="""<main>
         <object name="somename">
             <attr name="attr1">value1</attr>
             <attr name="attr2">value2</attr>
         </object>
         <object name="someothername">
             <attr name="attr1">value3</attr>
             <attr name="attr2">value4</attr>
         </object>
     </main>
"""

#

soup=BeautifulSoup.BeautifulSoup(s)


bars=Coll()
for each in soup.findAll('object'):
    bar=Foo(each['name'])
    attrs=each.findAll('attr')
    for attr in attrs:
        setattr(bar, attr['name'], attr.renderContents())
    bars.add(bar)


#retrieve objects by name
print bars.get('somename').__dict__

print '\n\n', bars.get('someothername').__dict__

output

{'attr2': 'value2', 'name': u'somename', 'attr1': 'value1'}


{'attr2': 'value4', 'name': u'someothername', 'attr1': 'value3'}

Comments

1

I would suggest xsData (https://xsdata.readthedocs.io/en/latest/)

# Parse XML
from pathlib import Path
from tests.fixtures.primer import PurchaseOrder
from xsdata.formats.dataclass.parsers import XmlParser

xml_string = Path("tests/fixtures/primer/sample.xml").read_text()
parser = XmlParser()
order = parser.from_string(xml_string, PurchaseOrder)
order.bill_to

Comments

0

There are three common XML parsers for python: xml.dom.minidom, elementree, and BeautifulSoup.

IMO, BeautifulSoup is by far the best.

http://www.crummy.com/software/BeautifulSoup/

1 Comment

BeautifulSoup does not play well with XML - it has problem with empty tags <element/> - which is ok for HTML because those are not popular there

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.