Iteratively write XML nodes in python

Question

There are many ways to read XML, both all at once (DOM) and one piece at a time (SAX). I have used SAX and lxml to iteratively read large XML files (e.g. a Wikipedia dump, which is 6.5 GB compressed).

However, after doing some iterative processing of that XML file (in Python, using ElementTree), I want to write the new XML data out to another file.

Are there any libraries for writing XML data iteratively? I could build the whole XML tree and then write it out, but that is not possible without oodles of RAM. Is there any way to write the XML tree to a file iteratively, one piece at a time?

I know I could generate the XML myself with print "<%s>" % tag_name, etc., but that seems a bit... hacky.

Possible duplicate of "What's the easiest non-memory intensive way to output XML from Python?" (comment by mmmmmm, Jan 6, 2012 at 15:40)

samplebias · Accepted Answer · 2011-03-21 15:28:37Z

4

Fredrik Lundh's elementtree.SimpleXMLWriter will let you write out XML incrementally. Here's the demo code embedded in the module:

from elementtree.SimpleXMLWriter import XMLWriter
import sys

w = XMLWriter(sys.stdout)

html = w.start("html")

w.start("head")
w.element("title", "my document")
w.element("meta", name="generator", value="my application 1.0")
w.end()

w.start("body")
w.element("h1", "this is a heading")
w.element("p", "this is a paragraph")

w.start("p")
w.data("this is ")
w.element("b", "bold")
w.data(" and ")
w.element("i", "italic")
w.data(".")
w.end("p")

w.close(html)
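
For readers without the standalone elementtree package, a similar incremental style is available in the standard library through xml.sax.saxutils.XMLGenerator. The following is only a minimal sketch under that assumption; the write_pages function and the "pages"/"page" element names are made up for illustration and are not part of the answer above.

```python
import io
from xml.sax.saxutils import XMLGenerator


def write_pages(out, pages):
    """Stream <page> elements to the text stream `out`, one at a time."""
    gen = XMLGenerator(out, encoding="utf-8")
    gen.startDocument()                 # writes the XML declaration
    gen.startElement("pages", {})       # hypothetical root element
    for title, text in pages:
        # Each element is written as soon as it is produced;
        # no tree is kept in memory.
        gen.startElement("page", {"title": title})
        gen.characters(text)            # escapes &, < and >
        gen.endElement("page")
    gen.endElement("pages")
    gen.endDocument()


buf = io.StringIO()
write_pages(buf, [("A & B", "first"), ("C", "x < y")])
result = buf.getvalue()
```

Because pages can be any iterable, including a generator fed by an incremental parser, memory use stays roughly proportional to a single element.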


unutbu · Answer · 2011-03-21 13:25:25Z

1

With lxml, you can use etree.Element to create new nodes, and etree.tostring to write out the XML representation. See, for example, "Listing 6. Serialize an element's children" in Liza Daly's article "High-performance XML parsing in Python with lxml".


2 Comments

Amandasaurus Over a year ago

Do I need to have the whole tree in memory in order to use etree.tostring? If so, this is a non-runner.

unutbu Over a year ago

@Rory: The fast_iter function steps through the nodes without generating the entire DOM. You can then modify some or all of those nodes one at a time, and write them out with etree.tostring.
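
The read, modify and write loop described in this exchange can be sketched as follows. This is not the fast_iter function from the article: it uses the standard library's xml.etree.ElementTree (whose iterparse and tostring resemble lxml's) so that it runs without extra dependencies. The transform_stream name, the "processed" attribute and the "root" wrapper element are all assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET


def transform_stream(source, out, tag, root_tag="root"):
    """Write each <tag> element from `source` to `out` as soon as it is parsed."""
    count = 0
    out.write("<%s>" % root_tag)
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)              # first event is the start of the root
    for event, elem in context:
        if event != "end" or elem.tag != tag:
            continue
        elem.set("processed", "1")       # stand-in for the real processing
        elem.tail = None                 # drop whitespace between elements
        out.write(ET.tostring(elem, encoding="unicode"))
        elem.clear()                     # free the element's contents
        root.clear()                     # drop references held by the root
        count += 1
    out.write("</%s>" % root_tag)
    return count


src = io.BytesIO(
    b"<dump>\n  <page><title>A</title></page>\n"
    b"  <page><title>B</title></page>\n</dump>"
)
dst = io.StringIO()
count = transform_stream(src, dst, "page")
```

Only one matching element is serialized at a time, so neither the input tree nor the output document has to fit in memory as a whole.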

xtofl · Answer · 2011-03-21 13:28:09Z

1

If you're reading XML in one dialect and have to write it in another, wouldn't it be a good idea to express the conversion in XSLT? You may not even need any source code that way.


2 Comments

Amandasaurus Over a year ago

I don't mind about programming or not programming. The most important thing is the memory consumption. I do not have space to store all the source document in memory. Isn't XSLT memory intensive? (i.e. isn't that why STX was invented?)

xtofl Over a year ago

@Rory: XSLT allows you to specify the transformations in a declarative way. Then you can apply the transformation using the tools at hand. I must admit I haven't had to worry about scalability yet. I suggest you take a look at the available XSLT processors. A first look tells me that e.g. Saxon has a 'lazy construction' mode (saxonica.com/documentation/javadoc/net/sf/saxon/lib/…)

jsbueno · Answer · 2011-03-21 13:31:31Z

If you don't find anything else, what I'd do here is inherit from ElementTree and create an "IterativeElementTree", adding a "file" attribute to it. I'd subclass the nodes to have a "start_tag_committed" attribute and a "commit" method. When called, this "commit" method would call the render method for a subtree, starting from the farthest parent whose "start_tag_committed" is false. With the string in hand, I'd manually strip the closing tags for the parents of the current node. Previously opened but not yet closed siblings of the parents need to be handled as well.

Then I'd remove the committed node from the in-memory model. You will also need to annotate each node with its parent, as ElementTree does not do that.

(Write me if there are no better answers and you get stuck there; I could implement this.)
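
A much simplified take on this idea might look like the sketch below. It does not subclass ElementTree or its nodes as the answer proposes; instead, a small hypothetical IncrementalWriter class keeps a stack of parents whose start tags have already been written, renders each start tag by stripping the closing tag from a serialized empty element, and discards every subtree once it is committed. All names here are assumptions for illustration.

```python
import io
import xml.etree.ElementTree as ET


class IncrementalWriter:
    """Writes parent start tags once, then flushes finished subtrees."""

    def __init__(self, out):
        self.out = out
        self.open_tags = []   # parents whose start tag is already written

    def open(self, tag, attrib=None):
        # Render an empty element, then strip its closing tag so that
        # only the start tag (with escaped attributes) is written.
        rendered = ET.tostring(
            ET.Element(tag, attrib or {}),
            encoding="unicode",
            short_empty_elements=False,
        )
        closing = "</%s>" % tag
        self.out.write(rendered[: -len(closing)])
        self.open_tags.append(tag)

    def commit(self, elem):
        # Serialize the finished subtree, then drop it from memory.
        self.out.write(ET.tostring(elem, encoding="unicode"))
        elem.clear()

    def close(self):
        # Close every parent that is still open, innermost first.
        while self.open_tags:
            self.out.write("</%s>" % self.open_tags.pop())


buf = io.StringIO()
writer = IncrementalWriter(buf)
writer.open("mediawiki", {"version": "1"})
for title in ("A", "B"):
    page = ET.Element("page")
    ET.SubElement(page, "title").text = title
    writer.commit(page)
writer.close()
```

Tracking open parents on an explicit stack avoids the need to annotate each node with its parent, at the cost of making the caller decide when a parent is opened and closed.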
