
I have a series of large XML files (~3GB each) that I'm trying to process. The rough format of the XML is

<FILE>
<DOC>
    <FIELD1>
        Some text.
    </FIELD1>
    <FIELD2>
        Some text. Probably some more fields nested within this one.
    </FIELD2>
    <FIELD3>
        Some text.
    </FIELD3>
    <FIELD4>
        Some text. Etc.
    </FIELD4>
</DOC>
<DOC>
    <FIELD1>
        Some text.
    </FIELD1>
    <FIELD2>
        Some text. Probably some more fields nested within this one.
    </FIELD2>
    <FIELD3>
        Some text.
    </FIELD3>
    <FIELD4>
        Some text. Etc.
    </FIELD4>
</DOC>
</FILE>

My current approach is (mimicking the code seen at http://effbot.org/zone/element-iterparse.htm#incremental-parsing):

#Added this in the edit.
import xml.etree.ElementTree as ET

tree = ET.iterparse(xml_file)
tree = iter(tree)
event, root = tree.next()

for event, elem in tree:
    #Need to find the <DOC> elements
    if event == "end" and elem.tag == "DOC":
        #Code to process the fields within the <DOC> element. 
        #The code here mainly just iterates through the inner 
        #elements and extracts what I need
        root.clear()

This blows up, though, and uses all of my system memory (16GB). At first I thought the problem was the position of the root.clear() call, so I tried moving it to after the if-statement, but that didn't seem to have any effect. Given this, I'm not quite sure how to proceed other than "get more memory."
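One thing worth noting about the linked recipe: it passes events=("start", "end") to iterparse, so the first tuple yielded really is the document root. With the default events (only "end"), the first event yielded is the close of the first leaf element, so root ends up bound to that leaf instead of to <FILE>, and root.clear() never prunes the tree that iterparse keeps building. Below is a minimal, self-contained sketch of the recipe's pattern; the tiny in-memory document and the FIELD1 extraction stand in for the real 3GB file and the real per-DOC processing:

```python
import io
import xml.etree.ElementTree as ET

# Stand-in for the real 3GB file.
xml_bytes = io.BytesIO(
    b"<FILE>"
    b"<DOC><FIELD1>one</FIELD1></DOC>"
    b"<DOC><FIELD1>two</FIELD1></DOC>"
    b"</FILE>"
)

# Request "start" events too, so the first event hands us the real root.
context = ET.iterparse(xml_bytes, events=("start", "end"))
event, root = next(context)          # root is now the <FILE> element

texts = []
for event, elem in context:
    if event == "end" and elem.tag == "DOC":
        # Stand-in for the real per-<DOC> processing.
        texts.append(elem.find("FIELD1").text)
        # Drop the already-processed <DOC> children from the root.
        root.clear()

print(texts)                         # -> ['one', 'two']
```

(next(context) works on both Python 2.6+ and Python 3, unlike the tree.next() spelling above.)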

EDIT:

Deleted the previous edit because it was wrong.

  • If there is that much data, XML may not be the most efficient way to store it Commented Jan 4, 2014 at 21:54
  • have you tried using lxml? Commented Jan 4, 2014 at 22:08
  • Have you tried using SAX? Commented Jan 4, 2014 at 22:31
  • What are you trying to accomplish? Consider one of the libraries like one of the above that allow you to parse XML files without loading the whole thing into memory, since it seems that all you need is to extract data. Commented Jan 4, 2014 at 22:50
  • @735Tesla, I agree. The data has been given to me, though. Commented Jan 5, 2014 at 0:11

1 Answer


I think you can use the code you've already written if you switch to lxml and do this to clear out the tree...

from lxml import etree
context = etree.iterparse(xmlfile)  # can also limit to certain events and tags
for event, elem in context:
    # do some stuff here with elem
    elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]

I'm not claiming this is efficient, but it might get the job done.
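Note that getprevious() and getparent() are lxml-specific; plain xml.etree elements don't carry parent links. In the standard library, the same "delete already-processed siblings" idea can be approximated by keeping a reference to the root (via a "start" event) and deleting its processed children. A rough sketch, with placeholder tag names standing in for the real document:

```python
import io
import xml.etree.ElementTree as ET

# Placeholder document; <DOC> and <F> stand in for the real tags.
xml_bytes = io.BytesIO(
    b"<FILE><DOC><F>a</F></DOC><DOC><F>b</F></DOC></FILE>"
)

context = ET.iterparse(xml_bytes, events=("start", "end"))
event, root = next(context)          # grab the root from the first "start"

results = []
for event, elem in context:
    if event == "end" and elem.tag == "DOC":
        results.append(elem.find("F").text)
        elem.clear()                 # free the element's own children/text
        # No getparent()/getprevious() here, so prune processed siblings
        # from the root directly, keeping only the current element:
        while len(root) > 1:
            del root[0]
```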


3 Comments

The elem.getprevious() fails with an error that elem doesn't have the attribute of getprevious. I've also added more info in the edit to the OP.
Apologies, I checked my code and I am using lxml. I have edited my response to reflect that. There might be an equivalent, however, in xml.etree.
It looks like the elem.clear() has done it. I haven't had a chance to run through the entire file, but the memory usage looks like it's holding steady.
