
I have a series of large XML files (~3GB each) that I'm trying to process. The rough format of the XML is

<FILE>
<DOC>
    <FIELD1>
        Some text.
    </FIELD1>
    <FIELD2>
        Some text. Probably some more fields nested within this one.
    </FIELD2>
    <FIELD3>
        Some text.
    </FIELD3>
    <FIELD4>
        Some text. Etc.
    </FIELD4>
</DOC>
<DOC>
    <FIELD1>
        Some text.
    </FIELD1>
    <FIELD2>
        Some text. Probably some more fields nested within this one.
    </FIELD2>
    <FIELD3>
        Some text.
    </FIELD3>
    <FIELD4>
        Some text. Etc.
    </FIELD4>
</DOC>
</FILE>

My current approach is (mimicking the code seen at http://effbot.org/zone/element-iterparse.htm#incremental-parsing):

#Added this in the edit.
import xml.etree.ElementTree as ET

tree = ET.iterparse(xml_file)
tree = iter(tree)
event, root = tree.next()

for event, elem in tree:
    #Need to find the <DOC> elements
    if event == "end" and elem.tag == "DOC":
        #Code to process the fields within the <DOC> element. 
        #The code here mainly just iterates through the inner 
        #elements and extracts what I need
        root.clear()

This blows up, though, and uses all of my system memory (16GB). At first I thought the problem was the position of the root.clear() call, so I tried moving it to after the if-statement, but that didn't seem to have any effect. Given this, I'm not quite sure how to proceed other than "get more memory."
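One thing worth noting about the linked recipe: it passes events=("start", "end") to iterparse, so the first tuple yielded really is the document root. With the default events (only "end"), the first event yielded is the close of the first leaf element, so root ends up bound to that leaf instead of to <FILE>, and root.clear() never prunes the tree that iterparse keeps building. Below is a minimal, self-contained sketch of the recipe's pattern; the tiny in-memory document and the FIELD1 extraction stand in for the real 3GB file and the real per-DOC processing:

```python
import io
import xml.etree.ElementTree as ET

# Stand-in for the real 3GB file.
xml_bytes = io.BytesIO(
    b"<FILE>"
    b"<DOC><FIELD1>one</FIELD1></DOC>"
    b"<DOC><FIELD1>two</FIELD1></DOC>"
    b"</FILE>"
)

# Request "start" events too, so the first event hands us the real root.
context = ET.iterparse(xml_bytes, events=("start", "end"))
event, root = next(context)          # root is now the <FILE> element

texts = []
for event, elem in context:
    if event == "end" and elem.tag == "DOC":
        # Stand-in for the real per-<DOC> processing.
        texts.append(elem.find("FIELD1").text)
        # Drop the already-processed <DOC> children from the root.
        root.clear()

print(texts)                         # -> ['one', 'two']
```

(next(context) works on both Python 2.6+ and Python 3, unlike the tree.next() spelling above.)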

EDIT:

Deleted the previous edit because it was wrong.

  • If there is that much data, XML may not be the most efficient way to store it Commented Jan 4, 2014 at 21:54
  • have you tried using lxml? Commented Jan 4, 2014 at 22:08
  • Have you tried using SAX? Commented Jan 4, 2014 at 22:31
  • What are you trying to accomplish? Consider one of the libraries like one of the above that allow you to parse XML files without loading the whole thing into memory, since it seems that all you need is to extract data. Commented Jan 4, 2014 at 22:50
  • @735Tesla, I agree. The data has been given to me, though. Commented Jan 5, 2014 at 0:11

1 Answer


I think you can use the code you've already written if you switch to lxml and do this to clear out the tree...

from lxml import etree
context = etree.iterparse(xmlfile)  # can also limit to certain events and tags
for event, elem in context:
    # do some stuff here with elem
    elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]

I'm not claiming this is efficient, but it might get the job done.
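Note that getprevious() and getparent() are lxml-specific; plain xml.etree elements don't carry parent links. In the standard library, the same "delete already-processed siblings" idea can be approximated by keeping a reference to the root (via a "start" event) and deleting its processed children. A rough sketch, with placeholder tag names standing in for the real document:

```python
import io
import xml.etree.ElementTree as ET

# Placeholder document; <DOC> and <F> stand in for the real tags.
xml_bytes = io.BytesIO(
    b"<FILE><DOC><F>a</F></DOC><DOC><F>b</F></DOC></FILE>"
)

context = ET.iterparse(xml_bytes, events=("start", "end"))
event, root = next(context)          # grab the root from the first "start"

results = []
for event, elem in context:
    if event == "end" and elem.tag == "DOC":
        results.append(elem.find("F").text)
        elem.clear()                 # free the element's own children/text
        # No getparent()/getprevious() here, so prune processed siblings
        # from the root directly, keeping only the current element:
        while len(root) > 1:
            del root[0]
```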


3 Comments

The elem.getprevious() fails with an error that elem doesn't have the attribute of getprevious. I've also added more info in the edit to the OP.
Apologies, I checked my code and I am using lxml. I have edited my response to reflect that. There might be an equivalent, however, in xml.etree.
It looks like the elem.clear() has done it. I haven't had a chance to run through the entire file, but the memory usage looks like it's holding steady.
