I have a series of large XML files (~3GB each) that I'm trying to process. The rough format of the XML is:
<FILE>
  <DOC>
    <FIELD1>
      Some text.
    </FIELD1>
    <FIELD2>
      Some text. Probably some more fields nested within this one.
    </FIELD2>
    <FIELD3>
      Some text.
    </FIELD3>
    <FIELD4>
      Some text. Etc.
    </FIELD4>
  </DOC>
  <DOC>
    <FIELD1>
      Some text.
    </FIELD1>
    <FIELD2>
      Some text. Probably some more fields nested within this one.
    </FIELD2>
    <FIELD3>
      Some text.
    </FIELD3>
    <FIELD4>
      Some text. Etc.
    </FIELD4>
  </DOC>
</FILE>
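(For anyone trying to reproduce this: a small file in the same shape can be generated with something like the sketch below. The write_sample helper and its placeholder text are mine, not from the real data; the actual files are ~3GB.)

```python
import xml.etree.ElementTree as ET

def write_sample(path, num_docs=3):
    # Build a small <FILE> containing repeated <DOC> elements,
    # each with four <FIELDn> children, matching the shape above.
    root = ET.Element("FILE")
    for i in range(num_docs):
        doc = ET.SubElement(root, "DOC")
        for n in range(1, 5):
            field = ET.SubElement(doc, "FIELD%d" % n)
            field.text = "Some text (doc %d, field %d)." % (i, n)
    ET.ElementTree(root).write(path, encoding="utf-8")
```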
My current approach is (mimicking the code seen at http://effbot.org/zone/element-iterparse.htm#incremental-parsing):
# Added this in the edit.
import xml.etree.ElementTree as ET

tree = ET.iterparse(xml_file)
tree = iter(tree)
event, root = tree.next()

for event, elem in tree:
    # Need to find the <DOC> elements.
    if event == "end" and elem.tag == "DOC":
        # Code to process the fields within the <DOC> element.
        # The code here mainly just iterates through the inner
        # elements and extracts what I need.
        root.clear()
This blows up, though, and uses all of my system memory (16GB). At first I thought the problem was the position of the root.clear() call, so I tried moving it out past the if-statement, but that didn't seem to have any effect. Given this, I'm not quite sure how to proceed other than "get more memory."
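To make the discussion concrete, here is a self-contained version of the pattern I'm trying to follow, run against an in-memory sample instead of a real 3GB file. It uses Python 3 spelling (next(context) rather than tree.next()), and it follows the effbot example in asking iterparse for "start" events too, so the first event really is the opening root <FILE> element. Counting <DOC>s stands in for my actual field extraction:

```python
import io
import xml.etree.ElementTree as ET

def count_docs(xml_bytes):
    # iterparse accepts a file-like object; request "start" events as well
    # so the very first event is the opening tag of the root element.
    context = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
    event, root = next(context)  # grab the root from its "start" event
    count = 0
    for event, elem in context:
        if event == "end" and elem.tag == "DOC":
            count += 1        # stand-in for the real field extraction
            root.clear()      # drop processed <DOC>s so memory stays flat
    return count

sample = (b"<FILE>"
          b"<DOC><FIELD1>a</FIELD1></DOC>"
          b"<DOC><FIELD1>b</FIELD1></DOC>"
          b"</FILE>")
```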
EDIT:

Deleted the previous edit because it was wrong.

Would lxml help here?