
I have an XML file, about 30MB, with about 300,000 elements in it.

I use the following code to process this file.

import xml.dom.minidom

xmldoc = xml.dom.minidom.parse("badges.xml")

csv_out = open("badge.csv", "w")

for badge in xmldoc.getElementsByTagName("row"):
    # some processing here
    csv_out.write(line)

The file is only 30MB, but when I run this script on my MBP (10.7, 8GB RAM), it uses nearly 3GB of memory. Why does such a simple script on such a small file use so much memory?

Best Regards,

  • How are you measuring memory usage? Commented Sep 6, 2012 at 15:28
  • Try it with a reasonable parser like lxml. Commented Sep 6, 2012 at 15:35
  • minidom is not a parser, it is prototype-level crap Commented Sep 6, 2012 at 15:52
  • It would be helpful to see the 'some processing here' code too. Commented Sep 6, 2012 at 16:00

3 Answers


You'll need to switch to an iterative parser, which processes the XML in chunks and lets you free memory as you go. The DOM parser, by contrast, loads the whole document into memory in one go and builds a full object tree for it.

The standard library has both a SAX parser (xml.sax) and ElementTree.iterparse available for you.

Quick iterparse example:

from xml.etree.ElementTree import iterparse

with open("badge.csv", "w") as csv_out:
    for event, elem in iterparse("badges.xml"):
        if event == 'end' and elem.tag == 'row':  # complete <row> element
            # some processing here
            csv_out.write(line)
            elem.clear()  # free this element's children once processed

Note the .clear() call; that frees up the element and removes it from memory.
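One subtlety worth knowing: even with `elem.clear()`, the root element keeps accumulating references to the (now-empty) processed rows, which still costs memory on very large files. A common pattern is to grab the root from the first event and clear it as you go. A minimal self-contained sketch, assuming a top-level `<badges>` element whose `<row/>` children carry their data as attributes (as in the Stack Exchange data dumps):

```python
from xml.etree.ElementTree import iterparse
import io

# Stand-in for the real badges.xml file (hypothetical structure).
xml_data = io.BytesIO(b'<badges><row Id="1"/><row Id="2"/></badges>')

context = iterparse(xml_data, events=("start", "end"))
event, root = next(context)          # the first event delivers the root element
ids = []
for event, elem in context:
    if event == "end" and elem.tag == "row":
        ids.append(elem.get("Id"))   # some processing here
        root.clear()                 # drop processed children from the root
print(ids)  # ['1', '2']
```

With this, memory use stays roughly constant no matter how many rows the file contains.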


1 Comment

SAX parsers have limited functionality e.g. don't provide support for xpath which is often needed for serious processing of XML. SAX parsers are not a general solution here.

DOM-type XML parsers can use a lot of memory since they load the whole document, but 3GB seems more than a little excessive for a 30MB file, so there is likely something else going on.

However, you might want to consider a SAX-style XML parser (xml.sax in Python). In this type of parser, your code sees each element (tag, text, etc.) via a callback as the parser processes it. A SAX-style parser retains no document structure; indeed, nothing but a single XML element is ever considered. For this reason it's fast and memory-efficient. It can be a pain to work with if your parsing needs are complex, but it seems like yours are pretty straightforward.
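To illustrate the callback style, here is a minimal sketch using `xml.sax` from the standard library. The `RowHandler` class and the inline sample document are hypothetical stand-ins for the real badges file:

```python
import xml.sax

class RowHandler(xml.sax.ContentHandler):
    """Collects the attributes of each <row/> element as it is parsed."""
    def __init__(self):
        super().__init__()
        self.rows = []

    def startElement(self, name, attrs):
        # The parser invokes this once per opening tag; only the current
        # element is ever held in memory, never the whole document.
        if name == "row":
            self.rows.append(dict(attrs))

handler = RowHandler()
xml.sax.parseString(b'<badges><row Id="1"/><row Id="2"/></badges>', handler)
print(handler.rows)  # [{'Id': '1'}, {'Id': '2'}]
```

In the real script you would write each row out to the CSV inside `startElement` instead of accumulating them in a list.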



I use lxml on very large XML files and have never had any problems.

See this Stack Overflow question for help installing, as I had to do this on my Ubuntu system:

pip install lxml error
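For reference, lxml has its own `iterparse` with the same event-driven interface, plus a `tag` filter that delivers only the elements you care about. A minimal sketch, assuming `<row/>` elements with an `Id` attribute (hypothetical data standing in for badges.xml):

```python
from io import BytesIO
from lxml import etree

# Stand-in for the real badges.xml file.
data = BytesIO(b'<badges><row Id="1"/><row Id="2"/></badges>')

ids = []
# tag="row" makes lxml deliver only the elements we want
for event, elem in etree.iterparse(data, tag="row"):
    ids.append(elem.get("Id"))       # some processing here
    elem.clear()                     # free this element's children
    while elem.getprevious() is not None:
        del elem.getparent()[0]      # drop processed siblings from the tree
print(ids)  # ['1', '2']
```

The sibling-deletion loop is the standard lxml idiom for keeping memory flat on large files.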

