
I have an XML file, about 30MB, with about 300,000 elements in it.

I use the following code to process this file.

import xml.dom.minidom

xmldoc = xml.dom.minidom.parse("badges.xml")

csv_out = open("badge.csv", "w")

for badge in xmldoc.getElementsByTagName("row"):
    # some processing here
    csv_out.write(line)

The file is only 30MB, but when I run this script on my MBP (10.7, 8GB RAM), it uses nearly 3GB of memory. Why does such a simple script on such a small file use so much memory?

Best Regards,

  • How are you measuring memory usage? Commented Sep 6, 2012 at 15:28
  • Try it with a reasonable parser like lxml. Commented Sep 6, 2012 at 15:35
  • minidom is not a parser, it is prototype-level crap Commented Sep 6, 2012 at 15:52
  • It would be helpful to see the 'some processing here' code too. Commented Sep 6, 2012 at 16:00

3 Answers


You'll need to switch to an iterative parser, which processes the XML in chunks and lets you free memory as you go. The DOM parser, by contrast, loads the whole document into memory in one go and builds a full object tree for it.

The standard library has both a SAX parser (xml.sax) and ElementTree.iterparse available for you.

Quick iterparse example:

from xml.etree.ElementTree import iterparse

with open("badge.csv", "w") as csv_out:
    for event, elem in iterparse("badges.xml"):
        if event == 'end' and elem.tag == 'row':  # complete <row> element
            # some processing here
            csv_out.write(line)
            elem.clear()  # free this element's children once processed

Note the .clear() call; that frees up the element and removes it from memory.
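One subtlety worth knowing: even with `elem.clear()`, the root element keeps accumulating references to the (now-empty) processed rows, which still costs memory on very large files. A common pattern is to grab the root from the first event and clear it as you go. A minimal self-contained sketch, assuming a top-level `<badges>` element whose `<row/>` children carry their data as attributes (as in the Stack Exchange data dumps):

```python
from xml.etree.ElementTree import iterparse
import io

# Stand-in for the real badges.xml file (hypothetical structure).
xml_data = io.BytesIO(b'<badges><row Id="1"/><row Id="2"/></badges>')

context = iterparse(xml_data, events=("start", "end"))
event, root = next(context)          # the first event delivers the root element
ids = []
for event, elem in context:
    if event == "end" and elem.tag == "row":
        ids.append(elem.get("Id"))   # some processing here
        root.clear()                 # drop processed children from the root
print(ids)  # ['1', '2']
```

With this, memory use stays roughly constant no matter how many rows the file contains.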


1 Comment

SAX parsers have limited functionality e.g. don't provide support for xpath which is often needed for serious processing of XML. SAX parsers are not a general solution here.

DOM-type XML parsers can use a lot of memory since they load the whole document, but 3GB seems more than a little excessive for a 30MB file, so there is likely something else going on.

However, you might want to consider a SAX-style XML parser (xml.sax in Python). In this type of parser, your code sees each element (tag, text, etc.) via a callback as the parser processes it. A SAX-style parser retains no document structure; indeed, nothing but a single XML element is ever considered. For this reason it's fast and memory-efficient. It can be a pain to work with if your parsing needs are complex, but it seems like yours are pretty straightforward.
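To illustrate the callback style, here is a minimal sketch using `xml.sax` from the standard library. The `RowHandler` class and the inline sample document are hypothetical stand-ins for the real badges file:

```python
import xml.sax

class RowHandler(xml.sax.ContentHandler):
    """Collects the attributes of each <row/> element as it is parsed."""
    def __init__(self):
        super().__init__()
        self.rows = []

    def startElement(self, name, attrs):
        # The parser invokes this once per opening tag; only the current
        # element is ever held in memory, never the whole document.
        if name == "row":
            self.rows.append(dict(attrs))

handler = RowHandler()
xml.sax.parseString(b'<badges><row Id="1"/><row Id="2"/></badges>', handler)
print(handler.rows)  # [{'Id': '1'}, {'Id': '2'}]
```

In the real script you would write each row out to the CSV inside `startElement` instead of accumulating them in a list.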



I use lxml on very large XML files and have never had any problems.

See this Stack Overflow question for help installing, as I had to do this on my Ubuntu system:

pip install lxml error
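For reference, lxml has its own `iterparse` with the same event-driven interface, plus a `tag` filter that delivers only the elements you care about. A minimal sketch, assuming `<row/>` elements with an `Id` attribute (hypothetical data standing in for badges.xml):

```python
from io import BytesIO
from lxml import etree

# Stand-in for the real badges.xml file.
data = BytesIO(b'<badges><row Id="1"/><row Id="2"/></badges>')

ids = []
# tag="row" makes lxml deliver only the elements we want
for event, elem in etree.iterparse(data, tag="row"):
    ids.append(elem.get("Id"))       # some processing here
    elem.clear()                     # free this element's children
    while elem.getprevious() is not None:
        del elem.getparent()[0]      # drop processed siblings from the tree
print(ids)  # ['1', '2']
```

The sibling-deletion loop is the standard lxml idiom for keeping memory flat on large files.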

