How to perform a query on Big XML file using Python?

Question

I have a 7 GB XML file containing all the transactions of one company, and I want to keep only the records from last year (2015). The structure of the file is:

<Customer>
<Name>A</Name>
<Year>2015</Year>
</Customer>

I also have its DTD file. I do not know how to filter such data into a text file. Is there a tutorial or library I could use for this?


Possible duplicate of Prune some elements from large xml file
– Jan Vlcinsky, Commented May 2, 2016 at 20:00
How much memory can you muster for processing this XML file? There may be other options available depending on this number.
– vtd-xml-author, Commented May 26, 2016 at 22:11

ironchicken · Accepted Answer · 2016-05-02 20:27:10Z

3

As your data is large, I assume you've already decided that you won't be able to load the whole lot into memory, which is what a DOM-style (document object model) parser would do. You've also tagged your question 'SAX' (the Simple API for XML), which further implies you know that you need a streaming approach that never holds the whole document in memory.

Two approaches come to mind:

Using grep

Sometimes with XML it can be useful to use plain text processing tools. grep would allow you to filter through your XML document as plain text and find the occurrences of 2015:

$ grep -B 2 -A 1 "<Year>2015</Year>" test.xml

The -B 2 and -A 1 options instruct grep to print two lines of context before each match and one line after it.

However, this approach will only work if your XML is also precisely structured plain text, which there's absolutely no need for it (as XML) to be. That is, your XML could have any combination of whitespace (or none at all) between tags and still be semantically identical, but the grep approach depends on the exact arrangement of lines.
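To make that fragility concrete, here is a small sketch that mimics the line-context idea in Python. The function name `names_by_line_context` and both sample strings are made up for illustration; the second string is the same record with the whitespace between tags removed.

```python
pretty = """<Customer>
<Name>A</Name>
<Year>2015</Year>
</Customer>
"""
# The same record, semantically identical, written on a single line.
compact = "<Customer><Name>A</Name><Year>2015</Year></Customer>"

def names_by_line_context(text):
    """Mimic `grep -B 1`: return the line just before each matching Year line."""
    lines = text.splitlines()
    found = []
    for i, line in enumerate(lines):
        if line.strip() == "<Year>2015</Year>" and i > 0:
            found.append(lines[i - 1].strip())
    return found

print(names_by_line_context(pretty))   # finds the Name line
print(names_by_line_context(compact))  # finds nothing: there is only one line
```

The line-based match succeeds only for the layout it was written for, which is why a real XML parser is the safer choice.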

SAX

So a more reliable streaming approach would be to use SAX. SAX implementations are conceptually quite simple, but a little tedious to write. Essentially, you subclass a handler class and override the methods that are called when certain 'events' occur in the source XML document. In the xml.sax.handler module in the standard library, this class is ContentHandler. These methods include:

startElement
endElement
characters

Your overridden methods then determine how to handle those events. In a typical implementation of startElement(name, attrs) you might test the name argument to determine what the tag name of the element is. You might then maintain a stack of elements you have entered. When endElement(name) occurs, you might then pop the top element off that stack, and possibly do some processing on the completed element. The characters(content) method is called when character data is encountered in the source document, and it may be called more than once for a single run of text. In this method you might consider building up a string of the character data which can then be processed when you encounter an endElement.
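As a rough illustration of that event flow, the sketch below records each event together with the current element stack. The class name `EventTracer` and the sample document are invented for this example; only `ContentHandler` and `parseString` come from the standard library.

```python
from xml.sax import parseString
from xml.sax.handler import ContentHandler

class EventTracer(ContentHandler):
    """Records SAX events and tracks the open elements on a stack."""

    def __init__(self):
        super().__init__()
        self.stack = []    # names of the elements currently open
        self.buffer = []   # character data seen since the last start/end
        self.events = []   # recorded (kind, path[, text]) tuples

    def startElement(self, name, attrs):
        self.stack.append(name)
        self.buffer = []
        self.events.append(('start', '/'.join(self.stack)))

    def characters(self, content):
        # May be called several times for one text node, so accumulate.
        self.buffer.append(content)

    def endElement(self, name):
        text = ''.join(self.buffer).strip()
        self.events.append(('end', '/'.join(self.stack), text))
        self.stack.pop()
        self.buffer = []

SAMPLE = b"<Customer><Name>A</Name><Year>2015</Year></Customer>"
tracer = EventTracer()
parseString(SAMPLE, tracer)
for event in tracer.events:
    print(event)
```

Each element produces one start and one end event, and the text only becomes reliable at the end event, once all characters calls for that element have arrived.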

So for your specific task, something like this may work:

from xml.sax import parse
from xml.sax.handler import ContentHandler

class Filter2015(ContentHandler):
    def __init__(self):
        super().__init__()
        self.elements = []          # stack of the Name/Year elements we are inside
        self.char_data = ''         # string buffer
        self.current_customer = ''  # name of customer
        self.current_year = ''

    def startElement(self, name, attrs):
        if name in ('Name', 'Year'):
            self.elements.append(name)

    def characters(self, chars):
        # may be called several times for one text node, so accumulate
        if self.elements and self.elements[-1] in ('Name', 'Year'):
            self.char_data += chars

    def endElement(self, name):
        # only pop elements that startElement actually pushed
        if self.elements and self.elements[-1] == name:
            self.elements.pop()

        if name == 'Name':
            self.current_customer = self.char_data
            self.char_data = ''
        if name == 'Year':
            self.current_year = self.char_data
            self.char_data = ''

        if name == 'Customer':
            # wait to check the year until the Customer is closed
            if self.current_year == '2015':
                print('Found:', self.current_customer)

            # clear the buffers now that the Customer is finished
            self.current_year = ''
            self.current_customer = ''
            self.char_data = ''

# open in binary mode so the parser can honour the encoding declared in the file
with open('test.xml', 'rb') as source:
    parse(source, Filter2015())


6 Comments

Fawzi Belal Over a year ago

Great answer :) Now, if I want to save the result to a CSV file, will that overflow the memory, since a write operation is still needed?

Fawzi Belal Over a year ago

Is it possible to use XPath with SAX ?!

ironchicken Over a year ago

I would suggest that writing large CSV files should be the subject of another question (probably already answered). Basically, careful use of flush might be what you need.
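For what it's worth, writing each matching row as soon as its Customer closes should keep memory use flat, because nothing is accumulated between records. The following is only a sketch under assumptions: the class name `CsvFilter`, the wrapping `Customers` root element and the in-memory sample are all invented for illustration, and a real run would pass open file objects instead of `io` buffers.

```python
import csv
import io
from xml.sax import parse
from xml.sax.handler import ContentHandler

class CsvFilter(ContentHandler):
    """Writes one CSV row per Customer whose Year equals `year`."""

    def __init__(self, writer, year):
        super().__init__()
        self.writer = writer
        self.year = year
        self.field = None                      # 'Name' or 'Year' while inside one
        self.record = {'Name': '', 'Year': ''}

    def startElement(self, name, attrs):
        if name == 'Customer':
            self.record = {'Name': '', 'Year': ''}
        elif name in ('Name', 'Year'):
            self.field = name

    def characters(self, content):
        if self.field is not None:
            self.record[self.field] += content

    def endElement(self, name):
        if name in ('Name', 'Year'):
            self.field = None
        elif name == 'Customer':
            # Write immediately; only the current record is held in memory.
            if self.record['Year'].strip() == self.year:
                self.writer.writerow([self.record['Name'].strip(), self.year])

# Hypothetical sample data standing in for the 7 GB file.
xml_source = io.BytesIO(
    b"<Customers>"
    b"<Customer><Name>A</Name><Year>2015</Year></Customer>"
    b"<Customer><Name>B</Name><Year>2014</Year></Customer>"
    b"<Customer><Name>C</Name><Year>2015</Year></Customer>"
    b"</Customers>"
)
csv_out = io.StringIO()
writer = csv.writer(csv_out)
writer.writerow(['Name', 'Year'])
parse(xml_source, CsvFilter(writer, '2015'))
print(csv_out.getvalue())
```

With a real output file opened via `open('out.csv', 'w', newline='')`, Python's own file buffering flushes to disk periodically, so an explicit flush is usually only needed if another process must see the rows while the parse is still running.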

ironchicken Over a year ago

Whether or not you can use XPath with SAX is just a feature of the XPath processor you're using. The Java/C# Saxon XSLT processor implements XPath 1.0 and 2.0 (as a conformant XSLT 2.0 implementation) and is written using the SAX API.

Fawzi Belal Over a year ago

With the code above, I have a problem: what if there are two tags named Year? How can I retrieve both?
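One possible way to handle repeated tags is to collect the values in a list per Customer instead of a single string. This is a sketch rather than a change to the answer's code: the class name `MultiYearHandler`, the `Customers` root element and the sample data are assumptions made for the example.

```python
from xml.sax import parseString
from xml.sax.handler import ContentHandler

class MultiYearHandler(ContentHandler):
    """Collects every Year of a Customer and keeps customers that include 2015."""

    def __init__(self):
        super().__init__()
        self.in_field = None   # 'Name' or 'Year' while inside that element
        self.text = []         # character data of the current field
        self.name = ''
        self.years = []        # all Year values of the current Customer
        self.matches = []      # (name, years) for each matching Customer

    def startElement(self, name, attrs):
        if name == 'Customer':
            self.name, self.years = '', []
        elif name in ('Name', 'Year'):
            self.in_field = name
            self.text = []

    def characters(self, content):
        if self.in_field:
            self.text.append(content)

    def endElement(self, name):
        if name == 'Name':
            self.name = ''.join(self.text).strip()
            self.in_field = None
        elif name == 'Year':
            # Append rather than overwrite, so repeated Year tags all survive.
            self.years.append(''.join(self.text).strip())
            self.in_field = None
        elif name == 'Customer':
            if '2015' in self.years:
                self.matches.append((self.name, list(self.years)))

SAMPLE = (
    b"<Customers>"
    b"<Customer><Name>A</Name><Year>2014</Year><Year>2015</Year></Customer>"
    b"<Customer><Name>B</Name><Year>2013</Year></Customer>"
    b"</Customers>"
)
handler = MultiYearHandler()
parseString(SAMPLE, handler)
print(handler.matches)
```

The membership test in endElement decides what "matches" means here (any Year equal to 2015); a different rule, such as requiring all years to match, would only change that one line.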


Community · Accepted Answer · 2017-05-23 12:15:53Z

1

Check out this question. It shows how to iterate over the XML file as a generator:

python: is there an XML parser implemented as a generator?

You want to use a generator so that you don't load the entire doc into memory first.

Specifically:

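import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9

for event, element in ET.iterparse('huge.xml'):
    if event == 'end' and element.tag == 'ticket':
        # process ticket...
        element.clear()  # drop the element's children once it has been handled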

Source: http://enginerds.craftsy.com/blog/2014/04/parsing-large-xml-files-in-python-without-a-billion-gigs-of-ram.html
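Adapted to the structure in the question, the same idea might look like the sketch below. The function name `customers_for_year`, the `Customers` root element and the sample data are assumptions for illustration; `findtext` takes a path in ElementTree's limited XPath subset.

```python
import io
import xml.etree.ElementTree as ET

def customers_for_year(source, year):
    """Yield the Name of every Customer whose first Year equals `year`."""
    context = ET.iterparse(source, events=('start', 'end'))
    _, root = next(context)          # the first event is the root's start
    for event, element in context:
        if event == 'end' and element.tag == 'Customer':
            if element.findtext('Year') == year:
                yield element.findtext('Name')
            # Detach finished records from the root so they can be freed.
            root.clear()

# Hypothetical sample standing in for the large file.
SAMPLE = (
    b"<Customers>"
    b"<Customer><Name>A</Name><Year>2015</Year></Customer>"
    b"<Customer><Name>B</Name><Year>2014</Year></Customer>"
    b"<Customer><Name>C</Name><Year>2015</Year></Customer>"
    b"</Customers>"
)
names = list(customers_for_year(io.BytesIO(SAMPLE), '2015'))
print(names)
```

For a real file, pass the file name or a file opened in binary mode instead of the `io.BytesIO` buffer. Clearing the root after each record is one common way to stop already-processed elements from piling up during a long parse.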

edited May 23, 2017 at 12:15


answered May 2, 2016 at 19:47

Kelvin


2 Comments

Fawzi Belal Over a year ago

This is not a good solution for me, because I also have a DTD file that describes some field types, and it is important for parsing, especially CDATA. I got the following error: File "<string>", line unknown ParseError: undefined entity ö: line 47, column 18

ironchicken Over a year ago

This use of ElementTree is basically the same principle as SAX, just with a cleaner API. I might actually prefer this to my SAX answer :-)
