
I have a 7 GB XML file containing all transactions for one company, and I want to keep only the records from last year (2015). The structure of the file is:

<Customer>
<Name>A</Name>
<Year>2015</Year>
</Customer>

I also have its DTD file. I do not know how to filter this data into a text file. Is there a tutorial or library I can use for this?


2
  • Possible duplicate of Prune some elements from large xml file Commented May 2, 2016 at 20:00
  • How much memory can you muster for processing this XML file? There may be other options available depending on this number. Commented May 26, 2016 at 22:11

2 Answers

3

As your data is large, I assume you've already decided that you won't be able to load the whole lot into memory, which is what a DOM-style (document object model) parser would do. You've also tagged your question 'SAX' (the Simple API for XML), which further implies you know that you need an approach that doesn't hold the whole document in memory.

Two approaches come to mind:

Using grep

Sometimes with XML it can be useful to use plain text processing tools. grep would allow you to filter through your XML document as plain text and find the occurrences of 2015:

$ grep -B 2 -A 1 "<Year>2015</Year>" test.xml

The -B and -A options instruct grep to print some lines of context around the match.

However, this approach will only work if your XML is also precisely structured plain text, which XML has no obligation to be. That is, your XML could have any arrangement of whitespace between elements (or none at all) and still be semantically identical, but the grep approach depends on the exact line layout.
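To make the whitespace caveat concrete, here is a small sketch in Python that mimics grep's context options on two copies of the same data. The grep_context helper, the Customers root element and the second customer are assumptions made up for the illustration; they are not from the question.

```python
import re

# Pretty-printed sample: one tag per line, as in the question.
PRETTY = """<Customers>
<Customer>
<Name>A</Name>
<Year>2015</Year>
</Customer>
<Customer>
<Name>B</Name>
<Year>2014</Year>
</Customer>
</Customers>
"""

# Same data with the whitespace between tags removed (still equivalent XML).
COMPACT = re.sub(r">\s+<", "><", PRETTY.strip())

def grep_context(text, needle, before=2, after=1):
    """Roughly mimic `grep -B before -A after` for a fixed string."""
    lines = text.splitlines()
    hits = []
    for i, line in enumerate(lines):
        if needle in line:
            hits.append(lines[max(0, i - before): i + after + 1])
    return hits

pretty_hits = grep_context(PRETTY, "<Year>2015</Year>")
compact_hits = grep_context(COMPACT, "<Year>2015</Year>")

print(pretty_hits)
print(compact_hits)
```

On the pretty-printed text the match comes back with exactly the one Customer block around it. On the compact text the whole document is a single line, so the "match" is the entire file, 2014 record included.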

SAX

So a more reliable approach that still avoids loading everything into memory would be to use SAX. SAX implementations are conceptually quite simple, but a little tedious to write. Essentially, you have to subclass a class which provides methods that are called when certain 'events' occur in the source XML document. In the xml.sax.handler module in the Python standard library, this class is ContentHandler. The methods include:

  • startElement
  • endElement
  • characters

Your overridden methods then determine how to handle those events. In a typical implementation of startElement(name, attrs) you might test the name argument to determine what the tag name of the element is. You might then maintain a stack of elements you have entered. When endElement(name) occurs, you might then pop the top element off that stack, and possibly do some processing on the completed element. The characters(content) method is called when character data is encountered in the source document; note that a parser may deliver one run of text in several calls. In this method you might consider building up a string of the character data which can then be processed when you encounter an endElement.
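To see the order in which those three callbacks fire, a minimal sketch like the following may help. The EventLogger class is a hypothetical name used only for this illustration, and the input is a tiny in-memory document rather than a real file.

```python
from xml.sax import parseString
from xml.sax.handler import ContentHandler

class EventLogger(ContentHandler):
    """Record every SAX event in the order it arrives."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # Skip whitespace-only chunks so the log stays readable.
        if content.strip():
            self.events.append(("chars", content))

handler = EventLogger()
parseString(b"<Customer><Name>A</Name><Year>2015</Year></Customer>", handler)

for event in handler.events:
    print(event)
```

Each element produces a start event, then its text, then an end event, and the enclosing Customer only ends after both children have ended. That is why the year can only be checked reliably once the Customer element closes.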

So for your specific task, something like this may work:

from xml.sax import parse
from xml.sax.handler import ContentHandler

class filter2015(ContentHandler):
    def __init__(self):
        ContentHandler.__init__(self)
        self.elements = []          # stack of elements
        self.char_data = u''        # string buffer
        self.current_customer = u'' # name of customer
        self.current_year = u''

    def startElement(self, name, attrs):
        # only track the elements whose text we need
        if name in (u'Name', u'Year'):
            self.elements.append(name)

    def characters(self, chars):
        # may be called more than once per text node, so accumulate
        if self.elements and self.elements[-1] in (u'Name', u'Year'):
            self.char_data += chars

    def endElement(self, name):
        # only pop the element we actually pushed
        if self.elements and self.elements[-1] == name:
            self.elements.pop()

        if name == u'Name':
            self.current_customer = self.char_data
            self.char_data = u''
        if name == u'Year':
            self.current_year = self.char_data
            self.char_data = u''

        if name == u'Customer':
            # wait to check the year until the Customer is closed
            if self.current_year.strip() == u'2015':
                print('Found:', self.current_customer)

            # clear the buffers now that the Customer is finished
            self.current_year = u''
            self.current_customer = u''
            self.char_data = u''

# open in binary mode so the parser handles the encoding itself
with open('test.xml', 'rb') as source:
    parse(source, filter2015())
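If the matches need to go to a file rather than the console, a sketch along these lines writes each row as soon as its Customer closes, so nothing accumulates in memory. The CustomerCsvWriter class, the Customers root element and the sample data are assumptions for the illustration, and an in-memory buffer stands in for the real output file.

```python
import csv
import io
from xml.sax import parseString
from xml.sax.handler import ContentHandler

class CustomerCsvWriter(ContentHandler):
    """Write one CSV row for every Customer with the wanted year."""

    def __init__(self, out_file, year):
        ContentHandler.__init__(self)
        self.writer = csv.writer(out_file)
        self.year = year
        self.field = None                       # 'Name' or 'Year' while inside it
        self.record = {"Name": "", "Year": ""}  # text of the current Customer

    def startElement(self, name, attrs):
        if name == "Customer":
            self.record = {"Name": "", "Year": ""}
        elif name in ("Name", "Year"):
            self.field = name

    def characters(self, content):
        if self.field is not None:
            self.record[self.field] += content

    def endElement(self, name):
        if name in ("Name", "Year"):
            self.field = None
        elif name == "Customer" and self.record["Year"].strip() == self.year:
            # One row per match; only the current record is ever held.
            self.writer.writerow([self.record["Name"].strip(), self.year])

sample = b"""<Customers>
<Customer><Name>A</Name><Year>2015</Year></Customer>
<Customer><Name>B</Name><Year>2014</Year></Customer>
<Customer><Name>C</Name><Year>2015</Year></Customer>
</Customers>"""

out = io.StringIO()  # stand-in for open('result.csv', 'w', newline='')
parseString(sample, CustomerCsvWriter(out, "2015"))
csv_text = out.getvalue()
print(csv_text)
```

With a real file, the same handler would be passed to xml.sax.parse together with the opened XML file, and the output file would be opened with newline='' as the csv module recommends.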

6 Comments

Great answer :) Now, if I want to save the result to a CSV file, will that overflow the memory, since a write operation is still needed?
Is it possible to use XPath with SAX?
I would suggest that writing large CSV files should be the subject of another question (probably already answered). Basically, careful use of flush might be what you need.
Whether or not you can use XPath with SAX is just a feature of the XPath processor you're using. The Java/C# Saxon XSLT processor implements XPath 1.0 and 2.0 (as a conformant XSLT 2.0 implementation) and is written using the SAX API.
With the code above, I have a problem: what if there are two tags named [year]? How can I retrieve both?
1

Check out this question. The approach there lets you iterate over the document as a generator:

python: is there an XML parser implemented as a generator?

You want to use a generator so that you don't load the entire doc into memory first.

Specifically:

import xml.etree.ElementTree as ET

for event, element in ET.iterparse('huge.xml'):
    if event == 'end' and element.tag == 'ticket':
        # process ticket...
        element.clear()  # release the element's children once processed

Source: http://enginerds.craftsy.com/blog/2014/04/parsing-large-xml-files-in-python-without-a-billion-gigs-of-ram.html
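Adapted to the Customer/Year structure from the question, the same idea might look like the sketch below. The customers_in_year function, the Customers root element and the sample data are assumptions for the illustration, and an in-memory buffer stands in for the 7 GB file.

```python
import io
import xml.etree.ElementTree as ET

def customers_in_year(source, year):
    """Yield the Name of each Customer whose Year matches, one at a time."""
    for event, element in ET.iterparse(source, events=("end",)):
        if element.tag == "Customer":
            if element.findtext("Year", "").strip() == year:
                yield element.findtext("Name", "").strip()
            # Drop the finished Customer's children so they can be freed.
            element.clear()

sample = io.BytesIO(b"""<Customers>
<Customer><Name>A</Name><Year>2015</Year></Customer>
<Customer><Name>B</Name><Year>2014</Year></Customer>
<Customer><Name>C</Name><Year>2015</Year></Customer>
</Customers>""")

names = list(customers_in_year(sample, "2015"))
print(names)
```

For a real file, a path or a file opened in binary mode can be passed instead of the buffer. Note that clear() empties each Customer but the emptied elements stay attached to the root, so for a very large number of records it may be worth also removing them from the root.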

2 Comments

This is not a good solution for me, because I also have a DTD file, which describes some field types, and it is important for parsing, especially CDATA. I got the following error: File "<string>", line unknown ParseError: undefined entity &ouml;: line 47, column 18
This use of ElementTree is basically the same principle as SAX, just with a cleaner API. I might actually prefer this to my SAX answer :-)
