As your data is large, I assume you've already decided that you can't load the whole lot into memory, which is what a DOM-style (Document Object Model) parser would do. You've also tagged your question 'SAX' (the Simple API for XML), which further implies you know you need an approach that doesn't hold the whole document in memory.
Two approaches come to mind:
Using grep
Sometimes with XML it can be useful to use plain text processing tools. grep would allow you to filter through your XML document as plain text and find the occurrences of 2015:
$ grep -B 2 -A 1 "<Year>2015</Year>" test.xml
The -B and -A options instruct grep to print lines of context around each match: here, two lines before it and one line after it.
However, this approach only works if your XML is also laid out as precisely structured plain text, which XML has no obligation to be. The whitespace between elements could be arranged in any way (or be absent altogether) without changing what the document means, but the grep approach depends on the exact arrangement of lines.
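To make that fragility concrete, here is a minimal sketch in Python. The two sample documents and the grep_with_context helper are made up for illustration; the helper is only a rough stand-in for grep -B/-A using substring matching, not a reimplementation of grep.

```python
# Two hypothetical documents with the same elements but different layout.
pretty = """<Customers>
  <Customer>
    <Name>Alice</Name>
    <Year>2015</Year>
  </Customer>
</Customers>
"""
compact = "<Customers><Customer><Name>Alice</Name><Year>2015</Year></Customer></Customers>\n"


def grep_with_context(text, pattern, before=2, after=1):
    """Rough stand-in for `grep -B before -A after` (substring match only)."""
    lines = text.splitlines()
    hits = []
    for i, line in enumerate(lines):
        if pattern in line:
            # collect the matching line plus its surrounding context lines
            hits.append(lines[max(0, i - before): i + after + 1])
    return hits


pretty_hits = grep_with_context(pretty, "<Year>2015</Year>")
compact_hits = grep_with_context(compact, "<Year>2015</Year>")

# Pretty-printed input: the context is a handful of neighbouring lines.
print(pretty_hits)
# Single-line input: the "context" is the entire document.
print(compact_hits)
```

With the pretty-printed input the context lines happen to cover one customer record; with the single-line input the only "line" is the whole document, so the match tells you nothing about which customer it belongs to.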
SAX
So a more reliable approach that still avoids loading the document into memory would be to use SAX. SAX implementations are conceptually quite simple, but a little tedious to write. Essentially, you subclass a handler class and override the methods that are called when certain 'events' occur in the source XML document. In the xml.sax.handler module in the standard library, this class is ContentHandler. The methods include:
- startElement
- endElement
- characters
Your overridden methods then determine how to handle those events. In a typical implementation of startElement(name, attrs) you might test the name argument to determine which element has just been opened. You might then maintain a stack of the elements you have entered. When endElement(name) is called, you might pop the top element off that stack, and possibly do some processing on the completed element. The characters(content) method is called when character data is encountered in the source document; the parser may deliver a single run of text in several calls, so in this method you would typically build up a string of the character data, which can then be processed when you reach the matching endElement.
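To see the order in which these events arrive, a small tracing handler can help. This is only a sketch: EventTracer and the sample document are hypothetical names and data made up for illustration, and it records each event together with a copy of the stack of open elements.

```python
from xml.sax import parseString
from xml.sax.handler import ContentHandler


class EventTracer(ContentHandler):
    """Hypothetical helper: records each SAX event and the open-element stack."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.stack = []    # names of the elements currently open
        self.events = []   # (kind, value, stack snapshot) tuples

    def startElement(self, name, attrs):
        self.stack.append(name)
        self.events.append(('start', name, list(self.stack)))

    def characters(self, content):
        # may be called more than once per text run; skip whitespace-only text
        if content.strip():
            self.events.append(('chars', content, list(self.stack)))

    def endElement(self, name):
        self.events.append(('end', name, list(self.stack)))
        self.stack.pop()


tracer = EventTracer()
# made-up sample document
parseString(b"<Customer><Name>Alice</Name><Year>2015</Year></Customer>", tracer)
for event in tracer.events:
    print(event)
```

Each start event pushes a name and each end event pops it, so the top of the stack always names the element whose text is currently being delivered to characters.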
So for your specific task, something like this may work:
from xml.sax import parse
from xml.sax.handler import ContentHandler


class filter2015(ContentHandler):
    def __init__(self):
        ContentHandler.__init__(self)
        self.elements = []           # stack of open element names
        self.char_data = u''         # buffer for the text of the current Name/Year
        self.current_customer = u''  # name of the current customer
        self.current_year = u''

    def startElement(self, name, attrs):
        # push every element, so the pop in endElement always matches it
        self.elements.append(name)

    def characters(self, chars):
        # text can arrive in several pieces, so accumulate it
        if self.elements and self.elements[-1] in (u'Name', u'Year'):
            self.char_data += chars

    def endElement(self, name):
        self.elements.pop()
        if name == u'Name':
            self.current_customer = self.char_data
            self.char_data = u''
        elif name == u'Year':
            self.current_year = self.char_data
            self.char_data = u''
        elif name == u'Customer':
            # wait to check the year until the Customer is closed
            if self.current_year == u'2015':
                print('Found: ' + self.current_customer)
            # clear the buffers now that the Customer is finished
            self.current_year = u''
            self.current_customer = u''
            self.char_data = u''


# binary mode lets the parser detect the encoding itself
with open('test.xml', 'rb') as source:
    parse(source, filter2015())
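If you want to check the logic before running it over the large file, a variant that collects the matches in a list (rather than printing them) is easier to test. The following is a sketch under assumptions: the SAMPLE document is made up, since I don't know the real layout of your file, and CollectCustomers is a hypothetical name. It uses the same Customer/Name/Year events as the handler above, but takes the year as a parameter.

```python
from xml.sax import parseString
from xml.sax.handler import ContentHandler

# Hypothetical sample; the real file's layout may differ.
SAMPLE = b"""<Customers>
  <Customer><Name>Alice</Name><Year>2015</Year></Customer>
  <Customer><Name>Bob</Name><Year>2014</Year></Customer>
  <Customer><Name>Carol</Name><Year>2015</Year></Customer>
</Customers>"""


class CollectCustomers(ContentHandler):
    """Hypothetical variant: stores matching names instead of printing them."""

    def __init__(self, year):
        ContentHandler.__init__(self)
        self.year = year
        self.matches = []    # names of customers whose Year equals self.year
        self.fields = {}     # text of Name/Year for the current Customer
        self.current = None  # 'Name' or 'Year' while inside one, else None

    def startElement(self, name, attrs):
        if name == 'Customer':
            self.fields = {}
        elif name in ('Name', 'Year'):
            self.current = name
            self.fields[name] = ''

    def characters(self, content):
        # accumulate, since text may arrive in several pieces
        if self.current is not None:
            self.fields[self.current] += content

    def endElement(self, name):
        if name in ('Name', 'Year'):
            self.current = None
        elif name == 'Customer' and self.fields.get('Year') == self.year:
            self.matches.append(self.fields.get('Name', ''))


handler = CollectCustomers('2015')
parseString(SAMPLE, handler)
print(handler.matches)
```

For the real file you would pass the same handler to parse() with the open file instead of calling parseString on an in-memory sample.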