I receive XML chunks from a server. Those chunks are not complete segments but could look for instance like this:

chunk1 = '<el a="1" b='
chunk2 = '"2"><sub c="'
chunk3 = '3">test</sub'
chunk4 = '></el><el d='
chunk5 = '"4" e="5"></'
chunk6 = 'el>'

How can I parse this stream, so that whenever one "el" element is complete a function is called?

So far I'm taking this approach (using ElementTree):

import xml.etree.ElementTree as ET

text = ""

def handle_message(msg):
    global text  # text is rebound below, so it must be declared global
    text += msg
    try:
        root = ET.fromstring("<root>" + text + "</root>")
        for el in list(root):
            handle_element(el)
        text = ""
        return True
    except ET.ParseError:
        return False

However, this approach doesn't really work: it only calls handle_element when text happens to contain a well-formed XML document, and there is no guarantee that this will ever be the case.

  • if you want incremental XML parsing, you are using the wrong module... you want xml.sax. attach it to a simple file-type object that buffers data from the other end, and i think you'll have what you want. etree and other DOM-type parsers expect to load the whole file at once and work with it atomically. or try BeautifulSoup, haven't tried it but think it's supposed to handle these cases. Commented Jul 25, 2014 at 14:14
  • Ok thanks, I have a look at those two. But just to be clear, I don't have access to "the other end". I just get those string xml pieces and that's all I have. Commented Jul 25, 2014 at 14:20
  • Those are extremely small chunks. Can you up the buffer size of the socket connection to (maybe) allow the entire message to be received at once? Commented Jul 25, 2014 at 14:51
  • @notorious This is just an example, in reality they are larger. But no, I cannot do anything to guarantee that a complete element will be transmitted at once. I also cannot guarantee that, if one element will be transmitted at once, the chunk doesn't contain any additional and incomplete content after that element. Commented Jul 25, 2014 at 15:01

2 Answers

You could perhaps use ET.iterparse to incrementally parse the chunks of XML:

import xml.etree.ElementTree as ET

chunks = iter([
    '<root>'
    '<el a="1" b=',
    '"2"><sub c="',
    '3">test</sub',
    '></el><el d=',
    '"4" e="5"></',
    'el>',
    '</root>'
    ])


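class Source(object):
    def read(self, size):
        # Replace this with code that reads XML chunks from the server.
        # Return '' once the stream is exhausted so iterparse sees end-of-input
        # (a bare next(chunks) would raise StopIteration inside iterparse).
        return next(chunks, '')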

for event, elem in ET.iterparse(Source(), events=('end', )):
    if elem.tag == 'el':
        print(elem)
        # handle_element(elem)

yields

<Element 'el' at 0xb744f6cc>
<Element 'el' at 0xb744f84c>

The first argument to ET.iterparse is often a filename, or an io.BytesIO or io.StringIO object. It can, however, be any object that has a read method. Thus, if you create an object whose read method reads from the server, you can hook it into ET.iterparse to do incremental parsing.
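As a rough sketch of such an object, assuming whatever receives data from the server puts each string chunk on a queue.Queue and puts None when the stream ends (QueueSource and completed_elements are hypothetical names, not part of ElementTree). It also supplies a synthetic <root> wrapper, since the incoming stream has no single top-level element:

```python
import queue
import xml.etree.ElementTree as ET


class QueueSource:
    """File-like object whose read() hands out chunks taken from a queue.

    Hypothetical helper: the receiving code is assumed to put each str chunk
    on the queue, and None once the server has nothing more to send.
    """

    def __init__(self, chunk_queue):
        self._queue = chunk_queue
        self._pending = '<root>'   # synthetic root so the stream forms one document
        self._closed = False

    def read(self, size=-1):
        while not self._pending:
            if self._closed:
                return ''                  # empty string tells iterparse the input ended
            chunk = self._queue.get()      # blocks until a chunk arrives
            if chunk is None:              # end-of-stream marker (an assumption)
                self._closed = True
                chunk = '</root>'
            self._pending = chunk
        if size is None or size < 0:
            size = len(self._pending)
        # Hand out at most `size` characters and keep the rest for the next call.
        data, self._pending = self._pending[:size], self._pending[size:]
        return data


def completed_elements(source, tag='el'):
    """Yield each element with the given tag once its end tag has been parsed."""
    for _event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag == tag:
            yield elem


q = queue.Queue()
for chunk in ['<el a="1" b=', '"2"><sub c="', '3">test</sub',
              '></el><el d=', '"4" e="5"></', 'el>', None]:
    q.put(chunk)

seen = [dict(el.attrib) for el in completed_elements(QueueSource(q))]
print(seen)
```

Because read() blocks on the queue, this loop would normally run in its own thread while another thread receives data and fills the queue.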

Note that ET.iterparse will call the read method with a requested number of bytes (e.g. read(16384)). You can return fewer bytes if that is all the server gives you, but I'm not sure if anything bad will happen if you return more than the requested number of bytes. Ideally, you should be able to pass along the requested number of bytes to the server, and rely on the server to serve the right number of bytes (or fewer).
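If the chunks are pushed to you and controlling the read size is awkward, one alternative (not what the code above does) is xml.etree.ElementTree.XMLPullParser, available since Python 3.4, which accepts data through feed() instead of pulling it through read(). A minimal sketch, where ElementStream and on_element are hypothetical names and the <root> wrapper is again added artificially:

```python
import xml.etree.ElementTree as ET


class ElementStream:
    """Feeds pushed chunks to an XMLPullParser and reports finished elements.

    Hypothetical wrapper; on_element stands in for the question's handle_element.
    """

    def __init__(self, on_element, tag='el'):
        self._on_element = on_element
        self._tag = tag
        self._parser = ET.XMLPullParser(events=('end',))
        # Synthetic root: the incoming stream has no single top-level element.
        self._parser.feed('<root>')

    def feed(self, chunk):
        self._parser.feed(chunk)
        self._drain()

    def close(self):
        self._parser.feed('</root>')
        self._parser.close()
        self._drain()          # events still queued at close time remain readable

    def _drain(self):
        for _event, elem in self._parser.read_events():
            if elem.tag == self._tag:
                self._on_element(elem)


seen = []
stream = ElementStream(lambda el: seen.append((dict(el.attrib), len(el))))
for chunk in ['<el a="1" b=', '"2"><sub c="', '3">test</sub',
              '></el><el d=', '"4" e="5"></', 'el>']:
    stream.feed(chunk)
stream.close()
print(seen)
```

Depending on the Expat version underneath, an element may be reported somewhat later than the chunk that carries its closing tag, so close() should still be called when the connection ends to collect anything outstanding.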

1 Comment

Thanks. This is a great answer. I think I need to make some more modification to implement this correctly, but some first tests show, that it definitely does what I was looking for :)

You are trying to build an XML object before you have a proper XML string (which I believe you've already figured out). Basically, you have to concatenate (join) all the strings/chunks together and, once you have the complete XML, build an XML object from the complete string. Use an io.BytesIO or io.StringIO: whenever you get something from the server, write it to the buffer, then parse the buffer and take out what you need.

Twisted Example:

from io import StringIO

# XMLBuffer is a placeholder name; in Twisted these methods would live on
# your protocol class.
class XMLBuffer(object):
    def __init__(self):
        self.buffer = StringIO()    # Buffer obj

    def dataReceived(self, data):
        # this is data that is received from the server
        self.buffer.write(data)    # Usually want this in a callback

    def processBuffer(self):
        string = self.buffer.getvalue()
        # Do your parsing here.
        # Then, once you have the complete XML,
        # do etree.fromstring(string) or equivalent.

Hope that helps, we do something very similar at work, but I can't remember exactly how we implemented it.

2 Comments

Thanks, but unfortunately that wouldn't work for me. I want to trigger a function whenever one single element is complete and not only when the entire document is complete. Also, there is no entire document. I just receive single xml pieces.
If you're looking to parse each element individually, then the same buffer method (in some shape or form) must still apply. xml.sax may help, but I've never used it.
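For reference, a minimal sketch of the xml.sax route mentioned in the comments, assuming the default parser returned by xml.sax.make_parser() (the Expat-based one, which supports incremental feed() and close()). ElCollector and on_element are hypothetical names, and the <root> wrapper is added artificially because the stream has no top-level element. SAX does not build element objects, so this version only passes along the attributes of each completed el:

```python
import xml.sax


class ElCollector(xml.sax.ContentHandler):
    """Calls on_element with the attributes of every completed element of one tag."""

    def __init__(self, on_element, tag='el'):
        super().__init__()
        self._on_element = on_element
        self._tag = tag
        self._open = []   # attribute dicts of matching elements not yet closed

    def startElement(self, name, attrs):
        if name == self._tag:
            self._open.append(dict(attrs.items()))

    def endElement(self, name):
        if name == self._tag:
            self._on_element(self._open.pop())


seen = []
parser = xml.sax.make_parser()
parser.setContentHandler(ElCollector(seen.append))

parser.feed('<root>')          # synthetic root element
for chunk in ['<el a="1" b=', '"2"><sub c="', '3">test</sub',
              '></el><el d=', '"4" e="5"></', 'el>']:
    parser.feed(chunk)
parser.feed('</root>')
parser.close()
print(seen)
```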
