How to parse a file with text (with offset information) and binary data in Python

Question

I have an xml file, which contains a set of textual element tags (each contains the decimal offset value and data length of the corresponding binary element) and the whole binary data of all the elements at the end. An example is as follows.

<?xml version="1.0" encoding="UTF-8"?>
<Package>
  <element>
        <offset>0</offset>
        <length>2961181</length>
        <checksum>4238515972</checksum>
        <format>gzip</format>
  </element>
  <element>
        <offset>2961181</offset>
        <length>5442</length>
        <checksum>4238515972</checksum>
        <format>bin</format>
  </element>
</Package>
BINARY_DATA

Please note that the offset is decimal and counts from the first byte after the headers. How can I parse this file in Python, grab each element based on its offset, decompress it (if its format is gzip) and store it as a file?

Based on the replies from OmnipotentEntity and Jakob_B, I wrote the following short script, just to see if it works for the first element:

import zlib

f = open("file.xml", "r")
text = f.read()
position = text.find("</Package>\n")
headerSize=position+ len("</Package>\n") + 1 
offset=0
f.seek(headerSize + offset) 
length = 2961181
bin_data = f.read(length)
zipped=1
if (zipped):
  ungziped_str = zlib.decompressobj().decompress('x\x9c' + bin_data)
  print(ungziped_str)
f.close()

However, I got the following error:

Traceback (most recent call last):
  File "file_parse.py", line 11, in ?
    ungziped_str = zlib.decompressobj().decompress('x\x9c' + bin_data)
zlib.error: Error -3 while decompressing: invalid block type

What is the problem? Is the input file incorrect, or is the code incorrect?

If I run that on your test XML (the one that has BINARY_DATA after the XML) and set length=10 for testing, I get "INARY_DATA". Remember there are only three types of bugs in programming: unexpected inputs and off-by-one errors. — Spacedman
– Spacedman, Commented Nov 22, 2010 at 13:49
Thank you, Spacedman. It was an off-by-one error. I changed it to headerSize=position+ len("</Package>\n"), but now I get another error: ungziped_str = zlib.decompressobj().decompress('x\x9c' + bin_data) zlib.error: Error -3 while decompressing: invalid stored block lengths — pepero
– pepero, Commented Nov 22, 2010 at 14:09
Probably getting hard to debug without us having a file to play with. Preferably one that isn't toooooo big. — Spacedman
– Spacedman, Commented Nov 22, 2010 at 15:42

Spacedman · Accepted Answer · 2010-11-22 10:40:15Z

2

The trick is going to be stopping XML parsers from puking on the binary data. lxml lets you feed a line at a time to a parser, so you can watch for the last XML tag and stop there:

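from lxml import etree

def process(filename):
    parser = etree.XMLParser()
    # Binary mode: the data after the header is not valid text
    with open(filename, "rb") as f:
        for l in f:
            parser.feed(l)
            # Stop at the closing tag, before the binary data starts
            if l == b"</Package>\n":
                break
    return parser.close()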

That returns an Element:

>>> r = process("junk.xml")
>>> r
<Element Package at 9f14eb4>

which is an lxml object you can get the data out of. The second object's offset is here:

>>> r[1][0].text
'2961181'

and so on. That should be enough for you to build a working solution. Beware the line ending on the Package tag, though: this might not work if the file uses a different line ending, and there might be a better way to do that check.
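
For reference, a minimal sketch of the same idea using only the standard library, assuming Python 3. It swaps lxml for xml.etree.ElementTree.XMLParser (which also has feed() and close()), reads from a file opened in binary mode, and also returns the position of the first byte after the header. The function name read_header and the tolerance for "\r\n" line endings are assumptions made for illustration, not part of the answer above.

```python
import xml.etree.ElementTree as ET

def read_header(f, end_tag=b"</Package>"):
    """Feed lines to the parser until end_tag; return (root, data_start)."""
    parser = ET.XMLParser()
    # iter(f.readline, b"") reads one line at a time until end of file
    for line in iter(f.readline, b""):
        parser.feed(line)
        # strip() tolerates "\n" as well as "\r\n" after the closing tag
        if line.strip() == end_tag:
            break
    # f.tell() is now the offset of the first byte after the header
    return parser.close(), f.tell()
```

It would be called as root, data_start = read_header(f) on a file opened with open("junk.xml", "rb"); data_start is then the base to which each element's offset is added.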

1 Comment

vonPetrushev Over a year ago

Yes! This is the correct way to do it! I didn't know that you could feed the parser line by line. @pepero: is this a proprietary storage approach?

Jakob Bowyer · Accepted Answer · 2010-11-22 10:03:16Z

1

Why not run a search for the end tag using lxml? Then when the end tag is found just .seek() to that point and read binary data.
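
A minimal sketch of that approach, assuming Python 3 and swapping lxml for a plain byte search to keep the example dependency-free. The helper name find_data_start is hypothetical. It assumes the first occurrence of the end tag </Package> is the real one, and that exactly one line ending follows it before the binary data.

```python
def find_data_start(raw, end_tag=b"</Package>"):
    """Return the index of the first binary byte in raw (a bytes object)."""
    pos = raw.find(end_tag)
    if pos == -1:
        raise ValueError("end tag not found")
    pos += len(end_tag)
    # Skip exactly one line ending ("\r\n" or "\n") after the tag
    if raw[pos:pos + 2] == b"\r\n":
        pos += 2
    elif raw[pos:pos + 1] == b"\n":
        pos += 1
    return pos
```

With the whole file read in binary mode into raw, an element is then raw[start + offset : start + offset + length], where start = find_data_start(raw). Computing the position from the tag length plus the line ending avoids an off-by-one error in the header size.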

OmnipotentEntity · Accepted Answer · 2010-11-22 09:24:09Z

0

Determine header size.

Grab offset and data length using xml magic

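import zlib

# f is the input file opened in binary mode ("rb"); headerSize, offset,
# length and zipped come from the two steps above.
f.seek(headerSize + offset)
mydata = f.read(length)
if zipped:
  # Prepending a zlib header only works for raw deflate data; for a stream
  # with a real gzip header, zlib.decompress(mydata, 16 + zlib.MAX_WBITS) is needed.
  ungziped_str = zlib.decompressobj().decompress(b'x\x9c' + mydata)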

Then write to file as normal.

Source for gunzip magic http://codingrecipes.com/ungzip-a-string-in-python-gzinflate-in-python
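
The seek/read/decompress steps could be sketched like this in Python 3. This is an illustration under stated assumptions, not the code from the linked recipe: it assumes an element marked gzip is a complete gzip stream (header and trailer included), which zlib can decode when wbits is 16 + zlib.MAX_WBITS. The function name extract_element and its parameters are hypothetical.

```python
import zlib

def extract_element(f, header_size, offset, length, fmt):
    """Read one element from binary file f and decompress it if needed."""
    # Offsets count from the first byte after the header
    f.seek(header_size + offset)
    data = f.read(length)
    if len(data) != length:
        raise ValueError("file is shorter than the header claims")
    if fmt == "gzip":
        # wbits = 16 + MAX_WBITS makes zlib expect a gzip header and trailer
        data = zlib.decompress(data, 16 + zlib.MAX_WBITS)
    return data
```

The returned bytes can then be written out with a file opened in "wb" mode. If the data turns out to be raw deflate without a gzip header, this call will raise zlib.error and a different wbits value would be needed.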

2 Comments

vonPetrushev Over a year ago

As I understood, you don't have knowledge of offset before you parse the xml header. Also, in the above example, there are two binary parts as specified in the xml. The real problem is how to read and parse the xml - since we don't know where the xml ends.

pepero Over a year ago

Hi OmnipotentEntity, thank you for your answer. As vonPetrushev pointed out, the offset values etc. are not known in advance, so the XML should probably be read as a string first to build a data structure of the (offset, length) pairs of all elements, and then your seek/unzip/write approach follows. If this is correct, what would the (offset, length) pairs part look like?
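
One possible shape for those (offset, length) pairs, sketched in Python 3 with the standard library's xml.etree.ElementTree. It assumes the XML header has already been separated from the binary data, and that every element has offset, length and format children as in the question's example. The name element_table is hypothetical.

```python
import xml.etree.ElementTree as ET

def element_table(header_xml):
    """Return a list of (offset, length, format) tuples from the header."""
    root = ET.fromstring(header_xml)
    table = []
    for el in root.findall("element"):
        offset = int(el.findtext("offset"))   # decimal, relative to data start
        length = int(el.findtext("length"))
        table.append((offset, length, el.findtext("format")))
    return table
```

Each tuple then supplies the arguments for one seek/read/decompress pass over the binary part of the file.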
