Further to this question: Handling and working with binary data HEX with python (and thanks to the awesome pointers I received), I'm stuck on one last aspect of the tool.

I am basically writing a cleaner for files that have data past the EOF marker. This extra data means they fail some validation tools. I need to strip the extra data so the files can be presented to the validator, but I don't want to throw this data away (in fact I have to keep it...)

I've written an XML container to hold the data, plus a few other provenance/audit type values, but I'm (still) stuck on elegantly moving between raw binary and something I can "bake" into a file.

example:

A jpg file ends with (hex editor view) 96 1a 9c fd ab 4f 9e 69 27 ad fd da 0a db 76 bb ee d2 6a fd ff 00 ff d9 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff

The EOF marker for jpg is ff d9, so the cleaner works backwards through the file until it finds a match against the EOF marker. In this case it would create a new jpg file stopping at the ff d9 and then attempt to write the stripped data to the XML (via the ElementTree lib): changeString.text = str(excessData)

Of course this won't work, as the XML writer is looking to write ASCII, not binary dumps.

In the above case, the error is UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128), which I can see is because 0xff is not a valid ASCII character.

My question, therefore, is how do I elegantly deal with this raw data in a way that it can be stored and used in the future? (I plan to write an 'uncleaner' next that can take the clean file and the XML and reconstruct the original file...)
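For reference, the split described above (and the inverse the 'uncleaner' would need) might look roughly like the sketch below. It is written for Python 3 and works on in-memory bytes; `split_at_eof` and `rejoin` are hypothetical names, not functions from the original script. It assumes the last ff d9 in the file is the real end-of-image marker, which would give the wrong split if the trailing data itself contained ff d9.

```python
JPEG_EOI = b"\xff\xd9"  # JPEG end-of-image marker


def split_at_eof(data, marker=JPEG_EOI):
    """Return (clean, excess), splitting just after the last marker found."""
    pos = data.rfind(marker)  # search backwards from the end of the data
    if pos == -1:
        # No marker at all: treat everything as clean, nothing to strip.
        return data, b""
    end = pos + len(marker)
    return data[:end], data[end:]


def rejoin(clean, excess):
    """Inverse of split_at_eof: rebuild the original bytes."""
    return clean + excess


# Tail of the example file, shortened to four trailing ff bytes.
sample = b"\x6a\xfd\xff\x00\xff\xd9" + b"\xff" * 4
clean, excess = split_at_eof(sample)
```

Here `clean` ends at the ff d9 and `excess` holds the four trailing ff bytes, so `rejoin(clean, excess)` gives back `sample`.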

______EDIT_______

Using the suggestions from below, this is the traceback:

Traceback (most recent call last):
  File "C:\...\EOF_cleaner\scripts\test6.py", line 87, in <module>
    main()
  File "C:\...\EOF_cleaner\scripts\test6.py", line 73, in main
    splitFile(f_data, offset)
  File "C:\...EOF_cleaner\scripts\test6.py", line 60, in splitFile
    makeXML(excessData)
  File "C:\...\EOF_cleaner\scripts\test6.py", line 53, in makeXML
    ET.ElementTree(root).write(noteFile)
  File "c:\python27\lib\xml\etree\ElementTree.py", line 815, in write
    serialize(write, self._root, encoding, qnames, namespaces)
  File "c:\python27\lib\xml\etree\ElementTree.py", line 934, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "c:\python27\lib\xml\etree\ElementTree.py", line 934, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "c:\python27\lib\xml\etree\ElementTree.py", line 934, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "c:\python27\lib\xml\etree\ElementTree.py", line 932, in _serialize_xml
    write(_escape_cdata(text, encoding))
  File "c:\python27\lib\xml\etree\ElementTree.py", line 1068, in _escape_cdata
    return text.encode(encoding, "xmlcharrefreplace")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)

The lines involved are changeString.text = excessData.encode('base64') (line 45) and ET.ElementTree(root).write(noteFile) (line 53).

4
  • Have you tried writing unicode(str(excessData))? Commented Sep 25, 2012 at 22:42
  • @MahmoudAladdin Thanks, but that gives the same error! Commented Sep 25, 2012 at 23:03
  • Your traceback shows that something else is adding binary data to your tree too. If you disable the changeString.text line altogether, do you still get the error? Commented Sep 25, 2012 at 23:04
  • @MartijnPieters You are of course correct :( Time to go hunting... Commented Sep 25, 2012 at 23:06
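One way to go hunting is to walk the tree before writing it and report any element whose text, tail or attribute values are byte strings that are not plain ASCII. The sketch below is written for Python 3; `find_non_ascii_bytes` and the element names are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def find_non_ascii_bytes(root):
    """Yield (tag, field) for each byte-string value that is not pure ASCII."""
    for elem in root.iter():
        fields = [("text", elem.text), ("tail", elem.tail)]
        fields += [("@" + key, value) for key, value in elem.attrib.items()]
        for name, value in fields:
            if isinstance(value, bytes):
                try:
                    value.decode("ascii")
                except UnicodeDecodeError:
                    yield elem.tag, name


# Hypothetical tree: one harmless element, one holding raw binary.
root = ET.Element("note")
ET.SubElement(root, "ok").text = "plain text"
ET.SubElement(root, "bad").text = b"\xff\xff"

problems = list(find_non_ascii_bytes(root))
print(problems)
```

Running this on the real tree just before the `write()` call should point at whichever element still holds unencoded binary data.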

2 Answers

4

Use Base64:

excessData.encode('base64')

It'll be easy to turn that back to binary data later on with a simple .decode('base64') call.

Base64 encodes to ASCII data safe for inclusion in XML, in a reasonably compact format; every 3 bytes of binary information become 4 Base64 characters.
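A minimal round-trip sketch, assuming Python 3 and the standard `base64` module (in Python 2.7, as used in the question, `excessData.encode('base64')` is the shorthand shown above). The element names `cleanerNote` and `excessData` are made up for illustration and are not taken from the asker's XML container.

```python
import base64
import xml.etree.ElementTree as ET

excess_data = b"\xff" * 8  # stand-in for the bytes stripped from the file

# Cleaner side: store the binary data as Base64 text in the XML.
root = ET.Element("cleanerNote")            # hypothetical element names
change = ET.SubElement(root, "excessData")
change.text = base64.b64encode(excess_data).decode("ascii")
xml_bytes = ET.tostring(root)

# "Uncleaner" side: parse the XML and turn the text back into bytes.
stored_text = ET.fromstring(xml_bytes).findtext("excessData")
restored = base64.b64decode(stored_text)
```

The value assigned to `.text` is plain ASCII text, so the XML writer has nothing it cannot encode, and `restored` is byte-for-byte identical to `excess_data`.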


10 Comments

Thank you, that looks simple enough... I tried placing that line verbatim after the line that declares the excessData variable (excessData = f_data[offset:]) but I still get the UnicodeDecodeError. Do I need to do something else?
@JayGattuso, the encode doesn't modify the data in-place, it returns a new string. You probably want excessData = f_data[offset:].encode('base64')
changeString.text = excessData.encode('base64') should work; .encode() returns the encoded value, it doesn't change the string in-place.
Hmm, OK, so now I'm using excessData = excessData.encode('base64'), however I still get the error - I suspect that the changeString.text = excessData may be trying to ASCIIify the variable?
No, it is trying to unicodify it, and that should just work since Base64 is all ASCII; a run of \xff bytes becomes a series of / slashes in Base64. What is the code and the traceback?
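The point about \xff bytes is easy to check. A small sketch, assuming Python 3 and the standard `base64` module:

```python
import base64

encoded = base64.b64encode(b"\xff" * 6)
print(encoded)       # b'////////'
print(len(encoded))  # 8: six input bytes become eight Base64 characters
```

Every character in the result is ASCII, so if the error persists after encoding, the offending bytes are most likely coming from somewhere else in the tree.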
1

To convert raw bytes to space-separated ASCII hex, you can use something like:

>>> a = b"abc\x01\x02"
>>> print(" ".join("{:02x}".format(x) for x in bytearray(a)))
61 62 63 01 02

However, as mentioned in other answers, something like Base64 is probably going to be more efficient and easier to work with.
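To make the comparison concrete, here is a rough sketch (Python 3, standard library only) that round-trips the same bytes through hex text and compares the size with Base64. The variable names are illustrative.

```python
import base64
import binascii

raw = b"abc\x01\x02"

# Hex: two characters per byte, plus separators if spaced out.
hex_text = binascii.hexlify(raw).decode("ascii")
spaced = " ".join(hex_text[i:i + 2] for i in range(0, len(hex_text), 2))

# Back to bytes: drop the spaces, then unhexlify.
back = binascii.unhexlify(spaced.replace(" ", ""))

# Base64: four characters per three bytes (padded).
b64_text = base64.b64encode(raw).decode("ascii")

print(spaced)                         # 61 62 63 01 02
print(len(hex_text), len(b64_text))   # 10 8
```

Hex is easier to read in a text editor, while Base64 is the more compact of the two, and the gap grows with the amount of data.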

1 Comment

Nice, this gives the same result as binascii.hexlify(a), just with spaces between the bytes.
