Python - convert a raw binary dump into ASCII HEX bytes

Question

Further to this question: Handling and working with binary data HEX with python (and thanks to the awesome pointers I received), I'm stuck on one last aspect of my tool.

I am basically writing a cleaner for files that have data past the EOF marker. This extra data means they fail some validation tools. I need to strip the extra data so the files can be presented to the validator, but I don't want to throw this data away (in fact I have to keep it...)

I've written an XML container to hold the data, and a few other provenance/audit type values, but I'm (still) stuck on elegantly moving between raw binary and something I can "bake" in to a file.

example:

A jpg file ends with (hex editor view) 96 1a 9c fd ab 4f 9e 69 27 ad fd da 0a db 76 bb ee d2 6a fd ff 00 ff d9 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff

The EOF marker for jpg is ff d9, so the cleaner works backwards through the file until it finds a match against the EOF marker. In this case it would create a new jpg file stopping at the ff d9 and then attempt to write the stripped data to the XML (via the ElementTree lib): changeString.text = str(excessData)

Of course this won't work, as the XML writer expects to write ASCII, not binary dumps.

In the above case, the error is UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128), which I can see is because 0xff is not a valid ASCII character.

My question, therefore, is how do I elegantly deal with this raw data in a way that it can be stored and used in the future? (I plan to write an 'uncleaner' next that can take the clean file and the XML and reconstruct the original file...)
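For reference, the backwards search described above can be sketched roughly as follows. This is only an illustration: split_at_eof and JPEG_EOI are hypothetical names, not taken from the actual script, and the sample bytes are a shortened version of the dump above. It is written for Python 3 bytes objects.

```python
JPEG_EOI = b"\xff\xd9"  # JPEG end-of-image marker, as named in the question


def split_at_eof(data, marker=JPEG_EOI):
    """Return (clean, excess): clean ends at the last marker, excess is the rest."""
    pos = data.rfind(marker)  # rfind searches from the end of the data
    if pos == -1:
        raise ValueError("EOF marker not found")
    cut = pos + len(marker)
    return data[:cut], data[cut:]


# Shortened stand-in for the tail of the jpg shown above.
sample = b"\x6a\xfd\xff\x00\xff\xd9" + b"\xff" * 8
clean, excess = split_at_eof(sample)
```

Note that this keeps everything up to the last ff d9 in the file, so it would cut in the wrong place if the trailing junk itself happened to contain ff d9.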

______EDIT_______

Using the suggestions from below, this is the traceback:

Traceback (most recent call last):
  File "C:\...\EOF_cleaner\scripts\test6.py", line 87, in <module>
    main()
  File "C:\...\EOF_cleaner\scripts\test6.py", line 73, in main
    splitFile(f_data, offset)
  File "C:\...EOF_cleaner\scripts\test6.py", line 60, in splitFile
    makeXML(excessData)
  File "C:\...\EOF_cleaner\scripts\test6.py", line 53, in makeXML
    ET.ElementTree(root).write(noteFile)
  File "c:\python27\lib\xml\etree\ElementTree.py", line 815, in write
    serialize(write, self._root, encoding, qnames, namespaces)
  File "c:\python27\lib\xml\etree\ElementTree.py", line 934, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "c:\python27\lib\xml\etree\ElementTree.py", line 934, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "c:\python27\lib\xml\etree\ElementTree.py", line 934, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "c:\python27\lib\xml\etree\ElementTree.py", line 932, in _serialize_xml
    write(_escape_cdata(text, encoding))
  File "c:\python27\lib\xml\etree\ElementTree.py", line 1068, in _escape_cdata
    return text.encode(encoding, "xmlcharrefreplace")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)

The lines involved are changeString.text = excessData.encode('base64') (line 45) and ET.ElementTree(root).write(noteFile) (line 53).

Your traceback shows that something else is adding binary data to your tree too. If you disable the changeString.text line altogether, do you still get the error?
– Martijn Pieters, Commented Sep 25, 2012 at 23:04
@MartijnPieters You are of course correct :( Time to go hunting...
– Jay Gattuso, Commented Sep 25, 2012 at 23:06

Martijn Pieters · Accepted Answer · 2012-09-25 22:40:16Z

4

Use Base64:

excessData.encode('base64')

It'll be easy to turn that back to binary data later on with a simple .decode('base64') call.

Base64 encodes to ASCII data safe for inclusion in XML, in a reasonably compact format; every 3 bytes of binary information become 4 Base64 characters.
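As a rough sketch of the full round trip: the str.encode('base64') codec above exists only in Python 2, so this example uses the standard base64 module instead and is written for Python 3. The element names cleaned and excessData and the sample bytes are made up for illustration and are not from the asker's script.

```python
import base64
import xml.etree.ElementTree as ET

excess = b"\xff" * 9 + b"\x00\x01"  # stand-in for the stripped bytes

# Build a tiny XML container (element names are hypothetical).
root = ET.Element("cleaned")
change = ET.SubElement(root, "excessData")
# b64encode returns bytes; decode to text so ElementTree gets plain ASCII.
change.text = base64.b64encode(excess).decode("ascii")

xml_bytes = ET.tostring(root)  # serialises without a Unicode error

# Later, e.g. in the "uncleaner": parse the XML and recover the raw bytes.
restored = base64.b64decode(ET.fromstring(xml_bytes).find("excessData").text)
```

Appending restored to the cleaned file's bytes should then reproduce the original file.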

answered Sep 25, 2012 at 22:40

Martijn Pieters

10 Comments

Jay Gattuso Over a year ago

Thank you, that looks simple enough... I tried placing that line verbatim after the line that declares the excessData variable (excessData = f_data[offset:]) but I still get the UnicodeDecodeError. Do I need to do something else?

Mark Ransom Over a year ago

@JayGattuso, the encode doesn't modify the data in-place, it returns a new string. You probably want excessData = f_data[offset:].encode('base64')

Martijn Pieters Over a year ago

changeString.text = excessData.encode('base64') should work; .encode() returns the encoded value, it doesn't change the string in-place.

Jay Gattuso Over a year ago

Hmm, OK, so now I'm using excessData = excessData.encode('base64'), however I still get the error - I suspect that the changeString.text = excessData may be trying to ASCIIify the variable?

Martijn Pieters Over a year ago

No, it is trying to unicodify it, and that should just work since Base64 is all ASCII; a run of \xff bytes simply becomes a series of / characters in Base64. What is the code and the traceback?

Greg Hewgill · Accepted Answer · 2012-09-25 22:48:44Z

1

To convert raw bytes to space-separated ASCII hex, you can use something like:

>>> a = "abc\x01\x02"
>>> # ord() is needed because iterating a Python 2 str yields 1-character strings
>>> print(" ".join("{:02x}".format(ord(x)) for x in a))
61 62 63 01 02

However, as mentioned in other answers, something like Base64 is probably going to be more efficient and easier to work with.
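For comparison, here is a minimal sketch of a hex round trip using the standard binascii module, written for Python 3 bytes objects. The variable names are illustrative only. hexlify produces the digits without separators, so the spaces are added and removed by hand.

```python
import base64
import binascii

raw = b"abc\x01\x02"

# hexlify gives two hex digits per byte, with no separators.
hex_text = binascii.hexlify(raw).decode("ascii")
# Insert a space between each pair of digits for a hex-editor style view.
spaced = " ".join(hex_text[i:i + 2] for i in range(0, len(hex_text), 2))
# Strip the spaces again before converting back to raw bytes.
back = binascii.unhexlify(spaced.replace(" ", ""))

# Base64 of the same bytes, to compare the encoded sizes.
b64_text = base64.b64encode(raw).decode("ascii")
```

For these 5 bytes the unspaced hex text is 10 characters and the Base64 text is 8, which is the size difference the answer refers to.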

edited Sep 25, 2012 at 22:48

answered Sep 25, 2012 at 22:41

Greg Hewgill

1 Comment

papachan Over a year ago

Nice, it gives the same result as binascii.hexlify(a)
