I have arbitrary binary data. I need to store it in a system that expects valid UTF8. It will never be interpreted as text, I just need to put it in there and be able to retrieve it and reconstitute my binary data.

Obviously base64 would work, but I can't have that much inflation.

How can I easily achieve this in python 2.7?

  • Question at the margin: Python 2 or 3? Commented Aug 30, 2014 at 14:39
  • What do you intend to do with these bytes? Commented Aug 30, 2014 at 14:40
  • @SylvainLeroux It's 2.7 Commented Aug 30, 2014 at 14:41
  • Is this an accurate summary of what you need? "I have arbitrary binary data. I need to store it in a system that expects valid UTF8. It will never be interpreted as text, I just need to put it in there and be able to retrieve it and reconstitute my binary data." Commented Aug 30, 2014 at 14:51
  • @NobuGames: You need to take into account that control characters in the ASCII range could be interpreted too; that's why it is still safest to stick to the printable range; otherwise you could use a Base128-style encoding to pack 7 data bytes into 8 7-bit characters. Commented Aug 30, 2014 at 15:14

3 Answers

You'll have to express your data using just ASCII characters. Base64 is the most space-efficient method available in the Python standard library for making binary data fit in printable text that is also UTF-8 safe. Sure, it requires 33% more space to express the same data, but other stdlib methods take even more additional space.

You can combine this with compression to limit how much space this is going to take, but make the compression optional (mark the data) and only actually use it if the data is going to be smaller.

import zlib
import base64

def pack_utf8_safe(data):
    """Encode binary data as ASCII text, compressing first when it helps."""
    is_compressed = False
    compressed = zlib.compress(data)
    if len(compressed) < (len(data) - 1):
        # Compression saved space; keep the smaller form.
        data = compressed
        is_compressed = True
    base64_encoded = base64.b64encode(data)
    if is_compressed:
        # '.' is not in the Base64 alphabet, so it safely marks compressed payloads.
        base64_encoded = '.' + base64_encoded
    return base64_encoded

def unpack_utf8_safe(base64_encoded):
    """Reverse pack_utf8_safe(): strip the marker, decode, decompress if needed."""
    decompress = False
    if base64_encoded.startswith('.'):
        base64_encoded = base64_encoded[1:]
        decompress = True
    data = base64.b64decode(base64_encoded)
    if decompress:
        data = zlib.decompress(data)
    return data

The '.' character is not part of the Base64 alphabet, so I used it here to mark compressed data.
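For example, a quick round trip with the functions above (Python 2.7; the sample data is repetitive, so compression is certain to win here):

>>> blob = b'\x00\x01\x02' * 100
>>> packed = pack_utf8_safe(blob)
>>> packed.startswith('.')   # compressed, so the marker is present
True
>>> unpack_utf8_safe(packed) == blob
True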

You could further shave off the 1 or 2 = padding characters from the end of the Base64 encoded data; these can then be re-added when decoding (append '=' * (-len(encoded) % 4)), but I'm not sure that's worth the bother.
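If you do trim the padding, a small sketch of that round trip (for data produced by b64encode, the restored count is always 0, 1 or 2):

def strip_padding(encoded):
    # Trailing '=' carries no information; it only pads to a multiple of 4.
    return encoded.rstrip('=')

def restore_padding(stripped):
    # -len(stripped) % 4 recovers exactly the number of '=' removed.
    return stripped + '=' * (-len(stripped) % 4)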

You can achieve further savings by switching to the Base85 encoding, a 4-to-5 ratio ASCII-safe encoding for binary data, so a 25% overhead. For Python 2.7 this is only available in an external library (Python 3.4 added it to the base64 module). On 2.7 you can use the python-mom project:

from mom.codec import base85

and replace all base64.b64encode() and base64.b64decode() calls with base85.b85encode() and base85.b85decode() calls instead.
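For what it's worth, on Python 3.4 or newer the same idea needs no external dependency; the standard library equivalent (Python 3 only, operating on bytes) would be:

import base64

encoded = base64.b85encode(data)     # 5 ASCII characters for every 4 bytes
original = base64.b85decode(encoded)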

If you are 100% certain nothing along the path is going to treat your data as text (possibly altering line separators, or interpreting and altering other control codes), you could also use Base128 encoding, reducing the overhead to a 14.3% increase (8 characters for every 7 bytes). I cannot, however, recommend a pip-installable Python module for this; there is a GitHub-hosted module, but I have not tested it.
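For illustration only, a self-contained Base128 packer could look like the sketch below (my own untested take on the idea, not the GitHub module mentioned above; it emits raw byte values 0-127, so the control-code caveat applies in full):

def b128encode(data):
    # Pack 8-bit bytes into 7-bit units: every 7 input bytes become 8 output bytes.
    out = bytearray()
    acc, nbits = 0, 0                  # bit accumulator and its current size
    for byte in bytearray(data):
        acc = (acc << 8) | byte
        nbits += 8
        while nbits >= 7:
            nbits -= 7
            out.append((acc >> nbits) & 0x7F)
        acc &= (1 << nbits) - 1        # drop bits already emitted
    if nbits:
        out.append((acc << (7 - nbits)) & 0x7F)  # left-align trailing bits
    return bytes(out)

def b128decode(encoded):
    out = bytearray()
    acc, nbits = 0, 0
    for byte in bytearray(encoded):
        acc = (acc << 7) | byte
        nbits += 7
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
        acc &= (1 << nbits) - 1
    return bytes(out)                  # trailing padding bits are discarded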

14 Comments

Actually, Base128 would be more efficient whilst fulfilling the requirement of using only ASCII characters. There are base128 libraries available (e.g. npmjs.org/package/base128).
@isedev: you'd have to guarantee that control characters are never going to be interpreted anywhere during the lifetime of the encoded data, though.
@isedev: for example, if this data is ever written to disk on a Windows system where the file is opened in text mode (quite common) then all \x0a codepoints (e.g. newlines) will be expanded to CRLF separators (so \x0d\x0a), disastrous for binary data.
Again, this raises the question of how you distinguish between line feed, vertical tab, and carriage return if you really are going to use paper as your medium. Of course, if you can use a font which has clearly distinguishable symbols for these normally nonprinting control characters, all is well.
@N.McA.: I'd do some testing with a binary string with all 255 possible byte values and round-trip that. Don't make assumptions about the QR generating library nor the QR reader not translating control codes here.

You can decode your bytes as ISO 8859-1 data, which will always produce a valid Unicode string. Then you can encode it to UTF-8:

utf8_data = my_bytes.decode('iso8859-1').encode('utf8')
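Recovering the original bytes is the same two steps in reverse:

my_bytes = utf8_data.decode('utf8').encode('iso8859-1')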

On average, half your data will be in the 0-127 range, which is one byte in UTF-8, and half will be in the 128-255 range, which is two bytes in UTF-8, so your result will be 50% larger than your input data.

If there is any structure to your data at all, then zlib-compressing it, as Martijn suggests, might reduce the size.

4 Comments

Base64 encoding only expands the data by 33%; I had considered using latin-1, but Base64 simply beats the odds here.
Hmm, I overlooked that! Base64 is better than UTF-8. :(
Plus Latin-1 has no printable glyphs for code points 128-159, and ASCII codes 0-31 (or even 32) do not conventionally have standard glyphs.
@tripleee: printability is not at issue here; the OP is putting this in a QR code.

If your application really requires you to be able to represent 256 different byte values in a graphically distinguishable form, all you actually need is 256 Unicode code points. Problem solved.

ASCII codes 33-127 are a no-brainer, and Unicode code points 160-255 are also good candidates for representing themselves, but you might want to exclude a few which are hard to distinguish (if you want OCR or humans to handle them reliably, áåä etc. might be too similar). Pick the rest from the set of code points which can be represented in two bytes -- quite a large set, but again, many of them are graphically indistinguishable from other glyphs in most renderings.

This scheme does not attempt any form of compression. I imagine you'd get better results by compressing your data prior to encoding it if that's an issue.
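As a rough illustration of the scheme (my own untested sketch for Python 2.7; the alphabet below is arbitrary and makes no attempt to exclude the hard-to-distinguish glyphs discussed above):

# Map each of the 256 byte values to one printable Unicode code point.
ALPHABET = (
    [unichr(c) for c in range(33, 128)] +      # ASCII 33-127: 95 symbols
    [unichr(c) for c in range(0xA0, 0x100)] +  # U+00A0-U+00FF: 96 symbols
    [unichr(0x100 + i) for i in range(65)]     # 65 more two-byte code points
)
REVERSE = dict((ch, i) for i, ch in enumerate(ALPHABET))

def encode(data):
    return u''.join(ALPHABET[b] for b in bytearray(data)).encode('utf8')

def decode(text):
    return bytes(bytearray(REVERSE[ch] for ch in text.decode('utf8')))

Each input byte then costs one or two bytes of UTF-8 output, depending on where its symbol falls in the alphabet.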

1 Comment

Oops, this is pretty similar to Ned's answer, but it addresses some issues with his proposal.
