I have arbitrary binary data. I need to store it in a system that expects valid UTF8. It will never be interpreted as text, I just need to put it in there and be able to retrieve it and reconstitute my binary data.

Obviously base64 would work, but I can't have that much inflation.

How can I easily achieve this in python 2.7?

  • Question at the margin: Python 2 or 3? Commented Aug 30, 2014 at 14:39
  • What do you intend to do with these bytes? Commented Aug 30, 2014 at 14:40
  • @SylvainLeroux It's 2.7 Commented Aug 30, 2014 at 14:41
  • Is this an accurate summary of what you need? "I have arbitrary binary data. I need to store it in a system that expects valid UTF8. It will never be interpreted as text, I just need to put it in there and be able to retrieve it and reconstitute my binary data." Commented Aug 30, 2014 at 14:51
  • @NobuGames: You need to take into account that control characters in the ASCII range could be interpreted too; that's why it is still safest to stick to the printable range; otherwise you could use a Base128-style encoding to pack 7 data bytes into 8 7-bit characters. Commented Aug 30, 2014 at 15:14

3 Answers

You'll have to express your data using just ASCII characters. Base64 is the most space-efficient method available in the Python standard library for making binary data fit in printable text that is also UTF-8 safe. Sure, it requires 33% more space to express the same data, but other stdlib methods take even more additional space.

You can combine this with compression to limit how much space this is going to take, but make the compression optional (mark the data) and only actually use it if the data is going to be smaller.

import zlib
import base64

def pack_utf8_safe(data):
    """Encode binary data as ASCII text, compressing first when it helps."""
    is_compressed = False
    compressed = zlib.compress(data)
    if len(compressed) < (len(data) - 1):
        # Compression saved space; keep the smaller form.
        data = compressed
        is_compressed = True
    base64_encoded = base64.b64encode(data)
    if is_compressed:
        # '.' is not in the Base64 alphabet, so it safely marks compressed payloads.
        base64_encoded = '.' + base64_encoded
    return base64_encoded

def unpack_utf8_safe(base64_encoded):
    """Reverse pack_utf8_safe(): strip the marker, decode, decompress if needed."""
    decompress = False
    if base64_encoded.startswith('.'):
        base64_encoded = base64_encoded[1:]
        decompress = True
    data = base64.b64decode(base64_encoded)
    if decompress:
        data = zlib.decompress(data)
    return data

The '.' character is not part of the Base64 alphabet, so I used it here to mark compressed data.
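For example, a quick round trip with the functions above (Python 2.7; the sample data is repetitive, so compression is certain to win here):

>>> blob = b'\x00\x01\x02' * 100
>>> packed = pack_utf8_safe(blob)
>>> packed.startswith('.')   # compressed, so the marker is present
True
>>> unpack_utf8_safe(packed) == blob
True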

You could further shave off the 1 or 2 = padding characters from the end of the Base64 encoded data; these can then be re-added when decoding (append '=' * (-len(encoded) % 4)), but I'm not sure that's worth the bother.
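If you do trim the padding, a small sketch of that round trip (for data produced by b64encode, the restored count is always 0, 1 or 2):

def strip_padding(encoded):
    # Trailing '=' carries no information; it only pads to a multiple of 4.
    return encoded.rstrip('=')

def restore_padding(stripped):
    # -len(stripped) % 4 recovers exactly the number of '=' removed.
    return stripped + '=' * (-len(stripped) % 4)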

You can achieve further savings by switching to the Base85 encoding, a 4-to-5 ratio ASCII-safe encoding for binary data, so a 25% overhead. For Python 2.7 this is only available in an external library (Python 3.4 added it to the base64 module). On 2.7 you can use the python-mom project:

from mom.codec import base85

and replace all base64.b64encode() and base64.b64decode() calls with base85.b85encode() and base85.b85decode() calls instead.
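For what it's worth, on Python 3.4 or newer the same idea needs no external dependency; the standard library equivalent (Python 3 only, operating on bytes) would be:

import base64

encoded = base64.b85encode(data)     # 5 ASCII characters for every 4 bytes
original = base64.b85decode(encoded)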

If you are 100% certain nothing along the path is going to treat your data as text (possibly altering line separators, or interpreting and altering other control codes), you could also use Base128 encoding, reducing the overhead to a 14.3% increase (8 characters for every 7 bytes). I cannot, however, recommend a pip-installable Python module for this; there is a GitHub-hosted module, but I have not tested it.
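For illustration only, a self-contained Base128 packer could look like the sketch below (my own untested take on the idea, not the GitHub module mentioned above; it emits raw byte values 0-127, so the control-code caveat applies in full):

def b128encode(data):
    # Pack 8-bit bytes into 7-bit units: every 7 input bytes become 8 output bytes.
    out = bytearray()
    acc, nbits = 0, 0                  # bit accumulator and its current size
    for byte in bytearray(data):
        acc = (acc << 8) | byte
        nbits += 8
        while nbits >= 7:
            nbits -= 7
            out.append((acc >> nbits) & 0x7F)
        acc &= (1 << nbits) - 1        # drop bits already emitted
    if nbits:
        out.append((acc << (7 - nbits)) & 0x7F)  # left-align trailing bits
    return bytes(out)

def b128decode(encoded):
    out = bytearray()
    acc, nbits = 0, 0
    for byte in bytearray(encoded):
        acc = (acc << 7) | byte
        nbits += 7
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
        acc &= (1 << nbits) - 1
    return bytes(out)                  # trailing padding bits are discarded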

14 Comments

Actually, Base128 would be more efficient whilst fulfilling the requirement of using only ASCII characters. There are base128 libraries available (e.g. npmjs.org/package/base128).
@isedev: you'd have to guarantee that control characters are never going to be interpreted anywhere during the lifetime of the encoded data, though.
@isedev: for example, if this data is ever written to disk on a Windows system where the file is opened in text mode (quite common) then all \x0a codepoints (e.g. newlines) will be expanded to CRLF separators (so \x0d\x0a), disastrous for binary data.
Again, this raises the question of how you distinguish between line feed, vertical tab, and carriage return if you really are going to use paper as your medium. Of course, if you can use a font which has clearly distinguishable symbols for these normally nonprinting control characters, all is well.
@N.McA.: I'd do some testing with a binary string with all 255 possible byte values and round-trip that. Don't make assumptions about the QR generating library nor the QR reader not translating control codes here.

You can decode your bytes as ISO 8859-1 data, which will always produce a valid Unicode string. Then you can encode it to UTF-8:

utf8_data = my_bytes.decode('iso8859-1').encode('utf8')
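Recovering the original bytes is the same two steps in reverse:

my_bytes = utf8_data.decode('utf8').encode('iso8859-1')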

On average, half your data will be in the 0-127 range, which is one byte in UTF-8, and half will be in the 128-255 range, which is two bytes in UTF-8, so your result will be 50% larger than your input data.

If there is any structure to your data at all, then zlib-compressing it, as Martijn suggests, might reduce the size.

4 Comments

Base64 encoding only expands the data by 33%; I had considered using latin-1, but Base64 simply beats the odds here.
Hmm, I overlooked that! Base64 is better than UTF-8. :(
Plus Latin-1 has no printable glyphs for code points 128-159, and ASCII codes 0-31 (or even 32) do not conventionally have standard glyphs.
@tripleee: printability is not at issue here; the OP is putting this in a QR code.

If your application really requires you to be able to represent 256 different byte values in a graphically distinguishable form, all you actually need is 256 Unicode code points. Problem solved.

ASCII codes 33-127 are a no-brainer, and Unicode code points 160-255 are also good candidates for representing themselves, but you might want to exclude a few which are hard to distinguish (if you want OCR or humans to handle them reliably, áåä etc. might be too similar). Pick the rest from the set of code points which can be represented in two bytes -- quite a large set, but again, many of them are graphically indistinguishable from other glyphs in most renderings.

This scheme does not attempt any form of compression. I imagine you'd get better results by compressing your data prior to encoding it if that's an issue.
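As a rough illustration of the scheme (my own untested sketch for Python 2.7; the alphabet below is arbitrary and makes no attempt to exclude the hard-to-distinguish glyphs discussed above):

# Map each of the 256 byte values to one printable Unicode code point.
ALPHABET = (
    [unichr(c) for c in range(33, 128)] +      # ASCII 33-127: 95 symbols
    [unichr(c) for c in range(0xA0, 0x100)] +  # U+00A0-U+00FF: 96 symbols
    [unichr(0x100 + i) for i in range(65)]     # 65 more two-byte code points
)
REVERSE = dict((ch, i) for i, ch in enumerate(ALPHABET))

def encode(data):
    return u''.join(ALPHABET[b] for b in bytearray(data)).encode('utf8')

def decode(text):
    return bytes(bytearray(REVERSE[ch] for ch in text.decode('utf8')))

Each input byte then costs one or two bytes of UTF-8 output, depending on where its symbol falls in the alphabet.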

1 Comment

Oops, this is pretty similar to Ned's answer, but it addresses some issues with his proposal.
