
I have a stream of binary data. Assume no prior knowledge about the expected pattern in the input data.

The symbols can represent binary data or other symbols, hence hierarchical.

The output should minimize space, but it does not need to be optimal. The algorithm must be online: as more input arrives, the representation needs to adapt. Approximation is allowed, and is in fact desirable, if it can be controlled by a parameter that trades off accuracy of representation against update runtime and space usage.

Example: 0001110011001100 => ABCDCDC => ABEEC

000 => A

111 => B

00 => C

11 => D

CD => E
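The question does not name an algorithm, but the example's hierarchical substitution resembles grammar-based compression in the spirit of Re-Pair or Sequitur. Below is a minimal offline sketch (both algorithms have online variants) that greedily folds the most frequent adjacent pair of symbols into a new rule; all function names are illustrative, and the symbols it invents will differ from the A–E in the example.

```python
from collections import Counter

def most_common_pair(seq):
    """Return the most frequent adjacent symbol pair, or None if no pair repeats."""
    pairs = Counter(zip(seq, seq[1:]))
    if not pairs:
        return None
    pair, count = pairs.most_common(1)[0]
    return pair if count > 1 else None

def replace_pair(seq, pair, symbol):
    """Replace every non-overlapping occurrence of `pair` with `symbol`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def build_grammar(bits, max_rules=10):
    """Greedily fold the most frequent pair into a fresh symbol, up to max_rules times.

    max_rules is the kind of knob the question asks for: more rules means a
    smaller sequence but more update work and rule storage.
    """
    seq = list(bits)
    rules = {}
    next_symbol = iter("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
    for _ in range(max_rules):
        pair = most_common_pair(seq)
        if pair is None:
            break
        sym = next(next_symbol)
        rules[sym] = pair
        seq = replace_pair(seq, pair, sym)
    return seq, rules

def expand(seq, rules):
    """Expand symbols back to the original bit string (inverse of build_grammar)."""
    out = []
    for s in seq:
        if s in rules:
            out.extend(expand(rules[s], rules))
        else:
            out.append(s)
    return "".join(out)
```

Expansion always reconstructs the input exactly, so this sketch is lossless; a lossy variant would additionally merge rules that are "close enough" under some distance the application defines.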

  • Lossless compression algorithms? Commented Mar 16, 2017 at 0:03
  • I think there are some very subtle but important differences. One of them is that in my case the input data is not quantized; it is not a series of 8-bit groups. In fact, how the input bit stream should be grouped is one of the primary questions. Commented Mar 16, 2017 at 2:39
  • The second key difference is that I do not need the compression or encoding to be lossless. In fact, I want to favor some amount of loss for massive representational gains. It is a much more complex requirement than calculating a mathematically optimal encoding from the point of view of space usage only. Commented Mar 16, 2017 at 2:41
  • We shove such data into 8 bit groups, not because the data naturally groups at 8 bits, but because of the kind of computer hardware we use to work with it. 8 bits is the smallest addressable size. Currently 64 is the popular word size, the number of bits you get with each bus clock tick. If this is some pure theoretical exercise you're free to ignore all that but if you want to make something real consider the hardware you'll be using. Commented Mar 16, 2017 at 11:51
  •
    It's actually impossible to choose what acceptable loss is without prior knowledge about the expected pattern. Mp3 is lossy but takes advantage of the known limitations of the human ear. Mpeg is lossy but takes advantage of the known limitations of the human eye. You have to know something about how this data will be used to figure out what you can lose. Commented Mar 16, 2017 at 11:54

1 Answer


Assuming that lossless coding is what you want, this sounds like a good use case for Huffman coding: https://en.wikipedia.org/wiki/Huffman_coding

Huffman coding is a nearly optimal prefix code. It uses shorter bit patterns for more commonly occurring symbols, thus minimizing the amount of space required. Of course, you need to transmit the Huffman tree separately, so Huffman coding does not always compress your data.
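To make the construction concrete, here is a minimal sketch of static Huffman coding using Python's `heapq`: repeatedly merge the two lowest-frequency subtrees, prepending a bit to each contained symbol's code. All names are illustrative.

```python
import heapq
from collections import Counter

def huffman_code(data):
    """Build a prefix code table: shorter codes for more frequent symbols."""
    freq = Counter(data)
    # Heap entries: (frequency, tiebreaker, {symbol: code-so-far}).
    # The integer tiebreaker prevents Python from ever comparing the dicts.
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    if len(heap) == 1:  # degenerate case: a single distinct symbol
        return {sym: "0" for sym in heap[0][2]}
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        # Merging prepends a bit, so deeper (rarer) symbols get longer codes.
        merged = {s: "0" + c for s, c in t1.items()}
        merged.update({s: "1" + c for s, c in t2.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

def encode(data, table):
    """Concatenate each symbol's code into one bit string."""
    return "".join(table[s] for s in data)
```

For example, `huffman_code("AAAABBC")` assigns a 1-bit code to `A` and 2-bit codes to `B` and `C`, and no code is a prefix of another, which is what makes the bit stream decodable without separators.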

If you want minor gains over what Huffman coding can achieve, you can use arithmetic coding (https://en.wikipedia.org/wiki/Arithmetic_coding), but I don't believe the minor gains are worth the extra complexity.
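For intuition, here is a toy arithmetic coder using exact `Fraction` arithmetic: each symbol narrows an interval in [0, 1) in proportion to its probability, and any number in the final interval identifies the whole message. Real coders use fixed-precision integers with renormalization instead; this sketch, with illustrative names, only shows the idea.

```python
from fractions import Fraction
from collections import Counter

def intervals(freq):
    """Assign each symbol a cumulative-probability interval [low, high)."""
    total = sum(freq.values())
    out, low = {}, Fraction(0)
    for sym, f in sorted(freq.items()):  # sorted: encoder/decoder must agree
        high = low + Fraction(f, total)
        out[sym] = (low, high)
        low = high
    return out

def arith_encode(msg, freq):
    """Narrow [low, high) once per symbol; return a number inside the result."""
    low, high = Fraction(0), Fraction(1)
    table = intervals(freq)
    for sym in msg:
        s_low, s_high = table[sym]
        width = high - low
        low, high = low + width * s_low, low + width * s_high
    return (low + high) / 2  # any value in [low, high) would do

def arith_decode(value, freq, length):
    """Replay the narrowing: at each step, see which interval contains value."""
    low, high = Fraction(0), Fraction(1)
    table = intervals(freq)
    out = []
    for _ in range(length):
        width = high - low
        target = (value - low) / width
        for sym, (s_low, s_high) in table.items():
            if s_low <= target < s_high:
                out.append(sym)
                low, high = low + width * s_low, low + width * s_high
                break
    return "".join(out)
```

Note that the model (`freq`) must be shared, just as the Huffman tree must be; adaptive variants sidestep this by having both sides update identical counts as they go.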

For lossy coding, answering the question would be practically impossible, as CandiedOrange noted in a comment.

  •
    Adaptive Huffman coding may be a better fit for the description, as it avoids the need to precompute and share the code tree. Commented Mar 16, 2017 at 19:06
  • Agreed with CandiedOrange. I need to go back to the whiteboard and define what loss is acceptable. Commented Mar 16, 2017 at 20:41
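The key idea behind adaptive coding mentioned above is that the encoder and decoder update identical symbol statistics as data flows, so no code table is ever transmitted. This sketch is not the FGK/Vitter adaptive Huffman algorithm; it is a much simpler stand-in (unary coding of each symbol's current frequency rank, all names illustrative) that demonstrates the same synchronization trick.

```python
from collections import Counter

def ranking(counts, alphabet):
    """Deterministic order: by descending count, ties broken alphabetically."""
    return sorted(alphabet, key=lambda s: (-counts[s], s))

def adaptive_encode(msg, alphabet):
    counts, bits = Counter(), []
    for sym in msg:
        r = ranking(counts, alphabet).index(sym)
        bits.append("1" * r + "0")  # unary code for the symbol's current rank
        counts[sym] += 1            # decoder makes the same update: no table sent
    return "".join(bits)

def adaptive_decode(bits, length, alphabet):
    counts, out, i = Counter(), [], 0
    for _ in range(length):
        r = 0
        while bits[i] == "1":       # count 1s to recover the rank
            r, i = r + 1, i + 1
        i += 1                      # consume the terminating '0'
        sym = ranking(counts, alphabet)[r]
        out.append(sym)
        counts[sym] += 1            # mirror the encoder's update exactly
    return "".join(out)
```

Because both sides start from empty counts and apply identical updates, the decoder stays in sync with the encoder's evolving model, which is exactly what makes the scheme online. Adaptive Huffman replaces the crude unary code with an incrementally maintained Huffman tree for much better compression.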
