
I have a stream of binary data. Assume no prior knowledge about the expected pattern in the input data.

The symbols can represent binary data or other symbols, hence hierarchical.

The output should minimize space, but it does not need to be optimal. The algorithm must be online: as more input arrives, the representation needs to adapt. Approximation is allowed, and is in fact desirable, if it can be controlled by a parameter that trades off accuracy of representation against update runtime and space usage.

Example: 0001110011001100 => ABCDCDC => ABEEC

000 => A

111 => B

00 => C

11 => D

CD => E
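The question does not name an algorithm, but the example's hierarchical substitution resembles grammar-based compression in the spirit of Re-Pair or Sequitur. Below is a minimal offline sketch (both algorithms have online variants) that greedily folds the most frequent adjacent pair of symbols into a new rule; all function names are illustrative, and the symbols it invents will differ from the A–E in the example.

```python
from collections import Counter

def most_common_pair(seq):
    """Return the most frequent adjacent symbol pair, or None if no pair repeats."""
    pairs = Counter(zip(seq, seq[1:]))
    if not pairs:
        return None
    pair, count = pairs.most_common(1)[0]
    return pair if count > 1 else None

def replace_pair(seq, pair, symbol):
    """Replace every non-overlapping occurrence of `pair` with `symbol`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def build_grammar(bits, max_rules=10):
    """Greedily fold the most frequent pair into a fresh symbol, up to max_rules times.

    max_rules is the kind of knob the question asks for: more rules means a
    smaller sequence but more update work and rule storage.
    """
    seq = list(bits)
    rules = {}
    next_symbol = iter("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
    for _ in range(max_rules):
        pair = most_common_pair(seq)
        if pair is None:
            break
        sym = next(next_symbol)
        rules[sym] = pair
        seq = replace_pair(seq, pair, sym)
    return seq, rules

def expand(seq, rules):
    """Expand symbols back to the original bit string (inverse of build_grammar)."""
    out = []
    for s in seq:
        if s in rules:
            out.extend(expand(rules[s], rules))
        else:
            out.append(s)
    return "".join(out)
```

Expansion always reconstructs the input exactly, so this sketch is lossless; a lossy variant would additionally merge rules that are "close enough" under some distance the application defines.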

  • Lossless compression algorithms? Commented Mar 16, 2017 at 0:03
  • I think there are some very subtle but important differences. One of them is that in my case the input data is not quantized; it is not a series of 8-bit groups. In fact, how the input bit stream should be grouped is one of the primary questions. Commented Mar 16, 2017 at 2:39
  • The second key difference is that I do not need the compression or encoding to be lossless. In fact, I want to favor some amount of loss for massive representational gains. It is a much more complex requirement than calculating a mathematically optimal encoding from the point of view of space usage only. Commented Mar 16, 2017 at 2:41
  • We shove such data into 8 bit groups, not because the data naturally groups at 8 bits, but because of the kind of computer hardware we use to work with it. 8 bits is the smallest addressable size. Currently 64 is the popular word size, the number of bits you get with each bus clock tick. If this is some pure theoretical exercise you're free to ignore all that but if you want to make something real consider the hardware you'll be using. Commented Mar 16, 2017 at 11:51
  •
    It's actually impossible to choose what acceptable loss is without prior knowledge about the expected pattern. Mp3 is lossy but takes advantage of the known limitations of the human ear. Mpeg is lossy but takes advantage of the known limitations of the human eye. You have to know something about how this data will be used to figure out what you can lose. Commented Mar 16, 2017 at 11:54

1 Answer


Assuming that lossless coding is what you want, this sounds like a good use case for Huffman coding: https://en.wikipedia.org/wiki/Huffman_coding

Huffman coding is a nearly optimal prefix code. It uses shorter bit patterns for more commonly occurring symbols, thus minimizing the amount of space required. Of course, you need to transmit the Huffman tree separately, so Huffman coding does not always compress your data.
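To make the construction concrete, here is a minimal sketch of static Huffman coding using Python's `heapq`: repeatedly merge the two lowest-frequency subtrees, prepending a bit to each contained symbol's code. All names are illustrative.

```python
import heapq
from collections import Counter

def huffman_code(data):
    """Build a prefix code table: shorter codes for more frequent symbols."""
    freq = Counter(data)
    # Heap entries: (frequency, tiebreaker, {symbol: code-so-far}).
    # The integer tiebreaker prevents Python from ever comparing the dicts.
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    if len(heap) == 1:  # degenerate case: a single distinct symbol
        return {sym: "0" for sym in heap[0][2]}
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        # Merging prepends a bit, so deeper (rarer) symbols get longer codes.
        merged = {s: "0" + c for s, c in t1.items()}
        merged.update({s: "1" + c for s, c in t2.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

def encode(data, table):
    """Concatenate each symbol's code into one bit string."""
    return "".join(table[s] for s in data)
```

For example, `huffman_code("AAAABBC")` assigns a 1-bit code to `A` and 2-bit codes to `B` and `C`, and no code is a prefix of another, which is what makes the bit stream decodable without separators.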

If you want minor gains over what Huffman coding can achieve, you can use arithmetic coding (https://en.wikipedia.org/wiki/Arithmetic_coding), but I don't believe the minor gains are worth the extra complexity.
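For intuition, here is a toy arithmetic coder using exact `Fraction` arithmetic: each symbol narrows an interval in [0, 1) in proportion to its probability, and any number in the final interval identifies the whole message. Real coders use fixed-precision integers with renormalization instead; this sketch, with illustrative names, only shows the idea.

```python
from fractions import Fraction
from collections import Counter

def intervals(freq):
    """Assign each symbol a cumulative-probability interval [low, high)."""
    total = sum(freq.values())
    out, low = {}, Fraction(0)
    for sym, f in sorted(freq.items()):  # sorted: encoder/decoder must agree
        high = low + Fraction(f, total)
        out[sym] = (low, high)
        low = high
    return out

def arith_encode(msg, freq):
    """Narrow [low, high) once per symbol; return a number inside the result."""
    low, high = Fraction(0), Fraction(1)
    table = intervals(freq)
    for sym in msg:
        s_low, s_high = table[sym]
        width = high - low
        low, high = low + width * s_low, low + width * s_high
    return (low + high) / 2  # any value in [low, high) would do

def arith_decode(value, freq, length):
    """Replay the narrowing: at each step, see which interval contains value."""
    low, high = Fraction(0), Fraction(1)
    table = intervals(freq)
    out = []
    for _ in range(length):
        width = high - low
        target = (value - low) / width
        for sym, (s_low, s_high) in table.items():
            if s_low <= target < s_high:
                out.append(sym)
                low, high = low + width * s_low, low + width * s_high
                break
    return "".join(out)
```

Note that the model (`freq`) must be shared, just as the Huffman tree must be; adaptive variants sidestep this by having both sides update identical counts as they go.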

For lossy coding, answering the question would be practically impossible, as CandiedOrange noted in a comment.

  •
    Adaptive Huffman coding may be a better fit for the description, as it avoids the need to precompute and share the code tree. Commented Mar 16, 2017 at 19:06
  • Agreed with CandiedOrange. I need to go back to the whiteboard and define what loss is acceptable. Commented Mar 16, 2017 at 20:41
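The key idea behind adaptive coding mentioned above is that the encoder and decoder update identical symbol statistics as data flows, so no code table is ever transmitted. This sketch is not the FGK/Vitter adaptive Huffman algorithm; it is a much simpler stand-in (unary coding of each symbol's current frequency rank, all names illustrative) that demonstrates the same synchronization trick.

```python
from collections import Counter

def ranking(counts, alphabet):
    """Deterministic order: by descending count, ties broken alphabetically."""
    return sorted(alphabet, key=lambda s: (-counts[s], s))

def adaptive_encode(msg, alphabet):
    counts, bits = Counter(), []
    for sym in msg:
        r = ranking(counts, alphabet).index(sym)
        bits.append("1" * r + "0")  # unary code for the symbol's current rank
        counts[sym] += 1            # decoder makes the same update: no table sent
    return "".join(bits)

def adaptive_decode(bits, length, alphabet):
    counts, out, i = Counter(), [], 0
    for _ in range(length):
        r = 0
        while bits[i] == "1":       # count 1s to recover the rank
            r, i = r + 1, i + 1
        i += 1                      # consume the terminating '0'
        sym = ranking(counts, alphabet)[r]
        out.append(sym)
        counts[sym] += 1            # mirror the encoder's update exactly
    return "".join(out)
```

Because both sides start from empty counts and apply identical updates, the decoder stays in sync with the encoder's evolving model, which is exactly what makes the scheme online. Adaptive Huffman replaces the crude unary code with an incrementally maintained Huffman tree for much better compression.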
