
I have about 42,000 lists of 24 random numbers, all in the range [0, 255]. For example, the first list might be [32, 15, 26, 27, ... 11] and the second might be [44, 44, 18, 19, ... 113]. How can I choose one number from each list (so that I end up with a new list of about 42,000 numbers) such that this new list is as compressible as possible using ZIP?

Tags: math, data-compression

3 Comments
  • Is this a homework problem by any chance? Please add the homework tag if so. Commented Jan 10, 2011 at 21:09
  • No. I am trying to solve the AMillionRandomDigits.bin compression challenge at marknelson.us/2006/06/20/million-digit-challenge (see my pete6 post). Commented Jan 10, 2011 at 23:04
  • You can't compress truly random data. Commented Jan 11, 2011 at 1:18

2 Answers


The ZIP file format uses DEFLATE as its compression algorithm, so you need to consider how that algorithm works and pick data it finds easy to compress. According to the Wikipedia article, there are two stages of compression. The first uses LZ77 to find repeated sections of data and replace them with short back-references. The second uses Huffman coding to strip redundancy out of the remaining data across the whole block. This is entropy coding: if the data isn't very random (has low entropy), the coder assigns short codes to the common symbols, so the output packs more information into fewer bits.

In general, then, lists with lots of repeated runs (e.g., [111, 2, 44, 93, 111, 2, 44, 93, ...]) will compress well in the first stage. Lists where a few numbers recur among otherwise random values (e.g., [111, 34, 43, 50, 111, 34, 111, 111, 2, 34, 22, 60, 111, 98, 2], where 34 and 111 show up often) will compress well in the second stage. The sketch below shows the difference.
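To see both stages at work, here is a small sketch using Python's zlib module (zlib implements DEFLATE, the same algorithm ZIP uses). The three test sequences are made-up illustrations, not data from the question:

    import random
    import zlib

    random.seed(0)

    # Run-heavy: the same 4-byte pattern repeated (LZ77-friendly).
    runs = bytes([111, 2, 44, 93] * 250)

    # Skewed: a couple of values dominate among otherwise random bytes
    # (Huffman-friendly). Roughly a third of the draws are 111 or 34.
    pool = [111] * 60 + [34] * 60 + list(range(256))
    skewed = bytes(random.choice(pool) for _ in range(1000))

    # Uniform: random bytes, so neither stage has much to work with.
    uniform = bytes(random.randrange(256) for _ in range(1000))

    for name, data in [("runs", runs), ("skewed", skewed), ("uniform", uniform)]:
        print(name, len(data), "->", len(zlib.compress(data, 9)), "bytes")

On a typical run the repeated pattern shrinks to a few dozen bytes, the skewed data shrinks noticeably, and the uniform data actually grows slightly.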

To find suitable numbers, I think the easiest thing to do is to sort each list, then merge them, keeping the merge sorted, until you reach 42,000 output numbers. You'll get runs as they happen. This won't be optimal: you might have the number 255 in every input list, and this technique would miss that. But it is easy.
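A literal sketch of that idea, assuming the data is held as a list of Python lists; note that it postpones the one-per-list constraint, which is picked up again below:

    import heapq
    from itertools import islice

    def merged_sample(lists, n=42000):
        """Sort each list, k-way merge the results, and keep the first n
        values. The output is sorted, so equal values fall into runs; it
        does not enforce picking exactly one number per list."""
        sorted_lists = [sorted(lst) for lst in lists]
        return list(islice(heapq.merge(*sorted_lists), n))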

Another approach would be to histogram the numbers into 256 bins. Any bins that stand out indicate numbers that should be grouped. After that, I guess you have to search for sequences. Again, sorting the inputs will probably make this easier.
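A sketch of the histogram step; the choice of 1.5 times the mean bin count as the threshold for a bin that "stands out" is arbitrary:

    from collections import Counter

    def standout_values(lists, factor=1.5):
        """Histogram every value into 256 bins and return the values
        whose counts exceed factor times the mean bin count."""
        counts = Counter(x for lst in lists for x in lst)
        mean = sum(counts.values()) / 256
        return {v for v, c in counts.items() if c > factor * mean}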

I just noticed you have the constraint that you must pick one number from each list. So in both cases you could sort each list and then remove duplicates.

Additionally, Huffman codes can be generated using a tree, so I wonder if there's some magic tree structure you could put the numbers into that would automatically give the right answer.




This smells NP-complete to me, though I am nowhere near able to prove it. On the face of it, there are approximately 7.45e+57968 (!) possible configurations to test, and it doesn't seem that you can rule out a particular configuration early, since an incompressible initial section could be followed by a highly compressible one.
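To check that count: there are 24 choices in each of 42,000 lists, so 24^42000 configurations, and log10(24^42000) = 42000 × log10(24) ≈ 42000 × 1.38021 ≈ 57968.9, i.e. roughly 7.4 × 10^57968.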

My best guess for "good" compression would be to count the occurrences of each number across the entire set (42,000 lists × 24 numbers ≈ one million values) and then, from each list, select the number with the most occurrences overall. For example, if every list happened to contain 42, selecting it everywhere would give you a highly compressible array of 42,000 copies of the same value.
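A minimal sketch of that greedy rule, assuming the input is a list of 42,000 lists of 24 ints:

    from collections import Counter

    def greedy_pick(lists):
        """From each list, pick the value that occurs most often across
        all lists combined. Counts are computed once; max() breaks ties
        by taking the first qualifying element in each list."""
        counts = Counter(x for lst in lists for x in lst)
        return [max(lst, key=lambda v: counts[v]) for lst in lists]

Since every value fits in a byte, len(zlib.compress(bytes(greedy_pick(lists)))) gives a quick score for comparing this heuristic against alternatives.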

