
I have about 42,000 lists of 24 random numbers, all in the range [0, 255]. For example, the first list might be [32, 15, 26, 27, ... 11] and the second might be [44, 44, 18, 19, ... 113]. How can I choose one number from each list (so that I end up with a new list of about 42,000 numbers) such that this new list is as compressible as possible using ZIP?

Tags: math, data-compression

3 Comments
  • Is this a homework problem by any chance? Please add the homework tag if so. Commented Jan 10, 2011 at 21:09
  • No. I am trying to solve the AMillionRandomDigits.bin compression challenge at marknelson.us/2006/06/20/million-digit-challenge (see my pete6 post). Commented Jan 10, 2011 at 23:04
  • You can't compress truly random data. Commented Jan 11, 2011 at 1:18

2 Answers


The ZIP file format uses DEFLATE as its compression algorithm, so you need to consider how that algorithm works and pick data it finds easy to compress. According to the Wikipedia article, there are two stages of compression. The first uses LZ77 to find repeated sections of data and replace them with short back-references. The second uses Huffman coding to strip redundancy out of the remaining data across the whole block. This is entropy coding: if the data isn't very random (has low entropy), the coder assigns short codes to the common symbols, so the output packs more information into fewer bits.

In general, then, lists with lots of repeated runs (e.g., [111, 2, 44, 93, 111, 2, 44, 93, ...]) will compress well in the first stage. Lists where a few numbers recur among otherwise random values (e.g., [111, 34, 43, 50, 111, 34, 111, 111, 2, 34, 22, 60, 111, 98, 2], where 34 and 111 show up often) will compress well in the second stage. The sketch below shows the difference.
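To see both stages at work, here is a small sketch using Python's zlib module (zlib implements DEFLATE, the same algorithm ZIP uses). The three test sequences are made-up illustrations, not data from the question:

    import random
    import zlib

    random.seed(0)

    # Run-heavy: the same 4-byte pattern repeated (LZ77-friendly).
    runs = bytes([111, 2, 44, 93] * 250)

    # Skewed: a couple of values dominate among otherwise random bytes
    # (Huffman-friendly). Roughly a third of the draws are 111 or 34.
    pool = [111] * 60 + [34] * 60 + list(range(256))
    skewed = bytes(random.choice(pool) for _ in range(1000))

    # Uniform: random bytes, so neither stage has much to work with.
    uniform = bytes(random.randrange(256) for _ in range(1000))

    for name, data in [("runs", runs), ("skewed", skewed), ("uniform", uniform)]:
        print(name, len(data), "->", len(zlib.compress(data, 9)), "bytes")

On a typical run the repeated pattern shrinks to a few dozen bytes, the skewed data shrinks noticeably, and the uniform data actually grows slightly.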

To find suitable numbers, I think the easiest thing to do is to sort each list, then merge them, keeping the merge sorted, until you reach 42,000 output numbers. You'll get runs as they happen. This won't be optimal: you might have the number 255 in every input list, and this technique would miss that. But it is easy.
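A literal sketch of that idea, assuming the data is held as a list of Python lists; note that it postpones the one-per-list constraint, which is picked up again below:

    import heapq
    from itertools import islice

    def merged_sample(lists, n=42000):
        """Sort each list, k-way merge the results, and keep the first n
        values. The output is sorted, so equal values fall into runs; it
        does not enforce picking exactly one number per list."""
        sorted_lists = [sorted(lst) for lst in lists]
        return list(islice(heapq.merge(*sorted_lists), n))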

Another approach would be to histogram the numbers into 256 bins. Any bins that stand out indicate numbers that should be grouped. After that, I guess you have to search for sequences. Again, sorting the inputs will probably make this easier.
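A sketch of the histogram step; the choice of 1.5 times the mean bin count as the threshold for a bin that "stands out" is arbitrary:

    from collections import Counter

    def standout_values(lists, factor=1.5):
        """Histogram every value into 256 bins and return the values
        whose counts exceed factor times the mean bin count."""
        counts = Counter(x for lst in lists for x in lst)
        mean = sum(counts.values()) / 256
        return {v for v, c in counts.items() if c > factor * mean}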

I just noticed you have the constraint that you must pick one number from each list. So in both cases you could sort each list and then remove duplicates.

Additionally, Huffman codes can be generated using a tree, so I wonder if there's some magic tree structure you could put the numbers into that would automatically give the right answer.




This smells NP-complete to me, though I am nowhere near able to prove it. On the face of it, there are approximately 7.45e+57968 (!) possible configurations to test, and it doesn't seem that you can rule out a particular configuration early, since an incompressible initial section could be followed by a highly compressible one.
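To check that count: there are 24 choices in each of 42,000 lists, so 24^42000 configurations, and log10(24^42000) = 42000 × log10(24) ≈ 42000 × 1.38021 ≈ 57968.9, i.e. roughly 7.4 × 10^57968.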

My best guess for "good" compression would be to count the occurrences of each number across the entire set (42,000 lists × 24 numbers ≈ one million values) and then, from each list, select the number with the most occurrences overall. For example, if every list happened to contain 42, selecting it everywhere would give you a highly compressible array of 42,000 copies of the same value.
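A minimal sketch of that greedy rule, assuming the input is a list of 42,000 lists of 24 ints:

    from collections import Counter

    def greedy_pick(lists):
        """From each list, pick the value that occurs most often across
        all lists combined. Counts are computed once; max() breaks ties
        by taking the first qualifying element in each list."""
        counts = Counter(x for lst in lists for x in lst)
        return [max(lst, key=lambda v: counts[v]) for lst in lists]

Since every value fits in a byte, len(zlib.compress(bytes(greedy_pick(lists)))) gives a quick score for comparing this heuristic against alternatives.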

