How do we shrink/encode a 20-letter string down to 6 letters? I found a few algorithms that address data compression, like RLE, arithmetic coding, and universal codes, but none of them guarantees 6 letters.

The original string can contain the characters A-Z (upper case), 0-9, and a dash.

  • If you want lossless encoding, it's impossible. There are 128^20 possible ASCII strings of length 20, but only 128^6 strings of length 6. There's no way you can cram the first set into the second (see the counting sketch after these comments). Commented Dec 24, 2013 at 18:08
  • Any restrictions on the type of 20-letter strings? Commented Dec 24, 2013 at 18:08
  • It's not possible to guarantee this. You can only compress strings that have some kind of repetition that can be encoded. Commented Dec 24, 2013 at 18:08
  • 1
    Oops, did I? Well, even so, the first number is bigger than the second, so my original point is still valid. Recommended reading: pigeonhole principle, in particular the bit that says, "any lossless compression algorithm, provided it makes some inputs smaller (as the name compression suggests), will also make some other inputs larger." Commented Dec 24, 2013 at 18:11
  • 1
    @shoover I'm pretty sure OP wants something reversible (the usual meaning of the term "encode"), which is rather distinct from this sort of hash function... Commented Dec 24, 2013 at 18:54
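
The counting argument in these comments is easy to check directly. A minimal Python sketch (assuming, as the first comment does, a full 128-character ASCII alphabet on both sides):

```python
# Pigeonhole check: there are vastly more length-20 ASCII strings
# than length-6 ones, so no lossless scheme can map every 20-char
# input to a distinct 6-char output.
inputs  = 128 ** 20   # possible ASCII strings of length 20
outputs = 128 ** 6    # possible ASCII strings of length 6

print(f"{inputs:.2e} inputs vs {outputs:.2e} outputs")
assert inputs > outputs   # no injective mapping can exist
```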

1 Answer

If your goal is to losslessly compress an arbitrary input string of 20 characters (each character one of [A-Z], [0-9], or -) into an output string of 6 characters, it's theoretically impossible.

In information theory, given a discrete random variable X with possible values x_1, ..., x_n, the Shannon entropy H(X) is defined as:

H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)

where p(x_i) is the probability that X = x_i. In your case, X is a string of 20 characters drawn from a 37-character alphabet, so it can take any of n = 37^20 values. Supposing all 37 characters are equally likely (i.e. the input string is random), p(x_i) = 1/37^20 for every i. The Shannon entropy of the input is then:

H(X) = -\sum_{i=1}^{37^{20}} \frac{1}{37^{20}} \log_2 \frac{1}{37^{20}} = 20 \log_2 37 \approx 104.19 \text{ bits}

A char on a common computer holds 8 bits, so 6 chars can hold 48 bits. There is no way to fit about 104 bits of information into 6 chars; you would need at least ⌈104.19 / 8⌉ = 14 chars instead.
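
To make the arithmetic concrete, here is the same calculation as a Python sketch:

```python
import math

alphabet_size = 37   # A-Z, 0-9 and the dash
length = 20

# Entropy of a uniformly random 20-char string over a 37-char alphabet.
bits = length * math.log2(alphabet_size)
print(f"entropy: {bits:.2f} bits")              # ≈ 104.19 bits
print("min 8-bit chars:", math.ceil(bits / 8))  # 14
```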


If you do allow loss and want to hash the 20 chars down to 6 chars, then you are trying to map 37^20 values onto 128^6 keys. It can be done, but you will get plenty of hash collisions.
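
A quick back-of-the-envelope check of the collision load, again as a Python sketch:

```python
inputs = 37 ** 20    # possible 20-char input strings
keys   = 128 ** 6    # possible 6-char ASCII hash keys

# Average number of inputs per key under a perfectly uniform hash.
print(f"{inputs / keys:.2e}")   # ≈ 5.26e+18
```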

In your case, even supposing the hash is perfectly uniform (anything less would be worse), each input value would share its hash key with about 5.26 × 10^18 other input values on average. By a birthday attack, we could expect to find a collision within roughly 2.6 million trials (about 1.25·√(128^6)), which a common laptop can do in seconds. So I don't think this would be a safe hash.
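
To illustrate, here is a sketch of such a birthday search. The 6-char hash is a hypothetical stand-in (SHA-256 truncated to 42 bits, the size of the 128^6 key space); any hash of that width behaves similarly. Expect it to stop after a couple of million trials:

```python
import hashlib
import random
import string

ALPHABET = string.ascii_uppercase + string.digits + "-"

def hash42(s):
    # Hypothetical 42-bit hash: SHA-256 truncated to 42 bits,
    # matching the 128**6 = 2**42 key space discussed above.
    digest = hashlib.sha256(s.encode()).digest()
    return int.from_bytes(digest[:6], "big") >> 6   # keep 42 of 48 bits

def find_collision():
    # Draw random 20-char inputs until two distinct ones share a key.
    seen = {}
    trials = 0
    while True:
        s = "".join(random.choices(ALPHABET, k=20))
        trials += 1
        h = hash42(s)
        if h in seen and seen[h] != s:
            return seen[h], s, trials
        seen[h] = s

a, b, n = find_collision()
print(f"collision after {n:,} trials:\n  {a}\n  {b}")
```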

However, if you insist on doing that, you might want to read about hash function algorithms; there are plenty of them to choose from. Good luck!
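
For completeness, one hypothetical way to derive such a lossy 6-character key, keeping the output in the question's own 37-character alphabet (37^6 ≈ 2.6 billion keys, even fewer than the 128^6 discussed above; the function name and the choice of SHA-256 are illustrative, not prescribed):

```python
import hashlib
import string

BASE37 = string.ascii_uppercase + string.digits + "-"

def hash_to_6_chars(s):
    # Hash the input, then re-encode the digest as 6 base-37 symbols.
    # Lossy by necessity: a huge number of inputs map to each key.
    n = int.from_bytes(hashlib.sha256(s.encode()).digest(), "big")
    return "".join(BASE37[(n // 37**i) % 37] for i in range(6))

print(hash_to_6_chars("ABCDEFGHI-0123456789"))   # a deterministic 6-char key
```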

2 Comments

Did OP state a requirement for losslessness?
@shoover I am editing to consider the case of loss :)
