How do we shrink/encode a 20-letter string down to 6 letters? I found a few algorithms that address data compression, like RLE, arithmetic coding, and universal codes, but none of them guarantees 6 letters.

The original string can contain the characters A-Z (upper case), 0-9, and a dash.

  • If you want lossless encoding, it's impossible. There are 128^20 possible ASCII strings of length 20, but only 128^6 strings of length 6. There's no way you can cram the first set into the second (see the counting sketch after these comments). Commented Dec 24, 2013 at 18:08
  • Any restrictions on the type of 20-letter strings? Commented Dec 24, 2013 at 18:08
  • It's not possible to guarantee this. You can only compress strings that have some kind of repetition that can be encoded. Commented Dec 24, 2013 at 18:08
  • 1
    Oops, did I? Well, even so, the first number is bigger than the second, so my original point is still valid. Recommended reading: pigeonhole principle, in particular the bit that says, "any lossless compression algorithm, provided it makes some inputs smaller (as the name compression suggests), will also make some other inputs larger." Commented Dec 24, 2013 at 18:11
  • 1
    @shoover I'm pretty sure OP wants something reversible (the usual meaning of the term "encode"), which is rather distinct from this sort of hash function... Commented Dec 24, 2013 at 18:54
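
The counting argument in these comments is easy to check directly. A minimal Python sketch (assuming, as the first comment does, a full 128-character ASCII alphabet on both sides):

```python
# Pigeonhole check: there are vastly more length-20 ASCII strings
# than length-6 ones, so no lossless scheme can map every 20-char
# input to a distinct 6-char output.
inputs  = 128 ** 20   # possible ASCII strings of length 20
outputs = 128 ** 6    # possible ASCII strings of length 6

print(f"{inputs:.2e} inputs vs {outputs:.2e} outputs")
assert inputs > outputs   # no injective mapping can exist
```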

1 Answer

If your goal is to losslessly compress an arbitrary input string of 20 characters (each character one of [A-Z], [0-9], or -) into an output string of 6 characters, it's theoretically impossible.

In information theory, given a discrete random variable X with possible values x_1, ..., x_n, the Shannon entropy H(X) is defined as:

H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)

where p(x_i) is the probability that X = x_i. In your case, X is a string of 20 characters drawn from a 37-character alphabet, so it can take any of n = 37^20 values. Supposing all 37 characters are equally likely (i.e. the input string is random), p(x_i) = 1/37^20 for every i. The Shannon entropy of the input is then:

H(X) = -\sum_{i=1}^{37^{20}} \frac{1}{37^{20}} \log_2 \frac{1}{37^{20}} = 20 \log_2 37 \approx 104.19 \text{ bits}

A char on a common computer holds 8 bits, so 6 chars can hold 48 bits. There is no way to fit about 104 bits of information into 6 chars; you would need at least ⌈104.19 / 8⌉ = 14 chars instead.
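
To make the arithmetic concrete, here is the same calculation as a Python sketch:

```python
import math

alphabet_size = 37   # A-Z, 0-9 and the dash
length = 20

# Entropy of a uniformly random 20-char string over a 37-char alphabet.
bits = length * math.log2(alphabet_size)
print(f"entropy: {bits:.2f} bits")              # ≈ 104.19 bits
print("min 8-bit chars:", math.ceil(bits / 8))  # 14
```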


If you do allow loss and want to hash the 20 chars down to 6 chars, then you are trying to map 37^20 values onto 128^6 keys. It can be done, but you will get plenty of hash collisions.
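
A quick back-of-the-envelope check of the collision load, again as a Python sketch:

```python
inputs = 37 ** 20    # possible 20-char input strings
keys   = 128 ** 6    # possible 6-char ASCII hash keys

# Average number of inputs per key under a perfectly uniform hash.
print(f"{inputs / keys:.2e}")   # ≈ 5.26e+18
```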

In your case, even supposing the hash is perfectly uniform (anything less would be worse), each input value would share its hash key with about 5.26 × 10^18 other input values on average. By a birthday attack, we could expect to find a collision within roughly 2.6 million trials (about 1.25·√(128^6)), which a common laptop can do in seconds. So I don't think this would be a safe hash.
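
To illustrate, here is a sketch of such a birthday search. The 6-char hash is a hypothetical stand-in (SHA-256 truncated to 42 bits, the size of the 128^6 key space); any hash of that width behaves similarly. Expect it to stop after a couple of million trials:

```python
import hashlib
import random
import string

ALPHABET = string.ascii_uppercase + string.digits + "-"

def hash42(s):
    # Hypothetical 42-bit hash: SHA-256 truncated to 42 bits,
    # matching the 128**6 = 2**42 key space discussed above.
    digest = hashlib.sha256(s.encode()).digest()
    return int.from_bytes(digest[:6], "big") >> 6   # keep 42 of 48 bits

def find_collision():
    # Draw random 20-char inputs until two distinct ones share a key.
    seen = {}
    trials = 0
    while True:
        s = "".join(random.choices(ALPHABET, k=20))
        trials += 1
        h = hash42(s)
        if h in seen and seen[h] != s:
            return seen[h], s, trials
        seen[h] = s

a, b, n = find_collision()
print(f"collision after {n:,} trials:\n  {a}\n  {b}")
```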

However, if you insist on doing that, you might want to read about hash function algorithms; there are plenty of them to choose from. Good luck!
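
For completeness, one hypothetical way to derive such a lossy 6-character key, keeping the output in the question's own 37-character alphabet (37^6 ≈ 2.6 billion keys, even fewer than the 128^6 discussed above; the function name and the choice of SHA-256 are illustrative, not prescribed):

```python
import hashlib
import string

BASE37 = string.ascii_uppercase + string.digits + "-"

def hash_to_6_chars(s):
    # Hash the input, then re-encode the digest as 6 base-37 symbols.
    # Lossy by necessity: a huge number of inputs map to each key.
    n = int.from_bytes(hashlib.sha256(s.encode()).digest(), "big")
    return "".join(BASE37[(n // 37**i) % 37] for i in range(6))

print(hash_to_6_chars("ABCDEFGHI-0123456789"))   # a deterministic 6-char key
```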

2 Comments

Did OP state a requirement for losslessness?
@shoover I am editing to consider the case of loss :)
