Algorithm to Map Strings to Short Replacements

Question

I'm looking at ways to deterministically replace unique strings with unique and optimally short replacements. So I have a finite set of strings, and the best compression I could achieve so far is through an enumeration algorithm, where I order the input set and then replace the strings with an enumeration of char strings over an extended alphabet (a..z, A...Z, aa...zz, aA... zZ, a0...z9, Aa..., aaa...zaa, aaA...zaaA, ....).

This works wonderfully as far as compression is concerned, but has the severe drawback that it is not atomic on any given input string. Rather, its result depends on knowing all input strings right from the start, and on the ordering of the input set.

Anybody knows of an algorithm that has similar compression but doesn't require knowing all input strings upfront?! Hashing for example would not work for me, as depending on the size of the input set I'd need a hash length of 8-12 for the hashes to be unique, and that would be too long as replacements (currently, the replacement strings are 1-3 chars long for my use cases (<10,000 input strings)). Also, if theoreticians among us know this is wasted effort, I would be interested to hear :-) .

From what alphabet are the characters of the possible inputs drawn? eg just lower case letters; upper and lower case letter; alphanumerics; etc. Also, I think you mean 'deterministic' where you have 'atomic'. — AakashM
– AakashM, Commented Feb 23, 2011 at 17:58
Unless you give more details about the type input strings it will be hard to answer. There cannot be a generic algorithm which works on individual strings without collisions. Consider a huge file as a single string. Now you are trying to represent that using just 3 bytes... — Aryabhatta
– Aryabhatta, Commented Feb 23, 2011 at 18:11
@AakashM The input strings are basically (?u)[a-zA-z_$][\.\w$]*, so unicode alphanums with a few extra chars. With 'atomic' I meant that I can't compute the replacement for a given input string on its own and get away with it, because, yes, it would not be deterministic. — ThomasH
– ThomasH, Commented Feb 23, 2011 at 18:11

mbeckish · Accepted Answer · 2011-02-23 18:05:33Z

1

You could use your enumeration scheme, but sorted by the order in which you first encounter the input strings.

For example, the first string you ever process can be mapped to "a". The next distinct string would be mapped to "b", etc.

Every time you process a string, you'd need to look it up to see if it has already been mapped.

answered Feb 23, 2011 at 18:05

mbeckish

10.6k5 gold badges34 silver badges59 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

ThomasH Over a year ago

Oh yes, the lookup relaxes on the ordering issue. Thanks!

Rafe · Accepted Answer · 2011-02-23 23:09:25Z

1

"Optimally short" depends on the population of strings from which your samples are drawn. In the absence of systematic redundancy in the population, you will find that only a fraction of arbitrary strings can be compressed at all (e.g., consider trying to compress random bit strings).

If you can make assumptions about your data, such as "the strings are expected to be mainly composed of English words" then you can do something simple and effective based on letter frequency (e.g., for English, the relative frequency order is something like ETAOINSHRDLUGCY..., so you would want to use fewer bits to represent Es and more bits to represent uncommon letters like Q).

Cheers.

answered Feb 23, 2011 at 23:09

Rafe

5,3353 gold badges26 silver badges26 bronze badges

3 Comments

ThomasH Over a year ago

Thanks, but it's not about an encoding that has to be decoded at some point (Maybe I should have avoided the term 'compression'). It's really about an bijective mapping from strings to (nearly) arbitrary short strings. I guess typical string compression algorithms would leave me with replacements that are much longer than 1-3 chars.

Rafe Over a year ago

@ThomasH -- Er, a bijection from arbitrary long strings to short strings is compression!

ThomasH Over a year ago

Agreed :). It's just that people often think of it as a process that has to be reversed at some point (aka "uncompress"), which is not what I need.

Collectives™ on Stack Overflow

Algorithm to Map Strings to Short Replacements

2 Answers 2

1 Comment

3 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

1 Comment

3 Comments

Your Answer

Sign up or log in

Post as a guest

Related