1

I'm looking at ways to deterministically replace unique strings with unique and optimally short replacements. So I have a finite set of strings, and the best compression I could achieve so far is through an enumeration algorithm, where I order the input set and then replace the strings with an enumeration of char strings over an extended alphabet (a..z, A...Z, aa...zz, aA... zZ, a0...z9, Aa..., aaa...zaa, aaA...zaaA, ....).

This works wonderfully as far as compression is concerned, but has the severe drawback that it is not atomic on any given input string. Rather, its result depends on knowing all input strings right from the start, and on the ordering of the input set.

Anybody knows of an algorithm that has similar compression but doesn't require knowing all input strings upfront?! Hashing for example would not work for me, as depending on the size of the input set I'd need a hash length of 8-12 for the hashes to be unique, and that would be too long as replacements (currently, the replacement strings are 1-3 chars long for my use cases (<10,000 input strings)). Also, if theoreticians among us know this is wasted effort, I would be interested to hear :-) .

8
  • From what alphabet are the characters of the possible inputs drawn? eg just lower case letters; upper and lower case letter; alphanumerics; etc. Also, I think you mean 'deterministic' where you have 'atomic'. Commented Feb 23, 2011 at 17:58
  • Unless you give more details about the type input strings it will be hard to answer. There cannot be a generic algorithm which works on individual strings without collisions. Consider a huge file as a single string. Now you are trying to represent that using just 3 bytes... Commented Feb 23, 2011 at 18:11
  • @AakashM The input strings are basically (?u)[a-zA-z_$][\.\w$]*, so unicode alphanums with a few extra chars. With 'atomic' I meant that I can't compute the replacement for a given input string on its own and get away with it, because, yes, it would not be deterministic. Commented Feb 23, 2011 at 18:11
  • @Moron Think of identifiers in C-like program text. Commented Feb 23, 2011 at 18:13
  • Is this for a javascript minifier or something similar? Commented Feb 23, 2011 at 20:08

2 Answers 2

1

You could use your enumeration scheme, but sorted by the order in which you first encounter the input strings.

For example, the first string you ever process can be mapped to "a". The next distinct string would be mapped to "b", etc.

Every time you process a string, you'd need to look it up to see if it has already been mapped.

Sign up to request clarification or add additional context in comments.

1 Comment

Oh yes, the lookup relaxes on the ordering issue. Thanks!
1

"Optimally short" depends on the population of strings from which your samples are drawn. In the absence of systematic redundancy in the population, you will find that only a fraction of arbitrary strings can be compressed at all (e.g., consider trying to compress random bit strings).

If you can make assumptions about your data, such as "the strings are expected to be mainly composed of English words" then you can do something simple and effective based on letter frequency (e.g., for English, the relative frequency order is something like ETAOINSHRDLUGCY..., so you would want to use fewer bits to represent Es and more bits to represent uncommon letters like Q).

Cheers.

3 Comments

Thanks, but it's not about an encoding that has to be decoded at some point (Maybe I should have avoided the term 'compression'). It's really about an bijective mapping from strings to (nearly) arbitrary short strings. I guess typical string compression algorithms would leave me with replacements that are much longer than 1-3 chars.
@ThomasH -- Er, a bijection from arbitrary long strings to short strings is compression!
Agreed :). It's just that people often think of it as a process that has to be reversed at some point (aka "uncompress"), which is not what I need.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.