1

In Run Length Encoding (RLE), a large set of information is encoded by storing the quantity of consecutive sequences. A canonical example is:

WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW

Becomes

12W1B12W3B24W1B14W

I'm using an algorithm that is similar, but slightly different.

The data to encode is an associative array. When sorted, it looks like this:

[{A:1},{B:1},{C:1},{D:1},{E:1},{F:2},{G:2},{H:2},{I:2},{M:3},{N:3},{O:3},{W:4},{X:4}]

I can compress this using half-open intervals:

[{[A,F):1},{[F,J):2},{[M,P):3},{[W,Y):4}]

If I account for gaps, I only need to describe the partitions, and can remove the end-value of the intervals:

[{A:1},{F:2},{J:-},{M:3},{P:-},{W:4},{Y:-}]

And now I can write this as a single string:

A1F2J-M3P-W4Y-

If I had not gone through this process and just recorded every element separately, I'd have:

A1B1C1D1E1F2G2H2I2M3N3O3W4X4

So there is significant compression, which will be more dramatic the more consecutive values are in the source data.

Notes:

  • Characters are just for illustration; It works with byte arrays as well.

  • Maintaining sort order when reversing the algorithm is not a requirement, as the input is a typical associative array, as in a HashTable or Dictionary, and sequence is usually not guaranteed in these data types.

  • It is quite easy to look up a value from the encoded string without having to fully decode it. For example, I can find N=3 by scanning every other character, looking for the first character that is greater than N (which would be P) and taking the value from the previous entry. This can be done with either linear or binary search.

Questions

  1. Is this a well-known algorithm?

  2. Is it just a variation of Run Length Encoding?

  3. If it's not RLE, then does it have a canonical name of its own?

I'm not asking if it works or not - I already know it does. I'm simply asking if this algorithm has a name of its own, or if it's just a variation on RLE.

10
  • 1
    I don't understand the downvote. Hover over the downvote arrow and find the criteria. 1) I have researched this and found only RLE, and 2) The question is worded very clearly. Also it is on-topic for this site, is it not? Commented Oct 20, 2015 at 18:54
  • I think the question belongs better here than on stack overflow. Perhaps someone considered the scope too narrow to be useful? Hashing and compression are pretty broad subjects though. Commented Oct 20, 2015 at 19:04
  • 2
    @MattJohnson: Note: I'm not criticizing your question, just clarifying your comment ;-) It is commonly accepted that these kinds of questions are kind of circular, because in order to be allowed to ask the question in the first place, the thing to name has to be a well-known concept with a single canonical name, which means that strictly speaking in order to know whether or not the question is on-topic, you need to already know the answer. So, we try to walk a very fine line and err on the side of closing, which is necessary for site quality and unfortunately annoying and unhelpful for askers. Commented Oct 20, 2015 at 20:32
  • 1
    In short, "We're not a dictionary." Commented Oct 20, 2015 at 20:34
  • 2
    cs.se probably doesn't accept "name that thing" questions either. You might want to check with them first (every site has a chat room, if you're not sure). Commented Oct 20, 2015 at 20:39

0

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.