In Run Length Encoding (RLE), a large set of information is encoded by storing the quantity of consecutive sequences. A canonical example is:
WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW
Becomes
12W1B12W3B24W1B14W
I'm using an algorithm that is similar, but slightly different.
The data to encode is an associative array. When sorted, it looks like this:
[{A:1},{B:1},{C:1},{D:1},{E:1},{F:2},{G:2},{H:2},{I:2},{M:3},{N:3},{O:3},{W:4},{X:4}]
I can compress this using half-open intervals:
[{[A,F):1},{[F,J):2},{[M,P):3},{[W,Y):4}]
If I account for gaps, I only need to describe the partitions, and can remove the end-value of the intervals:
[{A:1},{F:2},{J:-},{M:3},{P:-},{W:4},{Y:-}]
And now I can write this as a single string:
A1F2J-M3P-W4Y-
If I had not gone through this process and just recorded every element separately, I'd have:
A1B1C1D1E1F2G2H2I2M3N3O3W4X4
So there is significant compression, which will be more dramatic the more consecutive values are in the source data.
Notes:
Characters are just for illustration; It works with byte arrays as well.
Maintaining sort order when reversing the algorithm is not a requirement, as the input is a typical associative array, as in a
HashTableorDictionary, and sequence is usually not guaranteed in these data types.It is quite easy to look up a value from the encoded string without having to fully decode it. For example, I can find
N=3by scanning every other character, looking for the first character that is greater thanN(which would beP) and taking the value from the previous entry. This can be done with either linear or binary search.
Questions
Is this a well-known algorithm?
Is it just a variation of Run Length Encoding?
If it's not RLE, then does it have a canonical name of its own?
I'm not asking if it works or not - I already know it does. I'm simply asking if this algorithm has a name of its own, or if it's just a variation on RLE.