I'm currently working on a little program that compresses text by replacing repeated words/phrases with a reference to the next occurrence - thus compressing a string into a shorter string with no metadata or arrays or whatever techniques are used in real compression. My references are stored as pairs of chars in a sense like this:

(char)7 + (char)((length << 4) + offset)

where (char)7 is just an arbitrarily selected char for signaling a compressed reference. Both length and offset are full range byte variables referring to the number of words that will be substituted and the offset until the next occurrence. (It's not relevant for the question, but I'm treating them as unsigned bytes by manual unsigned<->signed conversion.)

// Example: compressing the input below would give this output.
String input = "compression and compression";
// The leading "" forces string concatenation; (char)7 + (char)18 on its own is int addition.
String output = "" + (char)7 + (char)18 + " and compression";
// (char)18 is binary 0001 0010: repeat 1 word, found 2 words ahead.
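As a rough sketch of the packing described above, assuming the two fields are 4 bits each (which is what the 0001 0010 example implies): the class and method names here are made up for illustration. Note that in Java `+` binds more tightly than `<<`, so the shift needs its own parentheses, and the reference string is built from a char array because adding two char values yields an int.

```java
// Hypothetical helper; only the marker value (char)7 comes from the question.
final class ReferenceCodec {
    static final char MARKER = (char) 7;

    // Packs two 4-bit fields into one char: high nibble = length, low nibble = offset.
    static char pack(int length, int offset) {
        if (length < 0 || length > 15 || offset < 0 || offset > 15) {
            throw new IllegalArgumentException("length and offset must each fit in 4 bits");
        }
        return (char) ((length << 4) | offset);
    }

    // Extracts the high nibble (number of words to repeat).
    static int length(char packed) {
        return (packed >> 4) & 0x0F;
    }

    // Extracts the low nibble (distance to the next occurrence, in words).
    static int offset(char packed) {
        return packed & 0x0F;
    }

    // Builds the two-char reference: marker followed by the packed value.
    static String reference(int length, int offset) {
        return new String(new char[] { MARKER, pack(length, offset) });
    }
}
```

With this sketch, `ReferenceCodec.reference(1, 2) + " and compression"` gives the output string from the example.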

TL;DR: I'm afraid there may be situations in which my custom char is interpreted as a special ASCII character. I am aware that strings in Java ignore \0 characters (according to this question). But are there any other Java methods or classes that could cause problems, say if I were to send or convert the compressed string using streams, buffers, readers, char arrays and so on?

2 Answers

A String holds Unicode symbols, called code points. A char is 2 bytes and holds one UTF-16 code unit. In particular, code points above the 2-byte range (2^16 and up) are represented by surrogate pairs of 2 chars.
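To see what surrogate pairs mean in practice, here is a small sketch (the class and method names are made up for illustration). It also checks that a packed value such as (char)18 is not a surrogate: surrogates occupy U+D800 to U+DFFF, so chars in the range 0..255 are never mistaken for half of a pair.

```java
// Hypothetical demo class, not part of any library.
final class SurrogateDemo {
    // Builds a String from a single code point (an int in Java).
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    // Counts code points rather than chars.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s = fromCodePoint(0x1F600);                           // a code point above U+FFFF
        System.out.println(s.length());                              // 2 chars (UTF-16 code units)
        System.out.println(codePoints(s));                           // 1 code point
        System.out.println(Character.isHighSurrogate(s.charAt(0)));  // true
        System.out.println(Character.isSurrogate((char) 18));        // false
    }
}
```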

As an alternative to char, you could do everything in code points, which in Java are of type int. However, there is an upper limit to legal Unicode (U+10FFFF).

You could, however, get away with your encoding as long as you never convert the string to bytes in some encoding. And that conversion is the real problem.
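A small sketch of why the byte conversion matters, assuming a packed value above 127 (here 0xF2, i.e. length 15 and offset 2 in a 4-bit layout). The class name is hypothetical. UTF-8 and ISO-8859-1 both round-trip this string, but UTF-8 needs two bytes for the char 0xF2, and US-ASCII cannot represent it at all, so `getBytes` substitutes a replacement byte and the value is lost.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class.
final class EncodingRoundTrip {
    // A compressed string whose packed char is above the ASCII range.
    static final String SAMPLE = "" + (char) 7 + (char) 0xF2 + " and compression";

    // Encodes to bytes and decodes again with the same charset.
    static String roundTrip(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(SAMPLE, StandardCharsets.UTF_8).equals(SAMPLE));      // true
        System.out.println(roundTrip(SAMPLE, StandardCharsets.ISO_8859_1).equals(SAMPLE)); // true
        System.out.println(roundTrip(SAMPLE, StandardCharsets.US_ASCII).equals(SAMPLE));   // false
        // One char is not always one byte: UTF-8 uses two bytes for 0xF2.
        System.out.println(SAMPLE.getBytes(StandardCharsets.UTF_8).length - SAMPLE.length()); // 1
    }
}
```

The same applies to `getBytes()` with no argument, which uses the default charset, so the sender and receiver should name the same charset explicitly.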

As there is no clean solution with strings, working with byte[], ByteArrayOutputStream or ByteBuffer (with putShort and such) might be cleaner.
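A minimal sketch of the byte-oriented approach, under some assumptions that are not in the question: the class and method names are hypothetical, the literal text is written as UTF-8, the fields are 4 bits each, and the input never contains the character U+0007 (otherwise the marker byte would be ambiguous).

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical writer; the compressed data never exists as a String.
final class ByteCompressor {
    static final int MARKER = 7; // marker byte, same value as in the question

    // Writes one reference as two raw bytes: marker, then packed length/offset.
    static void writeReference(ByteArrayOutputStream out, int length, int offset) {
        out.write(MARKER);
        out.write((length << 4) | offset); // write(int) stores only the low 8 bits
    }

    // Writes literal text as UTF-8 bytes.
    static void writeText(ByteArrayOutputStream out, String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        out.write(bytes, 0, bytes.length);
    }

    // Compressed form of "compression and compression" from the question.
    static byte[] example() {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeReference(out, 1, 2);
        writeText(out, " and compression");
        return out.toByteArray();
    }
}
```

Here `ByteCompressor.example()` yields 18 bytes: the marker, the packed byte 18, and the 16 bytes of literal text.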

2 Comments

Do you mean that I should strictly avoid converting my compressed string into bytes? Would I be better off never making the output into a string and rather make a byte[] directly for the compressed data?
Yes. That way you can have short offsets (one byte), long offsets (two bytes) and so on, and you can use indices into the byte array. It should look nicer.
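
One possible reading of the short/long offset idea, sketched with ByteBuffer. The two marker values and all names are assumptions made for illustration, not something the answer specifies. Masking with `& 0xFF` and `& 0xFFFF` on the way out is what turns Java's signed byte and short back into unsigned values.

```java
import java.nio.ByteBuffer;

// Hypothetical encoding: one marker for 1-byte offsets, another for 2-byte offsets.
final class OffsetEncoding {
    static final byte SHORT_REF = 7; // assumed marker: one offset byte follows
    static final byte LONG_REF = 8;  // assumed marker: two offset bytes follow

    // Encodes an offset in 0..65535 using the shortest form that fits.
    static byte[] encode(int offset) {
        if (offset < 0 || offset > 0xFFFF) {
            throw new IllegalArgumentException("offset out of range: " + offset);
        }
        if (offset <= 0xFF) {
            return ByteBuffer.allocate(2).put(SHORT_REF).put((byte) offset).array();
        }
        return ByteBuffer.allocate(3).put(LONG_REF).putShort((short) offset).array();
    }

    // Decodes a reference produced by encode, returning the unsigned offset.
    static int decode(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        byte marker = buf.get();
        if (marker == SHORT_REF) {
            return buf.get() & 0xFF;
        }
        if (marker == LONG_REF) {
            return buf.getShort() & 0xFFFF;
        }
        throw new IllegalArgumentException("not a reference");
    }
}
```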

There are no values that have any special meaning of "interrupting" or "ending" a Java string, array or stream.

(At least, not unless you have designed your application, or used / chosen a protocol or encoding that places a special meaning of that nature on specific values. I don't imagine that you have done ... because if you had done, you would not be asking this question.)
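A quick way to check this for yourself is to push a string containing \0, (char)7 and (char)18 through a writer and a reader and compare the result. This is only a sketch: the class name is made up, and it assumes both ends use UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class.
final class ControlCharDemo {
    // Writes the string to an in-memory stream and reads it back char by char.
    static String throughStreams(String s) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            try (Writer w = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
                w.write(s);
            }
            StringBuilder sb = new StringBuilder();
            try (Reader r = new InputStreamReader(
                    new ByteArrayInputStream(sink.toByteArray()), StandardCharsets.UTF_8)) {
                int c;
                while ((c = r.read()) != -1) { // -1 marks end of stream; it is not a char value
                    sb.append((char) c);
                }
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String s = "a\0b" + (char) 7 + (char) 18 + "c";
        System.out.println(s.length());                  // 6: the \0 is an ordinary char
        System.out.println(throughStreams(s).equals(s)); // true
    }
}
```

One thing to watch: line-oriented helpers such as `BufferedReader.readLine()` do give \n and \r a meaning, and those are (char)10 and (char)13, so a packed value equal to one of them would be consumed as a line break.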

1 Comment

I see! No, I haven't made my own protocol/encoder, but I just started learning about network programming, and I don't know things like how another computer would interpret an incoming stream of bytes, or how it would convert that stream back into a string.
