Can some ASCII characters interrupt/end any string/array/stream in Java?

Question

I'm currently working on a little program that compresses text by replacing repeated words/phrases with a reference to the next occurrence, thus compressing a string into a shorter string with no metadata, arrays, or whatever other techniques are used in real compression. My references are stored as pairs of chars, roughly like this:

(char)7 + (char)((length << 4) + offset)

where (char)7 is just an arbitrarily selected char for signaling a compressed reference. Both length and offset are full-range byte variables referring to the number of words that will be substituted and the offset until the next occurrence. (It's not relevant for the question, but I'm treating them as unsigned bytes by manual unsigned<->signed conversion.)

// Example of what the compression would produce:
String input = "compression and compression";
// Start from "" so the chars are concatenated rather than added as ints.
String output = "" + (char)7 + (char)18 + " and compression";
// (char)18, binary 0001 0010, means: repeat 1 word, found 2 words ahead.
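A minimal sketch of the packing described above, assuming length and offset each fit in four bits, as the (char)18 example implies. The class and method names (ReferencePacking, pack, lengthOf, offsetOf, reference) are hypothetical and only serve the illustration.

```java
final class ReferencePacking {
    // Marker char taken from the question; the text itself must never contain it.
    static final char MARKER = (char) 7;

    // Packs a 4-bit length and a 4-bit offset into a single char.
    static char pack(int length, int offset) {
        if (length < 0 || length > 15 || offset < 0 || offset > 15) {
            throw new IllegalArgumentException("length and offset must be in 0..15");
        }
        // The inner parentheses matter: in Java, + binds tighter than <<.
        return (char) ((length << 4) | offset);
    }

    // Extracts the high nibble (the word count).
    static int lengthOf(char packed) {
        return (packed >> 4) & 0xF;
    }

    // Extracts the low nibble (the distance in words).
    static int offsetOf(char packed) {
        return packed & 0xF;
    }

    // Builds the two-char reference; starting from "" forces string concatenation.
    static String reference(int length, int offset) {
        return "" + MARKER + pack(length, offset);
    }

    public static void main(String[] args) {
        String output = reference(1, 2) + " and compression";
        System.out.println((int) output.charAt(0)); // the marker value
        System.out.println((int) output.charAt(1)); // the packed length/offset
        System.out.println(lengthOf(output.charAt(1)) + " word(s), " + offsetOf(output.charAt(1)) + " ahead");
    }
}
```

Packing both values into one char like this caps each at 15; full-range bytes would need a wider layout.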

TL;DR: I'm afraid there may be situations in which my custom char gets interpreted as a special ASCII character. I am aware that Strings in Java ignore \0 characters (according to this question). But are there any other Java methods/classes that could cause problems, say if I were to send/convert the compressed string using things like streams, buffers, readers, char arrays and so on?

Joop Eggen · Accepted Answer · 2016-07-20 12:18:17Z

1

A String holds Unicode symbols, called code points. A char is 2 bytes and holds one UTF-16 code unit, and UTF-16 is a special format: code points above the 2-byte range (2¹⁶ and up) are represented by surrogate pairs of 2 chars.
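To make the surrogate-pair point concrete, here is a small sketch (the class name SurrogateDemo and the choice of U+1F600 are just for illustration). It shows one code point occupying two chars, and that small values such as 18 are not in the surrogate range.

```java
final class SurrogateDemo {
    // One code point outside the 16-bit range, built without relying on source-file encoding.
    static String aboveBmp() {
        return new String(Character.toChars(0x1F600));
    }

    public static void main(String[] args) {
        String s = aboveBmp();
        System.out.println(s.length());                       // number of chars (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length()));  // number of code points
        System.out.println(Character.isHighSurrogate(s.charAt(0)));
        System.out.println(Character.isLowSurrogate(s.charAt(1)));
        // A small control value such as 7 or 18 is not a surrogate.
        System.out.println(Character.isSurrogate((char) 18));
    }
}
```

So a scheme that only ever stores values below 256 in a char never produces a surrogate by accident; arbitrary 16-bit values could.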

As an alternative to char, you could do everything in code points, which in Java are of type int. However, there is an upper limit to legal Unicode (U+10FFFF).

You could, however, get away with your encoding as long as you never convert the string to bytes in some encoding. That conversion is the real problem.
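A rough sketch of why the byte conversion is the risky step (EncodingRoundTrip and survives are hypothetical names). It encodes a string with a given charset, decodes it again, and checks whether the original came back. Values below 128 survive US-ASCII, a value such as 200 does not, and a lone surrogate does not survive UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingRoundTrip {
    // True if encoding to bytes and decoding again yields the same string.
    static boolean survives(String s, Charset cs) {
        byte[] bytes = s.getBytes(cs);
        return new String(bytes, cs).equals(s);
    }

    public static void main(String[] args) {
        String small = "\u0007\u0012 and compression"; // marker 7, packed value 18
        String large = "\u0007\u00C8";                 // marker 7, packed value 200
        String lone = "\uD800";                        // unpaired surrogate

        System.out.println(survives(small, StandardCharsets.US_ASCII));
        System.out.println(survives(large, StandardCharsets.US_ASCII));   // 200 is unmappable in ASCII
        System.out.println(survives(large, StandardCharsets.ISO_8859_1)); // one byte per char below 256
        System.out.println(survives(large, StandardCharsets.UTF_8));      // kept, but as more bytes
        System.out.println(large.getBytes(StandardCharsets.UTF_8).length);
        System.out.println(survives(lone, StandardCharsets.UTF_8));
    }
}
```

Whether a given string survives therefore depends on the charset used on both ends, not on any character being "special" to Java itself.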

As there is no clean solution with Strings, working with byte[], ByteArrayOutputStream or ByteBuffer (with putShort and such) might be cleaner.
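One possible shape for the byte-based approach, as a sketch only. The layout (marker byte, one length byte, one offset byte, then the literal text as UTF-8) and the names ByteReferenceWriter, encode and unsignedAt are assumptions for illustration, not something the question prescribes. It also assumes the literal text never contains U+0007.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class ByteReferenceWriter {
    // Marker byte taken from the question's (char)7.
    static final int MARKER = 7;

    // Writes marker, length, offset, then the remaining literal text as UTF-8.
    static byte[] encode(int length, int offset, String literalTail) {
        if (length < 0 || length > 255 || offset < 0 || offset > 255) {
            throw new IllegalArgumentException("length and offset must be in 0..255");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(MARKER);
        out.write(length); // write(int) stores the low 8 bits
        out.write(offset);
        byte[] text = literalTail.getBytes(StandardCharsets.UTF_8);
        out.write(text, 0, text.length);
        return out.toByteArray();
    }

    // Reads a byte back as an unsigned value in 0..255.
    static int unsignedAt(byte[] data, int index) {
        return data[index] & 0xFF;
    }

    public static void main(String[] args) {
        byte[] data = encode(200, 2, " and compression");
        System.out.println(data.length);
        System.out.println(unsignedAt(data, 1)); // length, read back unsigned
        System.out.println(unsignedAt(data, 2)); // offset
    }
}
```

Because the output is never a String, no charset is involved in the reference bytes, and the `& 0xFF` mask replaces manual signed/unsigned conversion.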

answered Jul 20, 2016 at 12:18

Joop Eggen


2 Comments

Chexxor Over a year ago

Do you mean that I should strictly avoid converting my compressed string into bytes? Would I be better off never making the output into a string and rather make a byte[] directly for the compressed data?

Joop Eggen Over a year ago

Yes, that way you can have short offsets (one byte), long offsets (two bytes) and so on, and you can use indices into the byte array. It should look nicer.

Stephen C · Accepted Answer · 2016-07-20 11:56:13Z

1

There are no values that have any special meaning of "interrupting" or "ending" a Java string, array or stream.

(At least, not unless you have designed your application, or used / chosen a protocol or encoding that places a special meaning of that nature on specific values. I don't imagine that you have done ... because if you had done, you would not be asking this question.)
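A quick way to check this yourself, as a sketch (NoTerminatorDemo and countBytes are hypothetical names). It puts control values such as 0 and 7 into a String and a byte stream and confirms that every element is still there and still readable.

```java
import java.io.ByteArrayInputStream;

final class NoTerminatorDemo {
    // Reads the stream to its end and counts how many bytes were delivered.
    static int countBytes(byte[] data) {
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        int count = 0;
        // read() returns -1 only at end of stream; a 0xFF byte comes back as 255.
        while (in.read() != -1) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String s = "ab\u0000cd\u0007ef";
        System.out.println(s.length());              // NUL and BEL count like any other char
        System.out.println(s.toCharArray().length);
        System.out.println(s.indexOf('e'));          // chars after the control values are reachable

        byte[] data = {0, 7, 4, (byte) 0xFF, 65};
        System.out.println(countBytes(data));        // all five bytes are read
    }
}
```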

edited Jul 20, 2016 at 11:56

answered Jul 20, 2016 at 11:49

Stephen C


1 Comment

Chexxor Over a year ago

I see! No, I haven't made my own protocol/encoder, but I just started learning about network programming, and I don't know things like how another computer would interpret an incoming stream of bytes, or how it would convert that stream back into a string.
