Different UTF-16 Encoding in Java versus C#

Question

I am struggling with different results when converting a string to bytes in C# vs. Java.

C#:

byte[] byteArray =  Encoding.Unicode.GetBytes ("chess ¾");
for (int i = 0; i < byteArray.Length; i++)
    System.Diagnostics.Debug.Write (" " + byteArray[i]);
System.Diagnostics.Debug.WriteLine("");
System.Diagnostics.Debug.WriteLine(Encoding.Unicode.GetString(byteArray));

displays:

99 0 104 0 101 0 115 0 115 0 32 0 190 0
chess ¾

Java:

byte[] byteArray = "chess ¾".getBytes("UTF-16LE");
for (int i = 0; i < byteArray.length; i++)
        System.out.print(" " + (byteArray[i]<0?(-byteArray[i]+128):byteArray[i]));
System.out.println("");
System.out.println(new String(byteArray,"UTF-16LE"));

displays:

99 0 104 0 101 0 115 0 115 0 32 0 194 0
chess ¾

Notice that the second to last value in the byte array is different! My objective is to encrypt this data and be able to read it from either C# or Java. This difference appears to be an obstacle.

As a side note, before I learned to use Unicode(C#)/UTF-16LE(Java), I was using UTF-8 ...

C#: byte[] byteArray = Encoding.UTF8.GetBytes ("chess ¾");

displays: 99 104 101 115 115 32 194 190

Java: byteArray = appName.getBytes("UTF-8");

displays: 99 104 101 115 115 32 190 194

Strangely, this results in the last two bytes being flipped.

Lastly, Unicode for ¾ is decimal 190 (http://www.fileformat.info/info/unicode/char/BE/index.htm), not decimal 194 (Â) (http://www.fileformat.info/info/unicode/char/00c2/index.htm).

Any help would be greatly appreciated.

Curious - What kind of output do you get if you manually put the byte array from one into the other (i.e., try to decode the bytes)?
– Krease, Commented Dec 8, 2015 at 23:16
Oh, and welcome to StackOverflow! This is an excellent first question :)
– Krease, Commented Dec 8, 2015 at 23:18
Good question: In Java: byteArray = new byte[] {99, 0, 104, 0, 101, 0, 115, 0, 115, 0, 32, 0, -62, 0}; displays: 99 0 104 0 101 0 115 0 115 0 32 0 190 0 chess Â
– baskren, Commented Dec 8, 2015 at 23:21
And in C#: byte[] byteArray = new byte[] {99, 0, 104, 0, 101, 0, 115, 0, 115, 0, 32, 0, 194, 0}; displays: 99 0 104 0 101 0 115 0 115 0 32 0 194 0 chess Â
– baskren, Commented Dec 8, 2015 at 23:25
@Chris - and thanks for responding. Any insights are greatly appreciated.
– baskren, Commented Dec 8, 2015 at 23:31

morgano · Accepted Answer · 2015-12-09 00:29:07Z

4

Your problem is not the encoding; it is the way you are printing the results. You convert each byte to an integer with byteArray[i] < 0 ? (-byteArray[i] + 128) : byteArray[i], which gives incorrect results. Use something like byteArray[i] & 0xFF instead. Compare both conversions using this proof of concept:

    String encoding = "UTF-16LE";
    byte[] byteArray = "chess ¾".getBytes(encoding);
    for (int i = 0; i < byteArray.length; i++) {
        // your conversion
        System.out.print(" " + (byteArray[i] < 0 ? (-byteArray[i] + 128) : byteArray[i]));
        // a more appropriate one
        System.out.print("(" + (byteArray[i] & 0xFF) + ") ");
    }
    System.out.println("");
    System.out.println(new String(byteArray, encoding));
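
To make the arithmetic concrete, here is a minimal sketch that applies both conversions to the bytes of ¾ alone. The class and method names (SignedByteDemo, questionFormula, unsigned, render) are made up for illustration. Java bytes are signed, so 0xBE is stored as -66 and 0xC2 as -62. The question's formula turns -66 into 66 + 128 = 194 and -62 into 62 + 128 = 190, which would explain both the "wrong" UTF-16LE value and the apparently flipped UTF-8 bytes.

```java
import java.nio.charset.StandardCharsets;

public class SignedByteDemo {
    // The formula from the question: negates the signed value, then adds 128.
    static int questionFormula(byte b) {
        return b < 0 ? (-b + 128) : b;
    }

    // Masking with 0xFF keeps the low 8 bits, giving the unsigned value 0..255.
    static int unsigned(byte b) {
        return b & 0xFF;
    }

    // Formats the bytes the same way the question does: a space before each value.
    static String render(byte[] bytes, boolean useQuestionFormula) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(' ').append(useQuestionFormula ? questionFormula(b) : unsigned(b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u00BE"; // ¾, escaped so the source file encoding does not matter
        byte[] utf16le = s.getBytes(StandardCharsets.UTF_16LE); // signed values: -66, 0
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);       // signed values: -62, -66

        System.out.println(render(utf16le, true));  // prints " 194 0"
        System.out.println(render(utf16le, false)); // prints " 190 0"
        System.out.println(render(utf8, true));     // prints " 190 194"
        System.out.println(render(utf8, false));    // prints " 194 190"
    }
}
```

On Java 8 and later, Byte.toUnsignedInt(b) should give the same result as b & 0xFF and may read more clearly.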

edited Dec 9, 2015 at 0:29

answered Dec 8, 2015 at 23:50

morgano


5 Comments

sstan Over a year ago

This also means that using UTF-8 was also working fine.

sstan Over a year ago

I would just like to add that the problem is not an overflow one. It's just that the formula used was wrong, plain and simple. byteArray[i] < 0 ? byteArray[i] + 256 : byteArray[i] would have also worked fine, for instance.

morgano Over a year ago

@sstan you're right, let's call it an "underflow" caused by the - operation, as byte b = -1; System.out.println(">> " + (-b)) would produce >> 1 and not >> -2

morgano Over a year ago

@sstan on second thought, you're right, the problem is not underflow/overflow; the formula used is plain wrong

baskren Over a year ago

Thank you all - your help is greatly appreciated. This is my favorite way of being wrong.

Community · Accepted Answer · 2020-06-20 09:12:55Z

1

My guess.

UTF-16LE means that characters take 2 or 4 bytes.

Check this out and scroll down to 3/4. You will see both a 190 and a 194 (11000010 10111110) - these are the two bytes you need to encode the symbol, which is apparently called "VULGAR FRACTION THREE QUARTERS".

When you create a byte[], the array can only store 1 byte, never two, so you will miss one. It looks like in C# you miss 194, and in Java you miss 190.

The reason is endianness. See this answer.

In Java, getBytes("UTF-16") returns a big-endian representation.

C#'s System.Text.Encoding.Unicode.GetBytes returns a little-endian representation.

However, in Java, getBytes("UTF-16LE") returns in little-endian according to this, and that is what you are using.

I'm having doubts now.

I need to think more about what exactly you're doing in Java. Not sure yet how to resolve it.
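
One way to check the endianness theory is to dump the bytes Java produces for ¾ under each UTF-16 variant. This is only a sketch; the class and helper names (Utf16EndianDemo, hex) are invented for illustration. It suggests that endianness changes the position of the 0xBE byte, not its value, so it would not turn 190 into 194.

```java
import java.nio.charset.StandardCharsets;

public class Utf16EndianDemo {
    // Renders bytes as two-digit uppercase hex, separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u00BE"; // ¾, escaped so the source file encoding does not matter

        // Little-endian: low byte first, no byte-order mark.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16LE))); // prints "BE 00"
        // Big-endian: high byte first, no byte-order mark.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE))); // prints "00 BE"
        // Plain UTF-16: Java writes a big-endian byte-order mark, then big-endian data.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16)));   // prints "FE FF 00 BE"
    }
}
```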

edited Jun 20, 2020 at 9:12

CommunityBot

answered Dec 8, 2015 at 23:16

pushkin

3 Comments

baskren Over a year ago

Thanks for responding, @Pushkin. Something like your concern led me away from using UTF-8 to UTF-16LE (as well as UTF-16BE). The UTF-8 conversion to byte[] didn't lose any bytes - but their order was different between C# and Java (I've edited my original post to illustrate this).

baskren Over a year ago

See this posting (stackoverflow.com/a/9438470/4868078) for why I started using UTF-16LE.

Wai Ha Lee Over a year ago

UTF-16 is always in blocks of 16 bits - either one or two blocks per character. From Wikipedia: "The encoding is variable-length, as code points are encoded with one or two 16-bit code units."
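
The "one or two 16-bit code units" point can be illustrated with a short sketch. The names Utf16CodeUnitDemo and utf16ByteLength are hypothetical, and U+1F600 is just an arbitrary character outside the Basic Multilingual Plane, chosen for illustration. ¾ (U+00BE) fits in a single code unit, so it occupies 2 bytes; a supplementary character needs a surrogate pair and occupies 4.

```java
import java.nio.charset.StandardCharsets;

public class Utf16CodeUnitDemo {
    // Number of bytes the string occupies when encoded as UTF-16LE (no byte-order mark).
    static int utf16ByteLength(String s) {
        return s.getBytes(StandardCharsets.UTF_16LE).length;
    }

    public static void main(String[] args) {
        String threeQuarters = "\u00BE";                          // one 16-bit code unit
        String supplementary = new String(Character.toChars(0x1F600)); // surrogate pair

        System.out.println(utf16ByteLength(threeQuarters)); // prints 2
        System.out.println(utf16ByteLength(supplementary)); // prints 4
        System.out.println(supplementary.length());         // prints 2 (two char code units)
    }
}
```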
