4

I am struggling with different results when converting a string to bytes in C# vs. Java.

C#:

byte[] byteArray =  Encoding.Unicode.GetBytes ("chess ¾");
for (int i = 0; i < byteArray.Length; i++)
    System.Diagnostics.Debug.Write (" " + byteArray[i]);
System.Diagnostics.Debug.WriteLine("");
System.Diagnostics.Debug.WriteLine(Encoding.Unicode.GetString(byteArray));

displays:

99 0 104 0 101 0 115 0 115 0 32 0 190 0
chess ¾

Java:

byte[] byteArray = "chess ¾".getBytes("UTF-16LE");
for (int i = 0; i < byteArray.length; i++)
        System.out.print(" " + (byteArray[i]<0?(-byteArray[i]+128):byteArray[i]));
System.out.println("");
System.out.println(new String(byteArray, "UTF-16LE"));

displays:

99 0 104 0 101 0 115 0 115 0 32 0 194 0
chess ¾

Notice that the second to last value in the byte array is different! My objective is to encrypt this data and be able to read it from either C# or Java. This difference appears to be an obstacle.

As a side note, before I learned to use Unicode(C#)/UTF-16LE(Java), I was using UTF-8 ...

C#: byte[] byteArray = Encoding.UTF8.GetBytes ("chess ¾");

displays: 99 104 101 115 115 32 194 190

Java: byteArray = appName.getBytes("UTF-8");

displays: 99 104 101 115 115 32 190 194

Strangely, this results in the last two bytes being swapped.

Lastly, Unicode for ¾ is decimal 190 (http://www.fileformat.info/info/unicode/char/BE/index.htm), not decimal 194 (Â) (http://www.fileformat.info/info/unicode/char/00c2/index.htm).

Any help would be greatly appreciated.

10
  • Curious - What kind of output do you get if you manually put the byte array from one into the other (ie: try to decode the bytes)? Commented Dec 8, 2015 at 23:16
  • Oh, and welcome to StackOverflow! This is an excellent first question :) Commented Dec 8, 2015 at 23:18
  • Good question: In Java: byteArray = new byte[] {99, 0, 104, 0, 101, 0, 115, 0, 115, 0, 32, 0, -62, 0}; displays: 99 0 104 0 101 0 115 0 115 0 32 0 190 0 chess Â Commented Dec 8, 2015 at 23:21
  • And in C#: byte[] byteArray = new byte[] {99, 0, 104, 0, 101, 0, 115, 0, 115, 0, 32, 0, 194, 0}; displays: 99 0 104 0 101 0 115 0 115 0 32 0 194 0 chess Â Commented Dec 8, 2015 at 23:25
  • @Chris - and thanks for responding. Any insights are greatly appreciated. Commented Dec 8, 2015 at 23:31

2 Answers

4

Your problem is not in the encoding; it is in the way you're printing the results. You convert each byte to an integer using byteArray[i] < 0 ? (-byteArray[i] + 128) : byteArray[i], which gives incorrect results for negative bytes. Use something like byteArray[i] & 0xFF instead. Compare both conversions using this proof of concept:

    String encoding = "UTF-16LE";
    byte[] byteArray = "chess ¾".getBytes(encoding);
    for (int i = 0; i < byteArray.length; i++) {
        // your conversion
        System.out.print(" " + (byteArray[i] < 0 ? (-byteArray[i] + 128) : byteArray[i]));
        // a more appropriate one
        System.out.print("(" + (byteArray[i] & 0xFF) + ") ");
    }
    System.out.println("");
    System.out.println(new String(byteArray, encoding));
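
To make the arithmetic concrete, here is a minimal sketch that isolates the one byte that differs. The class and method names (ByteDisplayDemo, questionFormula, unsigned) are made up for illustration, and the string uses the \u00BE escape for ¾ so the result does not depend on the source file's encoding.

```java
import java.nio.charset.StandardCharsets;

class ByteDisplayDemo {
    // The conversion used in the question (hypothetical helper name).
    static int questionFormula(byte b) {
        return b < 0 ? (-b + 128) : b;
    }

    // Masking with 0xFF reinterprets the signed byte as an unsigned value 0..255.
    static int unsigned(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte[] bytes = "chess \u00BE".getBytes(StandardCharsets.UTF_16LE);
        byte b = bytes[12]; // low byte of U+00BE, the 7th character

        System.out.println(b);                  // -66 (Java bytes are signed)
        System.out.println(questionFormula(b)); // 194, because -(-66) + 128 = 194
        System.out.println(unsigned(b));        // 190, the value C# prints
    }
}
```

The stored byte is the same in both languages (0xBE); only the way it was turned into a number for display differed.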

5 Comments

This also means that using UTF-8 was also working fine.
I would just like to add that the problem is not an overflow one. It's just that the formula used was wrong, plain and simple. byteArray[i] < 0 ? byteArray[i] + 256 : byteArray[i] would have also worked fine, for instance.
@sstan you're right, let's call it an "underflow" caused by the - operation, as byte b = -1; System.out.println(">> " + (-b)) would produce >> 1 and not >> -2
@sstan on second thought, you're right: the problem is not underflow/overflow, the formula used is plain wrong
Thank you all - your help is greatly appreciated. This is my favorite way of being wrong.
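
The two points made in these comments can be checked with a short sketch. It assumes Java 8 or later for Byte.toUnsignedInt; the class and helper names (UnsignedByteCheck, plus256, allAgree, unsignedDump) are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class UnsignedByteCheck {
    // The alternative formula suggested in the comments.
    static int plus256(byte b) {
        return b < 0 ? b + 256 : b;
    }

    // Checks that b + 256, b & 0xFF and Byte.toUnsignedInt agree for every byte value.
    static boolean allAgree() {
        for (int i = Byte.MIN_VALUE; i <= Byte.MAX_VALUE; i++) {
            byte b = (byte) i;
            if (plus256(b) != (b & 0xFF) || plus256(b) != Byte.toUnsignedInt(b)) {
                return false;
            }
        }
        return true;
    }

    // Formats a byte array as space-separated unsigned decimal values.
    static String unsignedDump(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(b & 0xFF);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(allAgree()); // true
        // UTF-8 bytes printed as unsigned values match the C# output in the question.
        System.out.println(unsignedDump("chess \u00BE".getBytes(StandardCharsets.UTF_8)));
        // 99 104 101 115 115 32 194 190
    }
}
```
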
1

My guess.

UTF-16LE means that characters take 2 or 4 bytes.

Check this out and scroll down to 3/4. You will see both a 190 and a 194 (11000010 10111110) - these are the two bytes you need to encode the symbol, which is apparently called "VULGAR FRACTION THREE QUARTERS".

When you create a byte[], the array can only store 1 byte, never two, so you will miss one. It looks like in C# you miss 194, and in Java you miss 190.

The reason is endianness. See this answer.

In Java, getBytes("UTF-16") returns a big-endian representation.

C#'s System.Text.Encoding.Unicode.GetBytes returns a little-endian representation.

However, in Java, getBytes("UTF-16LE") returns in little-endian according to this, and that is what you are using.

I'm having doubts now.

I need to think more about what exactly you're doing in Java. Not sure yet how to resolve it.
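
The endianness statements above can be checked directly with a small sketch. The class and helper names (EndianDemo, hexOf) are hypothetical; the charsets are the standard constants from java.nio.charset.StandardCharsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EndianDemo {
    // Encodes text with the given charset and renders the bytes as hex pairs.
    static String hexOf(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u00BE"; // VULGAR FRACTION THREE QUARTERS
        System.out.println(hexOf(s, StandardCharsets.UTF_16));   // FE FF 00 BE (byte order mark, then big-endian)
        System.out.println(hexOf(s, StandardCharsets.UTF_16BE)); // 00 BE
        System.out.println(hexOf(s, StandardCharsets.UTF_16LE)); // BE 00
    }
}
```

With UTF-16LE the character occupies exactly two bytes, BE 00, so no byte is lost on the Java side.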

3 Comments

Thanks for responding, @Pushkin. Something like your concern led me away from using UTF-8 to UTF-16LE (as well as UTF-16BE). The UTF-8 conversion to byte[] didn't lose any bytes - but their order was different between C# and Java (I've edited my original post to illustrate this).
See this posting (stackoverflow.com/a/9438470/4868078) for why I started using UTF-16LE.
UTF-16 is always in blocks of 16 bits - either one or two blocks per character. From Wikipedia: "The encoding is variable-length, as code points are encoded with one or two 16-bit code units."
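
As a rough illustration of the "one or two 16-bit code units" rule, the sketch below compares ¾ with a character outside the Basic Multilingual Plane. The class and helper names (CodeUnitDemo, utf16Bytes) are hypothetical, and U+1D11E (MUSICAL SYMBOL G CLEF) is just one convenient example of a supplementary character.

```java
import java.nio.charset.StandardCharsets;

class CodeUnitDemo {
    // Number of bytes the string occupies when encoded as UTF-16LE.
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16LE).length;
    }

    public static void main(String[] args) {
        String bmp = "\u00BE";        // one code unit
        String supp = "\uD834\uDD1E"; // U+1D11E, stored as a surrogate pair

        System.out.println(bmp.length());                          // 1 code unit
        System.out.println(utf16Bytes(bmp));                       // 2 bytes
        System.out.println(supp.codePointCount(0, supp.length())); // 1 code point
        System.out.println(supp.length());                         // 2 code units
        System.out.println(utf16Bytes(supp));                      // 4 bytes
    }
}
```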
