I expected that, since Java stores characters as UTF-16, each character would use 2 bytes, so "hello" should take 10 bytes. But this code:

    String h = "hello";
    System.out.println(new String(h.getBytes("UTF-16"), "UTF-16").length());
    System.out.println(new String(h.getBytes("UTF-8"), "UTF-8").getBytes("UTF-16").length);

prints 5 and then 12.

My questions:

(1) I expected that the first println should get "10" as I mentioned. But why 5?

(2) For the second println, I am trying to getBytes for it first as "UTF-8" then as "UTF-16". I suppose it should also be 10. But actually it's 12.

I'm on a Mac and my region is Hong Kong. Could you explain what is happening in the program and how "5 12" comes out?

Thanks a lot!

2 Answers

5

(1) I expected that the first println should get "10" as I mentioned. But why 5?

You take a 5-character string and encode it as bytes using UTF-16.
Then you create a new string by decoding those bytes (correctly) as UTF-16, which gives you a new string consisting of your original 5 characters again. length() is then called on that string, not on the byte array, so it returns 5.
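The round trip described above can be sketched as follows. This is a minimal illustration, not the asker's exact code: the class name RoundTripDemo and the helper roundTrip are made up for the example, and StandardCharsets.UTF_16 is used in place of the "UTF-16" string so that no checked exception needs handling.

```java
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Hypothetical helper: encode to UTF-16 bytes, then decode those bytes back.
    static String roundTrip(String s) {
        byte[] encoded = s.getBytes(StandardCharsets.UTF_16);
        return new String(encoded, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        String h = "hello";
        String copy = roundTrip(h);
        // The decoded string holds the same 5 characters as the original.
        System.out.println(copy.equals(h)); // true
        System.out.println(copy.length());  // 5
    }
}
```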

(2) For the second println, I am trying to getBytes for it first as "UTF-8" then as "UTF-16". I suppose it should also be 10. But actually it's 12.

This part of the code:

    new String(h.getBytes("UTF-8"), "UTF-8")

is effectively a no-op: a rather expensive way to copy a string. You encode the string to bytes using UTF-8, and then you create a new string by decoding those same bytes as UTF-8.

So effectively, you are doing this:

    "hello".getBytes("UTF-16").length

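The extra 2 bytes come from the BOM (byte order mark): when encoding, Java's "UTF-16" charset writes a big-endian BOM (the bytes FE FF) as the first 2-byte code unit. The 5 characters take 10 bytes, and the BOM brings the total to 12.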
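To see the BOM directly, one can print the first two bytes and compare against the charsets that fix the byte order. This is a small sketch; the class name Utf16BomDemo and the helper encodedLength are hypothetical names chosen for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Utf16BomDemo {
    // Hypothetical helper: number of bytes produced when encoding s with cs.
    static int encodedLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        byte[] withBom = "hello".getBytes(StandardCharsets.UTF_16);
        // Mask with 0xFF so the bytes print as unsigned hex values.
        System.out.printf("%02X %02X%n", withBom[0] & 0xFF, withBom[1] & 0xFF); // FE FF

        System.out.println(encodedLength("hello", StandardCharsets.UTF_16));   // 12
        System.out.println(encodedLength("hello", StandardCharsets.UTF_16BE)); // 10
        System.out.println(encodedLength("hello", StandardCharsets.UTF_16LE)); // 10
    }
}
```

If the 10-byte result is what you want, encoding with UTF-16BE or UTF-16LE avoids the BOM, because the byte order is already part of the charset name.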

For more information, read the Unicode FAQs on "UTF-8, UTF-16, UTF-32 & BOM".

3

I expected that the first println should get "10" as I mentioned. But why 5?

You are calling length() on the String, not on the byte[], so you get the length of the String in characters. Strictly speaking, length() counts char values, that is, UTF-16 code units. That equals the number of characters as long as we stay in the Unicode Basic Multilingual Plane; a character outside it needs two code units (a surrogate pair) even in UTF-16.

Once you have a String, it does not matter what encoding was used to create it. length() is always given in terms of char values (UTF-16 code units), never bytes.
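The difference between char values and characters only shows up outside the Basic Multilingual Plane. Here is a short sketch using U+1F600 as an arbitrary example of a supplementary character; the class name LengthDemo and the helper characterCount are made up for the illustration.

```java
import java.nio.charset.StandardCharsets;

class LengthDemo {
    // Hypothetical helper: number of Unicode code points (characters) in s.
    static int characterCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bmp = "hello";
        String supplementary = "\uD83D\uDE00"; // U+1F600, written as a surrogate pair

        System.out.println(bmp.length());                  // 5
        System.out.println(characterCount(bmp));           // 5

        // One character, but two UTF-16 code units.
        System.out.println(supplementary.length());        // 2
        System.out.println(characterCount(supplementary)); // 1

        // Four bytes in UTF-16BE (no BOM): two code units of two bytes each.
        System.out.println(supplementary.getBytes(StandardCharsets.UTF_16BE).length); // 4
    }
}
```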

If you converted this into a byte[] using UTF-16, you might rightfully have expected 10 (five characters times two bytes each). It actually ends up being 12 because a Byte Order Mark is included.
