Java string "hello" has 12 bytes when getBytes("UTF-16")?

Question

I expected that, when a Java string is encoded as "UTF-16", each character uses 2 bytes, so "hello" should take 10 bytes, but this code:

    String h = "hello";
    System.out.println(new String(h.getBytes("UTF-16"), "UTF-16").length());
    System.out.println(new String(h.getBytes("UTF-8"), "UTF-8").getBytes("UTF-16").length);

prints "5" and then "12".

My questions:

(1) I expected that the first println should get "10" as I mentioned. But why 5?

(2) For the second println, I am trying to getBytes for it first as "UTF-8" then as "UTF-16". I suppose it should also be 10. But actually it's 12.

I'm using a Mac and my region is Hong Kong. Could you explain what's happening in the program, and how "5 12" actually comes out?

Thanks a lot!

Stephen C · Accepted Answer · 2018-12-01 06:56:35Z

(1) I expected that the first println should get "10" as I mentioned. But why 5?

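You take a 5-character string and encode it as bytes using UTF-16.
Then you create a new string by decoding those bytes (correctly) from UTF-16, which gives you a new string consisting of your original 5 characters again. Calling length() on that string returns 5, because it counts the char values in the string, not the bytes in the encoded array.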

(2) For the second println, I am trying to getBytes for it first as "UTF-8" then as "UTF-16". I suppose it should also be 10. But actually it's 12.

This part of the code:

    new String(h.getBytes("UTF-8"), "UTF-8")

is actually a no-op. It is just a rather expensive way to copy a string. You encode the string to bytes using UTF-8 as the encoding scheme, and then you create a new string by decoding the UTF-8 encoded bytes.

So effectively, you are doing this:

    "hello".getBytes("UTF-16").length

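The reason for the extra 2 bytes is that Java's "UTF-16" charset writes a BOM (byte order mark) as the first (2 byte) code unit when encoding: the bytes 0xFE 0xFF, which indicate big-endian order. The "UTF-16BE" and "UTF-16LE" charsets do not write a BOM, so they would give 10 bytes for this string.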
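To see the BOM directly, it may help to dump the encoded bytes in hex. The following is only a minimal sketch: the class name `BomDemo` and the helper `utf16Hex` are made up for illustration, and it uses `StandardCharsets.UTF_16` rather than the string name "UTF-16" to avoid the checked exception.

```java
import java.nio.charset.StandardCharsets;

public class BomDemo {
    // Hypothetical helper: encodes the text with Java's "UTF-16" charset
    // and returns the resulting bytes as space-separated hex pairs.
    static String utf16Hex(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_16);
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        return hex.toString().trim();
    }

    public static void main(String[] args) {
        // The first two bytes (FE FF) are the BOM; the remaining ten are
        // the five characters, two bytes each, in big-endian order.
        System.out.println(utf16Hex("hello"));
        // prints: FE FF 00 68 00 65 00 6C 00 6C 00 6F
    }
}
```

Counting the pairs gives 12 bytes: 2 for the BOM plus 10 for the characters.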

For more information, read the Unicode FAQs on "UTF-8, UTF-16, UTF-32 & BOM".

Thilo · Accepted Answer · 2018-12-01 07:14:07Z


I expected that the first println should get "10" as I mentioned. But why 5?

You are calling length() on the String, not on the byte[]. So this will give you the length of the String in char values, that is, UTF-16 code units. That matches the number of characters as long as we are staying in the Unicode Basic Multilingual Plane -- it unfortunately breaks down for characters outside it, which need two code units (a surrogate pair) even in UTF-16 and therefore count as 2.

Once you have a String, it does not matter what encoding was used to create it. length() is always given in terms of char values, never bytes.
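The difference between char values and characters can be illustrated with a string outside the Basic Multilingual Plane. This is a rough sketch with hypothetical names (`LengthDemo`, `charCount`, `codePointCount`); the emoji U+1F600 is chosen only as an example of a supplementary character.

```java
public class LengthDemo {
    // Number of char values (UTF-16 code units) in the string.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bmp = "hello";
        // U+1F600 written as its surrogate pair: one code point, two chars.
        String supplementary = "\uD83D\uDE00";

        System.out.println(charCount(bmp));                 // prints: 5
        System.out.println(codePointCount(bmp));            // prints: 5
        System.out.println(charCount(supplementary));       // prints: 2
        System.out.println(codePointCount(supplementary));  // prints: 1
    }
}
```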

If you converted this into a byte[] using UTF-16, you might rightfully have expected 10 (for the five characters times two bytes each) -- that it actually ends up being 12 is due to a Byte Order Mark being included.
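If the goal is to get the expected 10 bytes, one option is to name the byte order explicitly, since Java's UTF-16BE and UTF-16LE charsets do not write a BOM when encoding. A small sketch, with `Utf16Lengths` and `byteLength` as made-up names for illustration:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf16Lengths {
    // Hypothetical helper: number of bytes produced when encoding s with cs.
    static int byteLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String h = "hello";
        // "UTF-16" prepends a 2-byte BOM when encoding.
        System.out.println(byteLength(h, StandardCharsets.UTF_16));   // prints: 12
        // The explicit byte-order variants encode the characters only.
        System.out.println(byteLength(h, StandardCharsets.UTF_16BE)); // prints: 10
        System.out.println(byteLength(h, StandardCharsets.UTF_16LE)); // prints: 10
    }
}
```

Whether dropping the BOM is appropriate depends on who reads the bytes: without it, the reader has to know the byte order some other way.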

edited Dec 1, 2018 at 7:14

answered Dec 1, 2018 at 7:07
