
I have a simple test:

@Test
public void utf16SizeTest() throws Exception {
    final String test = "п";
    // 'п' = U+043F according to unicode table
    // 43F to binary = 0100 0011 1111 (length is 11)
    // ADD '0' so length should be = 16
    // 0000 0100 0011 1111
    // 00000100(2) 00111111(2)
    //    4(10)  63(10)
    final byte[] bytes = test.getBytes("UTF-16");
    for (byte aByte : bytes) {
        System.out.println(aByte);
    }
}

As you can see, I first convert 'п' to binary and then pad with leading zero bits until the length is 16.

I expect the output to be 4, 63.

But the actual output was:

-2
-1
4
63

What am I doing wrong?


1 Answer


If you try:

final String test = "ппп";

you will find that -2 -1 appears only at the beginning:

-2
-1
4
63
4
63
4
63

-2 is 0xFE and -1 is 0xFF. Together, they form a BOM (byte order mark):

In UTF-16, a BOM (U+FEFF) may be placed as the first character of a file or character stream to indicate the endianness (byte order) of all the 16-bit code units of the file or stream. If an attempt is made to read this stream with the wrong endianness, the bytes will be swapped, thus delivering the character U+FFFE, which is defined by Unicode as a "non character" that should never appear in the text.

test.getBytes("UTF-16") defaults to big-endian byte order when encoding, so a BOM is written at the front so that later processors know big-endian was used.
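To make this concrete, here is a minimal sketch (the class name `BomDemo` is mine) that prints the bytes from the question and notes which ones belong to the BOM:

```java
import java.io.UnsupportedEncodingException;

public class BomDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        byte[] bytes = "п".getBytes("UTF-16");
        // First two bytes: the big-endian BOM 0xFE 0xFF (-2, -1 as signed bytes)
        // Last two bytes: U+043F itself, 0x04 0x3F (4, 63)
        for (byte b : bytes) {
            System.out.println(b);
        }
        // prints: -2 -1 4 63 (one per line)
    }
}
```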

You can explicitly specify the endianness by using UTF-16LE or UTF-16BE instead, which avoids writing a BOM:

final byte[] bytes = test.getBytes("UTF-16BE");

The UTF-16 charsets use sixteen-bit quantities and are therefore sensitive to byte order. In these encodings the byte order of a stream may be indicated by an initial byte-order mark represented by the Unicode character '\uFEFF'. Byte-order marks are handled as follows:

  • When decoding, the UTF-16BE and UTF-16LE charsets interpret the initial byte-order marks as a ZERO-WIDTH NON-BREAKING SPACE; when encoding, they do not write byte-order marks.

  • When decoding, the UTF-16 charset interprets the byte-order mark at the beginning of the input stream to indicate the byte-order of the stream but defaults to big-endian if there is no byte-order mark; when encoding, it uses big-endian byte order and writes a big-endian byte-order mark.
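The quoted behavior can be verified with a small sketch (class name `EndianDemo` is mine): neither UTF-16BE nor UTF-16LE writes a BOM, and UTF-16LE swaps the byte order:

```java
public class EndianDemo {
    public static void main(String[] args) throws Exception {
        // UTF-16BE: fixed big-endian order, no BOM written
        byte[] be = "п".getBytes("UTF-16BE");    // 0x04 0x3F
        // UTF-16LE: fixed little-endian order, no BOM, bytes swapped
        byte[] le = "п".getBytes("UTF-16LE");    // 0x3F 0x04
        System.out.println(be.length);           // 2 (no BOM)
        System.out.println(be[0] + " " + be[1]); // 4 63
        System.out.println(le[0] + " " + le[1]); // 63 4
    }
}
```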


1 Comment

I'd like to highlight the fact that starting from Java 7, there is a class called java.nio.charset.StandardCharsets that provides a constant for each commonly used charset. It prevents typos in the charsetName. The charsets exposed are: US-ASCII, ISO-8859-1, UTF-8, UTF-16BE, UTF-16LE, and UTF-16.
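As a quick illustration of that comment (class name `CharsetConstantsDemo` is mine), the Charset overload of getBytes also drops the checked UnsupportedEncodingException:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharsetConstantsDemo {
    public static void main(String[] args) {
        // StandardCharsets.UTF_16BE is a compile-time constant, so a typo
        // in the charset name becomes a compile error rather than a
        // runtime UnsupportedEncodingException.
        byte[] bytes = "п".getBytes(StandardCharsets.UTF_16BE);
        System.out.println(Arrays.toString(bytes)); // [4, 63]
    }
}
```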
