
I have a simple test:

@Test
public void utf16SizeTest() throws Exception {
    final String test = "п";
    // 'п' = U+043F according to unicode table
    // 43F to binary = 0100 0011 1111 (length is 11)
    // ADD '0' so length should be = 16
    // 0000 0100 0011 1111
    // 00000100(2) 00111111(2)
    //    4(10)  63(10)
    final byte[] bytes = test.getBytes("UTF-16");
    for (byte aByte : bytes) {
        System.out.println(aByte);
    }
}

As you can see, I first convert 'п' to binary and then pad with leading zero bits until the length is 16.

I expect the output to be 4, 63.

But the actual output was:

-2
-1
4
63

What am I doing wrong?


1 Answer


If you try:

final String test = "ппп";

you will find that -2 -1 appears only at the beginning:

-2
-1
4
63
4
63
4
63

-2 is 0xFE and -1 is 0xFF. Together, they form a BOM (byte order mark):

In UTF-16, a BOM (U+FEFF) may be placed as the first character of a file or character stream to indicate the endianness (byte order) of all the 16-bit code units of the file or stream. If an attempt is made to read this stream with the wrong endianness, the bytes will be swapped, thus delivering the character U+FFFE, which is defined by Unicode as a "non character" that should never appear in the text.

test.getBytes("UTF-16") defaults to big-endian byte order when encoding, so a BOM is written at the front so that later processors know big-endian was used.
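To make this concrete, here is a minimal sketch (the class name `BomDemo` is mine) that prints the bytes from the question and notes which ones belong to the BOM:

```java
import java.io.UnsupportedEncodingException;

public class BomDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        byte[] bytes = "п".getBytes("UTF-16");
        // First two bytes: the big-endian BOM 0xFE 0xFF (-2, -1 as signed bytes)
        // Last two bytes: U+043F itself, 0x04 0x3F (4, 63)
        for (byte b : bytes) {
            System.out.println(b);
        }
        // prints: -2 -1 4 63 (one per line)
    }
}
```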

You can explicitly specify the endianness by using UTF-16LE or UTF-16BE instead, which avoids writing a BOM:

final byte[] bytes = test.getBytes("UTF-16BE");

The UTF-16 charsets use sixteen-bit quantities and are therefore sensitive to byte order. In these encodings the byte order of a stream may be indicated by an initial byte-order mark represented by the Unicode character '\uFEFF'. Byte-order marks are handled as follows:

  • When decoding, the UTF-16BE and UTF-16LE charsets interpret the initial byte-order marks as a ZERO-WIDTH NON-BREAKING SPACE; when encoding, they do not write byte-order marks.

  • When decoding, the UTF-16 charset interprets the byte-order mark at the beginning of the input stream to indicate the byte-order of the stream but defaults to big-endian if there is no byte-order mark; when encoding, it uses big-endian byte order and writes a big-endian byte-order mark.
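The quoted behavior can be verified with a small sketch (class name `EndianDemo` is mine): neither UTF-16BE nor UTF-16LE writes a BOM, and UTF-16LE swaps the byte order:

```java
public class EndianDemo {
    public static void main(String[] args) throws Exception {
        // UTF-16BE: fixed big-endian order, no BOM written
        byte[] be = "п".getBytes("UTF-16BE");    // 0x04 0x3F
        // UTF-16LE: fixed little-endian order, no BOM, bytes swapped
        byte[] le = "п".getBytes("UTF-16LE");    // 0x3F 0x04
        System.out.println(be.length);           // 2 (no BOM)
        System.out.println(be[0] + " " + be[1]); // 4 63
        System.out.println(le[0] + " " + le[1]); // 63 4
    }
}
```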


1 Comment

I'd like to highlight the fact that starting from Java 7, there is a class called java.nio.charset.StandardCharsets that provides a constant for each commonly used charset. It prevents typos in the charsetName. The charsets exposed are: US-ASCII, ISO-8859-1, UTF-8, UTF-16BE, UTF-16LE, and UTF-16.
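As a quick illustration of that comment (class name `CharsetConstantsDemo` is mine), the Charset overload of getBytes also drops the checked UnsupportedEncodingException:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharsetConstantsDemo {
    public static void main(String[] args) {
        // StandardCharsets.UTF_16BE is a compile-time constant, so a typo
        // in the charset name becomes a compile error rather than a
        // runtime UnsupportedEncodingException.
        byte[] bytes = "п".getBytes(StandardCharsets.UTF_16BE);
        System.out.println(Arrays.toString(bytes)); // [4, 63]
    }
}
```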
