I do not understand why this code does not output the same thing for both calls. I thought Java automatically figured out the encoding of a string?

public static void main (String[] args) {
    try {
        displayStringAsHex("A B C \u03A9".getBytes("UTF-8"));
        System.out.println ("");
        displayStringAsHex("A B C \u03A9".getBytes("UTF-16"));
    } catch (UnsupportedEncodingException ex) {
        ex.printStackTrace();
    }
}

/** 
 * I got part of this from: http://rgagnon.com/javadetails/java-0596.html
 */
public static void displayStringAsHex(byte[] raw ) {
    String HEXES = "0123456789ABCDEF";
    System.out.println("raw = " + new String(raw));
    final StringBuilder hex = new StringBuilder( 2 * raw.length );
    for ( final byte b : raw ) {
      hex.append(HEXES.charAt((b & 0xF0) >> 4))
         .append(HEXES.charAt((b & 0x0F))).append(" ");
    }
    System.out.println ("hex.toString() = "+ hex.toString());
}

outputs:

(UTF-8)
hex.toString() = 41 20 42 20 43 20 CE A9 

(UTF-16)
hex.toString() = FE FF 00 41 00 20 00 42 00 20 00 43 00 20 03 A9

I cannot display the character output, but the UTF-8 version looks correct. The UTF-16 version has several squares and blocks.

Why don't they look the same?

  • Why would they output the same thing? UTF-8 and UTF-16 are two completely different encoding schemes. And this has nothing to do with "Java automatically figuring out the encoding". It's a matter of whether whatever you're using to display that encoded text can figure out the encoding or not. Commented Apr 5, 2014 at 4:39
  • Actually, the two outputs are closely related: the UTF-8 byte pattern 41 20 42 20 43 20 reappears in the UTF-16 output, with each byte preceded by 00, because UTF-16 uses two bytes for each of these characters where UTF-8 uses one. Perhaps the answer to this question may help: stackoverflow.com/questions/4655250/… Commented Apr 5, 2014 at 4:43

1 Answer

Java does not automatically figure out the encoding of a string.

The String(byte[]) constructor

constructs a new String by decoding the specified array of bytes using the platform's default charset.

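In your case, the UTF-16 bytes are being decoded with the platform's default charset (apparently UTF-8 on your system), so you end up with garbage. To rebuild the String, pass the charset the bytes were encoded with: new String(raw, Charset.forName("UTF-16")), or new String(raw, StandardCharsets.UTF_16) on Java 7 and later.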
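
To illustrate, here is a minimal sketch of encoding and decoding with an explicit charset on both sides. It assumes Java 7 or later for StandardCharsets; the class name CharsetRoundTrip and the helpers toHex and roundTrip are made up for this example and are not part of the question's code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: encode and decode with an explicit charset.
class CharsetRoundTrip {

    /** Formats each byte as two uppercase hex digits followed by a space. */
    static String toHex(byte[] raw) {
        StringBuilder hex = new StringBuilder(3 * raw.length);
        for (byte b : raw) {
            // Mask to 0..255 so negative bytes print as two hex digits.
            hex.append(String.format("%02X ", b & 0xFF));
        }
        return hex.toString();
    }

    /** Encodes the text with the given charset, then decodes it with the same charset. */
    static String roundTrip(String text, Charset charset) {
        byte[] raw = text.getBytes(charset);
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        String text = "A B C \u03A9";
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16);

        System.out.println("UTF-8  hex = " + toHex(utf8));
        System.out.println("UTF-16 hex = " + toHex(utf16));

        // Decoding with the matching charset recovers the original text.
        System.out.println(text.equals(roundTrip(text, StandardCharsets.UTF_8)));
        System.out.println(text.equals(roundTrip(text, StandardCharsets.UTF_16)));

        // Decoding UTF-16 bytes as UTF-8 does not recover the original text.
        System.out.println(text.equals(new String(utf16, StandardCharsets.UTF_8)));
    }
}
```

The byte arrays differ because the two encodings differ, but each one decodes back to the same String as long as the decoder is told which charset produced it. The leading FE FF in the UTF-16 output is the byte order mark that Java's "UTF-16" charset writes when encoding.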
