What encoding does Java use to create a String from given Unicode data?

Question

I am puzzled about why I should not encode Unicode text with UTF-8 before a comparison when the other text (the one to compare against) has been encoded with UTF-8.

I wanted to compare a text (アクセス拒否, which means "Access denied") stored in an external file encoded as UTF-8 with a constant string stored in a .java file as

public static final String ACCESS_DENIED_IN_JAPANESE = "\u30a2\u30af\u30bb\u30b9\u62d2\u5426"; // means Access denied

The .java file itself was encoded as Cp1252.

I read the external file as an input stream using the code below. Note that I am using UTF-8 to decode the bytes.

InputStream in = new FileInputStream("F:\\sample.txt");
int b1;
byte[] bytes = new byte[4096];
int i = 0;
while (true) {
    b1 = in.read();
    if (b1 == -1)
        break;
    bytes[i++] = (byte) b1;
}

String japTextFromFile = new String(bytes, 0, i, Charset.forName("UTF-8"));

Now when I compare them as

System.out.println(ACCESS_DENIED_IN_JAPANESE.equals(japTextFromFile));  // result is `true` , and works fine

But when I encode ACCESS_DENIED_IN_JAPANESE with UTF-8 and compare it with japTextFromFile, the result is false. The code is

String encodedAccessDenied = new String(ACCESS_DENIED_IN_JAPANESE.getBytes(), Charset.forName("UTF-8"));

System.out.println(encodedAccessDenied.equals(japTextFromFile));  // result is `false`

So my question is why the comparison above fails when both strings are the same and have been encoded with UTF-8. I expected the result to be true.

However, in the first case I compared strings with different encodings, one in UTF-16 (Java's default way of encoding a String) and the other in UTF-8, and the result is true. I think it should be false because the encodings differ, even though the text we read is the same.

Where am I wrong in my understanding? Any clarification is greatly appreciated.

What is your default character set? What do you think ACCESS_DENIED_IN_JAPANESE.getBytes() does?
– Sotirios Delimanolis, Commented Oct 1, 2015 at 19:08
@Sotirios Delimanolis: I am not sure of the default character set; I need to check on my office workstation. getBytes() returns a byte array encoded with the Java platform's default charset (as explained in the Javadoc).
– fiberair, Commented Oct 1, 2015 at 21:06
If you're using Java 7 or later, you might want to consider the handy Files.readAllLines() method, which takes a Path rather than a String, for example: Files.readAllLines(Paths.get("F:\\sample.txt"), Charset.forName("UTF-8")).get(0)
– Alastair McCormack, Commented Oct 2, 2015 at 17:32

Jonathan · Accepted Answer · 2015-10-01 19:12:09Z

2

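ACCESS_DENIED_IN_JAPANESE.getBytes() does not use UTF-8. It uses your platform's default charset. But then you use UTF-8 to turn those bytes back into a String. This gets you a different String from the one you started with. If the default is a charset such as Cp1252, which cannot represent these Japanese characters, each of them is replaced by '?' during encoding, so the original text is already lost before decoding.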

Try this:

String encodedAccessDenied = new String(ACCESS_DENIED_IN_JAPANESE.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);

System.out.println(encodedAccessDenied.equals(japTextFromFile));  // result is `true`
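
To make the difference concrete, here is a minimal sketch that encodes the constant with one charset and decodes it with another. The class and method names are hypothetical, and ISO-8859-1 is only an assumed stand-in for a Western platform default such as Cp1252; the asker's actual default charset is not known.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultCharsetDemo {
    static final String ACCESS_DENIED_IN_JAPANESE = "\u30a2\u30af\u30bb\u30b9\u62d2\u5426";

    // Encodes text to bytes with one charset, then decodes those bytes with another.
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        return new String(text.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        // Same charset on both sides: the text survives unchanged.
        String same = roundTrip(ACCESS_DENIED_IN_JAPANESE,
                StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(same.equals(ACCESS_DENIED_IN_JAPANESE)); // true

        // Assumption: ISO-8859-1 stands in for the platform default.
        // It cannot represent these characters, so getBytes substitutes '?' for each.
        String mangled = roundTrip(ACCESS_DENIED_IN_JAPANESE,
                StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(mangled);                                   // ??????
        System.out.println(mangled.equals(ACCESS_DENIED_IN_JAPANESE)); // false
    }
}
```

The damage happens in the encoding step: once the characters have been replaced by '?', no choice of charset for decoding can bring them back.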


2 Comments

Sotirios Delimanolis Over a year ago

Note that doing this new String(utf8Bytes, utf8charset) dance is basically a no-op.
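
A small sketch of why that round trip is a no-op, and why the first comparison in the question is true: a String holds characters, not bytes in some encoding, so the charset only matters at the moment bytes are converted to or from a String. The class and method names below are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class StringIdentityDemo {
    // Encodes to UTF-8 bytes and decodes them again with the same charset.
    static String viaUtf8(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    // Encodes to UTF-16 (big-endian, no byte order mark) and decodes again.
    static String viaUtf16(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_16BE), StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        String text = "\u30a2\u30af\u30bb\u30b9\u62d2\u5426";

        // The byte representations differ in length and content...
        System.out.println(text.getBytes(StandardCharsets.UTF_8).length);    // 18
        System.out.println(text.getBytes(StandardCharsets.UTF_16BE).length); // 12

        // ...but the decoded Strings are equal to each other and to the original.
        System.out.println(viaUtf8(text).equals(viaUtf16(text))); // true
        System.out.println(viaUtf8(text).equals(text));           // true
    }
}
```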

fiberair Over a year ago

@Jonathan: Yes, I get your point. Encoding explicitly with UTF-8 through String.getBytes("UTF-8") might produce the expected output. I need to test this. Thanks!

krzydyn · Accepted Answer · 2015-10-01 19:19:32Z

0

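The best way I know is to put all static texts into a text file encoded as UTF-8, and then read those resources with a reader that decodes UTF-8 explicitly: an InputStreamReader created with the "UTF-8" charset, or FileReader with a charset argument on Java 11 and later (the older FileReader constructors always use the platform default charset).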
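
One possible way to read such a file with an explicit charset is sketched below using Files.newBufferedReader, which is available since Java 7. The class name, the helper names, and the temporary file are assumptions made for illustration; they do not come from the answer.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8ResourceReader {
    // Hypothetical helper: reads the first line of a file, decoding it as UTF-8.
    static String readFirstLine(Path file) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            return reader.readLine();
        }
    }

    // Demonstration only: writes text to a temporary UTF-8 file and reads it back.
    static String writeThenRead(String text) {
        try {
            Path tmp = Files.createTempFile("sample", ".txt");
            try {
                Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));
                return readFirstLine(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String expected = "\u30a2\u30af\u30bb\u30b9\u62d2\u5426";
        System.out.println(expected.equals(writeThenRead(expected))); // true
    }
}
```

Note that Java's UTF-8 decoder does not strip a leading byte order mark, so a file saved by an editor that writes one would yield a first line starting with U+FEFF and the comparison would fail unless that character is removed.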
