
I am quite perplexed as to why I should not encode Unicode text as UTF-8 for comparison when the other text (the one to compare against) has been encoded as UTF-8.

I wanted to compare a text (アクセス拒否, which means "Access denied") stored in an external file encoded as UTF-8 with a constant string stored in a .java file as

public static final String ACCESS_DENIED_IN_JAPANESE = "\u30a2\u30af\u30bb\u30b9\u62d2\u5426"; // means Access denied 

The java file was encoded as Cp1252.

I read the file as an input stream using the code below. Note that I am using UTF-8 as the encoding.

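        byte[] bytes = new byte[4096];
        int i = 0;
        // try-with-resources closes the stream even if reading fails
        try (InputStream in = new FileInputStream("F:\\sample.txt")) {
            int b1;
            // read one byte at a time until end of stream or until the buffer is full
            while (i < bytes.length && (b1 = in.read()) != -1) {
                bytes[i++] = (byte) b1;
            }
        }

        // decode the bytes that were read as UTF-8
        String japTextFromFile = new String(bytes, 0, i, Charset.forName("UTF-8"));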

Now when I compare as

System.out.println(ACCESS_DENIED_IN_JAPANESE.equals(japTextFromFile));  // result is `true` , and works fine

But when I encode ACCESS_DENIED_IN_JAPANESE with UTF-8 and compare it with japTextFromFile, the result is false. The code is

String encodedAccessDenied = new String(ACCESS_DENIED_IN_JAPANESE.getBytes(), Charset.forName("UTF-8"));

System.out.println(encodedAccessDenied.equals(japTextFromFile));  // result is `false`

So my question is why the above comparison fails when both strings are the same and have been encoded with UTF-8. The result should be true.

However, in the first case, when I compared strings with different encodings, one in UTF-16 (Java's default way of encoding strings) and the other in UTF-8, the result was true. I think it should be false because the encodings differ, even though the text we read is the same.

Where am I wrong in my understanding? Any clarification is greatly appreciated.

  • What is your default character set? What do you think ACCESS_DENIED_IN_JAPANESE.getBytes() does? Commented Oct 1, 2015 at 19:08
  • @Sotirios Delimanolis: I need to check the default character set on my office workstation, so I am not sure. getBytes() will return a byte array encoded with the Java platform's default charset (as explained in the Javadoc). Commented Oct 1, 2015 at 21:06
  • Do you have a line feed in sample.txt? Commented Oct 2, 2015 at 17:24
  • If you're using Java 7, you might want to consider using the super handy Files.readAllLines() method, which takes a Path rather than a String, like: Files.readAllLines(Paths.get("F:\\sample.txt"), Charset.forName("UTF-8")).get(0) Commented Oct 2, 2015 at 17:32

2 Answers


ACCESS_DENIED_IN_JAPANESE.getBytes() does not use UTF-8. It uses your platform's default charset. But then you use UTF-8 to turn those bytes back into a String. This gets you a different String to the one you started with.

Try this:

String encodedAccessDenied = new String(ACCESS_DENIED_IN_JAPANESE.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);

System.out.println(encodedAccessDenied.equals(japTextFromFile));  // result is `true`
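
To make the mismatch concrete, here is a minimal sketch of the two round trips. The class name DefaultCharsetPitfall and the helper roundTrip are made up for illustration, and ISO-8859-1 is only an assumed stand-in for a single-byte platform default such as Cp1252; the real default on the asker's machine is not known.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultCharsetPitfall {
    // Same constant as in the question.
    static final String ACCESS_DENIED_IN_JAPANESE = "\u30a2\u30af\u30bb\u30b9\u62d2\u5426";

    // Hypothetical helper: encodes text with one charset, then decodes the bytes with another.
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        return new String(text.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        // Assumption: ISO-8859-1 stands in for a single-byte default charset like Cp1252.
        Charset assumedDefault = StandardCharsets.ISO_8859_1;

        // The Japanese characters cannot be represented in the single-byte charset,
        // so they are replaced during encoding and the original text is lost.
        String mismatched = roundTrip(ACCESS_DENIED_IN_JAPANESE, assumedDefault, StandardCharsets.UTF_8);

        // Encoding and decoding with the same charset that can represent the text is lossless.
        String matched = roundTrip(ACCESS_DENIED_IN_JAPANESE, StandardCharsets.UTF_8, StandardCharsets.UTF_8);

        System.out.println(mismatched.equals(ACCESS_DENIED_IN_JAPANESE)); // false
        System.out.println(matched.equals(ACCESS_DENIED_IN_JAPANESE));    // true

        // The charset that getBytes() without arguments would use on this machine.
        System.out.println(Charset.defaultCharset());
    }
}
```

Printing Charset.defaultCharset() on the affected machine is a quick way to answer the question raised in the comments about which charset getBytes() actually used.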

2 Comments

Note that doing this new String(utf8Bytes, utf8charset) dance is basically a no-op.
@Jonathan: Yes, I get your point. Simply encoding with UTF-8 through String.getBytes("UTF-8") might produce the expected output. I need to test this. Thanks!
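
A short sketch of why the round trip is a no-op and why the first comparison in the question returned true: a charset only matters while converting between bytes and a String, and equals() compares the decoded characters. The class and method names below are hypothetical, and UTF-16BE is picked only as an example of a second encoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class StringsHaveNoEncoding {
    static final String TEXT = "\u30a2\u30af\u30bb\u30b9\u62d2\u5426";

    // Decodes the same text from two different byte encodings and compares the Strings.
    static boolean decodedStringsAreEqual() {
        byte[] utf8 = TEXT.getBytes(StandardCharsets.UTF_8);
        byte[] utf16 = TEXT.getBytes(StandardCharsets.UTF_16BE);
        String fromUtf8 = new String(utf8, StandardCharsets.UTF_8);
        String fromUtf16 = new String(utf16, StandardCharsets.UTF_16BE);
        return fromUtf8.equals(fromUtf16);
    }

    // Compares the raw bytes of the two encodings, which do differ.
    static boolean encodedBytesAreEqual() {
        return Arrays.equals(
                TEXT.getBytes(StandardCharsets.UTF_8),
                TEXT.getBytes(StandardCharsets.UTF_16BE));
    }

    public static void main(String[] args) {
        System.out.println(decodedStringsAreEqual()); // true
        System.out.println(encodedBytesAreEqual());   // false
        System.out.println(TEXT.getBytes(StandardCharsets.UTF_8).length);    // 18
        System.out.println(TEXT.getBytes(StandardCharsets.UTF_16BE).length); // 12
    }
}
```

Once the bytes have been decoded correctly, the resulting Strings are equal regardless of which encoding the bytes were in.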

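The best way I know is to put all static texts into a text file encoded as UTF-8 and then read those resources with a reader whose encoding is set to "UTF-8". Note that FileReader only accepts a charset argument from Java 11 onward; on older versions, wrap a FileInputStream in an InputStreamReader and pass the charset there.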
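
As a rough sketch of that approach on Java 7 or later, the following reads the first line of a UTF-8 file through Files.newBufferedReader with an explicit charset. The class name Utf8ResourceReader, the helper methods, and the temporary file are all made up for the example; the real resource path would replace them.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8ResourceReader {
    // Reads the first line of a file, decoding its bytes explicitly as UTF-8.
    static String readFirstLine(Path file) {
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            return reader.readLine(); // readLine() drops the trailing line terminator
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes the sample text to a temporary UTF-8 file and checks that it reads back unchanged.
    static boolean demo() {
        String expected = "\u30a2\u30af\u30bb\u30b9\u62d2\u5426";
        try {
            Path tmp = Files.createTempFile("sample", ".txt");
            try {
                Files.write(tmp, (expected + "\n").getBytes(StandardCharsets.UTF_8));
                return expected.equals(readFirstLine(tmp));
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // true
    }
}
```

Reading line by line also sidesteps a stray line feed at the end of the file, which would otherwise make an equals() comparison fail.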
