3

I open a file with Notepad, write "ą" in it, then save and close it.

I try to read this file in two ways

First:

        InputStream inputStream = Files.newInputStream(Paths.get("file.txt"));
        int result = inputStream.read();
        System.out.println(result);
        System.out.println((char) result);

196 Ä

Second:

        InputStream inputStream = Files.newInputStream(Paths.get("file.txt"));
        Reader reader = new InputStreamReader(inputStream);
        int result = reader.read();
        System.out.println(result);
        System.out.println((char) result);

261 ą

Questions: 1) In binary mode, this letter is saved as 196? Why not as 261? 2) This letter is saved as 196 in which encoding?

I try to understand why there are differences

2
  • 1
    In which encoding did you save your one character file in Notepad? Commented Aug 3, 2018 at 11:02
  • Notepad++ shows "UTF-8 without BOM", but I don't see the letter 'ą' with code 196 in the UTF-8 table. Commented Aug 3, 2018 at 11:06

4 Answers

4

UTF-8 encodes values from the range U+0080 - U+07FF as two bytes in the form 110xxxxx 10xxxxxx (more at wiki). So only the xxxxx xxxxxx positions, 11 bits in total, are available for the value.

ą is indexed as U+0105 where 0105 is hexadecimal value (as decimal it is 261). As binary it can be represented as

      01       05    (hex)
00000001 00000101    (bin)
     xxx xxxxxxxx <- values for U+0080 - U+07FF range encode only those bits

     001 00000101 <- which means `x` will be replaced by only this part 

So UTF-8 encoding will add 110xxxxx 10xxxxxx mask which means it will combine

110xxxxx 10xxxxxx
   00100   000101

into (two bytes):

11000100 10000101
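
The bit packing above can be reproduced in a few lines of Java. This is only a minimal sketch for the two-byte range; the class name `Utf8TwoByteSketch` and the method `encodeTwoByte` are made up for illustration, and the result is compared against what `String.getBytes` produces.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8TwoByteSketch {
    // Hypothetical helper: packs a code point from U+0080..U+07FF
    // into the 110xxxxx 10xxxxxx pattern.
    static byte[] encodeTwoByte(int codePoint) {
        if (codePoint < 0x80 || codePoint > 0x7FF) {
            throw new IllegalArgumentException("not in the two-byte range");
        }
        int high = 0b11000000 | (codePoint >> 6);         // top 5 of the 11 bits
        int low  = 0b10000000 | (codePoint & 0b00111111); // low 6 bits
        return new byte[] { (byte) high, (byte) low };
    }

    public static void main(String[] args) {
        byte[] manual = encodeTwoByte(0x0105); // U+0105 is ą
        byte[] library = "\u0105".getBytes(StandardCharsets.UTF_8);
        // Mask with 0xFF to print the bytes as unsigned values.
        System.out.println((manual[0] & 0xFF) + " " + (manual[1] & 0xFF)); // 196 133
        System.out.println(Arrays.equals(manual, library));                // true
    }
}
```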

Now, InputStream reads data as raw bytes. So when you call inputStream.read() the first time, you get 11000100, which is 196 in decimal. Calling inputStream.read() a second time would return 10000101, which is 133 in decimal.

Readers were introduced in Java 1.1 so we could avoid this kind of mess in our code. Instead, we can specify which encoding the Reader should use (or let it use the default one) to get properly decoded values, in this case 00000001 00000101 (without the mask), which is 0105 in hexadecimal and 261 in decimal.


In short

  • use Readers (with properly specified encoding) if you want to read data as text,
  • use Streams if you want to read data as raw bytes.
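
To see both cases side by side without depending on a file, here is a small sketch that feeds the two bytes C4 85 from memory. The class `StreamVsReaderSketch` and its method names are hypothetical; the point is only that the stream returns the first byte, while a Reader told to use UTF-8 returns the decoded character.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class StreamVsReaderSketch {
    // The two bytes that a UTF-8 file containing only "ą" would hold.
    static final byte[] FILE_BYTES = { (byte) 0xC4, (byte) 0x85 };

    // Raw byte access: returns the first byte as an unsigned value.
    static int firstByte() {
        try (InputStream in = new ByteArrayInputStream(FILE_BYTES)) {
            return in.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Text access: decodes the bytes as UTF-8 and returns the first char.
    static int firstChar() {
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(FILE_BYTES), StandardCharsets.UTF_8)) {
            return reader.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstByte()); // 196
        System.out.println(firstChar()); // 261
    }
}
```

Passing the charset explicitly keeps the result independent of the platform default.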

1

Because you read the same bytes using two different encodings. You can check which encoding a reader uses via InputStreamReader::getEncoding.

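import java.nio.charset.StandardCharsets;

String s = "ą";
// Encode explicitly as UTF-8 so the result does not depend on the platform default charset
byte[] bytes = s.getBytes(StandardCharsets.UTF_8);

// Decode the same bytes in two different ways
char iso_8859_1 = new String(bytes, StandardCharsets.ISO_8859_1).charAt(0);
char utf_8 = new String(bytes, StandardCharsets.UTF_8).charAt(0);

System.out.println((int) iso_8859_1 + " " + iso_8859_1);
System.out.println((int) utf_8 + " " + utf_8);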

The output is

196 Ä
261 ą


0

Try reading the file with a Reader that uses UTF-8 encoding, which matches the encoding used to write the file from Notepad++:

// Files.newBufferedReader(Path) decodes as UTF-8 by default
// try-with-resources closes the reader even if reading fails
try (BufferedReader in = Files.newBufferedReader(Paths.get("file.txt"))) {
    String str;
    if ((str = in.readLine()) != null) {
        System.out.println(str);
    }
}

I don't have an exact/reproducible answer for why you are seeing the output you see, but if you are reading with the wrong encoding, you won't necessarily see what you saved. For example, if the single character ą were encoded with two bytes, but you read as ASCII, then you might get back two characters, which would not match your original file.
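
A rough illustration of that point, assuming the file holds the UTF-8 bytes C4 85 for ą (the class `WrongCharsetSketch` and its `decode` helper are made-up names): decoding the same two bytes with a single-byte charset yields two characters instead of one.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WrongCharsetSketch {
    // Hypothetical helper: decodes raw bytes with the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        byte[] bytes = { (byte) 0xC4, (byte) 0x85 }; // "ą" encoded as UTF-8

        String utf8 = decode(bytes, StandardCharsets.UTF_8);
        String latin1 = decode(bytes, StandardCharsets.ISO_8859_1);
        String ascii = decode(bytes, StandardCharsets.US_ASCII);

        System.out.println(utf8.length());          // 1, the original character
        System.out.println(latin1.length());        // 2, one char per byte
        System.out.println((int) latin1.charAt(0)); // 196, i.e. Ä
        System.out.println(ascii.length());         // 2, neither byte is valid ASCII
    }
}
```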

2 Comments

The character ą encodes to two bytes (C4 85), with C4 being 196 in decimal. So they are indeed reading the first byte only.
@TiiJ7 Thanks for the detective work. I wouldn't have known how to figure this out, but the encoding problem is the first thing which jumped out at me :-)
0

You are getting the decimal value of a different Latin letter. You need to save the file with the UTF-8 encoding standard.

Make sure you read the file with the same encoding that was used to save it.

0x0105 261 LATIN SMALL LETTER A WITH OGONEK ą

0x00C4 196 LATIN CAPITAL LETTER A WITH DIAERESIS Ä

Refer to this: https://www.ssec.wisc.edu/~tomw/java/unicode.html

1 Comment

The OP already is cognizant of that. So, he doesn't mean that.
