
I would like to read a text file and print it to the console, so I wrote the code below:

File file = new File("G:\\text.txt");
FileReader fileReader = new FileReader(file);
String result = "";
int ascii = fileReader.read();

while (ascii != -1)
{
    result = result + (char) ascii;
    ascii = fileReader.read();
}
fileReader.close();
System.out.println(result);

Although I get the correct result most of the time, in some cases I get strange output. Suppose my text file contains this text:

Hello to every one

To create the text file I used Notepad, and when I change the encoding mode I get strange output from my code.

Ansi : Hello to every one

Unicode : ÿþh e l l o t o e v e r y o n e

Unicode big endian: þÿ h e l l o t o e v e r y o n e

UTF-8 : hello to every one

Why do I get this strange output? Is there a problem with my code, or is there some other reason?
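For reference, the same effect can be reproduced without Notepad by writing a UTF-16 file (BOM included) and then decoding each byte as if it were a whole character — a self-contained sketch using a temp file in place of G:\text.txt (the class and method names here are illustrative, not from the original code):

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class MojibakeDemo {
    // Decode each byte as its own character, mimicking a one-byte-per-char read.
    static String bytesAsChars(Path path) throws Exception {
        byte[] bytes = Files.readAllBytes(path);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append((char) (b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        Path path = Files.createTempFile("demo", ".txt");
        // Java's "UTF-16" charset writes a big-endian BOM (FE FF) before the text.
        Files.write(path, "hello".getBytes(StandardCharsets.UTF_16));
        // Prints þÿ followed by NUL-interleaved letters: þ ÿ \0 h \0 e ...
        System.out.println(bytesAsChars(path));
        Files.delete(path);
    }
}
```

The þÿ (or ÿþ for little-endian) prefix and the "gaps" between letters are exactly the pattern in the question's output.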

  • Because of the encoding mode? You already mentioned that it happens when you change the encoding mode. Commented Jun 23, 2015 at 6:08
  • @Gosu: yes, as you can see, when I changed the encoding mode I get different results Commented Jun 23, 2015 at 6:09
  • Use an InputStreamReader together with the correct encoding instead? Commented Jun 23, 2015 at 6:10
  • @ElyasHadizadeh What do you think different encodings are used for? If they all gave the same result, we'd only need a single encoding. You're also using the correct term (encoding) for the last one of your examples (UTF-8). Ansi is not an encoding, and the ones you term Unicode are actually UTF-16LE and UTF-16BE. Unicode is the charset; encodings are different ways of storing the characters as bytes. Commented Jun 23, 2015 at 6:14
  • @ElyasHadizadeh This is a pretty good read: joelonsoftware.com/articles/Unicode.html Commented Jun 23, 2015 at 6:45

1 Answer


Your file starts with a byte-order mark (U+FEFF). It should only occur in the first character of the file - it's not terribly widely used, but various Windows tools do include it, including Notepad. You can just strip it from the start of the first line.
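To see the mark for yourself, you can dump the first two bytes of the file — a minimal sketch, again using a temp file rather than a real path (class and method names are illustrative):

```java
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class BomPeek {
    // Returns the first two bytes of the file as an "XX YY" hex string.
    static String firstTwoBytes(Path path) throws Exception {
        try (InputStream in = Files.newInputStream(path)) {
            int b0 = in.read();
            int b1 = in.read();
            return String.format("%02X %02X", b0, b1);
        }
    }

    public static void main(String[] args) throws Exception {
        Path path = Files.createTempFile("bom", ".txt");
        Files.write(path, "Hello".getBytes(StandardCharsets.UTF_16));
        // Prints the UTF-16 big-endian byte-order mark: FE FF
        System.out.println(firstTwoBytes(path));
        Files.delete(path);
    }
}
```

A little-endian file (Notepad's "Unicode") would start with FF FE instead, and a UTF-8 file with a BOM starts with EF BB BF.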

As an aside, I'd strongly recommend not using FileReader - it doesn't allow you to specify the encoding. I'd use Files.newBufferedReader, and either specify the encoding or let it default to UTF-8 (rather than the system default encoding which FileReader uses). When you're using BufferedReader, you can then just read a line at a time with readLine() too:

try (BufferedReader reader = Files.newBufferedReader(Paths.get("G:\\text.txt"))) {
    String line;
    while ((line = reader.readLine()) != null) {
        System.out.println(line.replace("\uFEFF", ""));
    }
}

If you really want to read a character at a time, it's worth getting in the habit of using a StringBuilder instead of repeated string concatenation in a loop. Also note that your variable name of ascii is misleading: it's actually the UTF-16 code unit, which may or may not be an ASCII character.
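A sketch of that character-at-a-time loop rewritten with StringBuilder — the path, charset, and class name here are placeholders, so substitute your own:

```java
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class CharLoop {
    static String readAll(Path path) throws Exception {
        StringBuilder result = new StringBuilder();
        try (Reader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            int codeUnit; // a UTF-16 code unit, not necessarily an ASCII character
            while ((codeUnit = reader.read()) != -1) {
                result.append((char) codeUnit);
            }
        }
        return result.toString();
    }

    public static void main(String[] args) throws Exception {
        Path path = Files.createTempFile("read", ".txt");
        Files.write(path, "Hello to every one".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(path)); // Hello to every one
        Files.delete(path);
    }
}
```

Appending to a StringBuilder is linear overall, whereas `result = result + (char) ascii` copies the whole string on every iteration, making the loop quadratic.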

The encoding you specify should match the encoding used to write the file - at that point you should see the correct output instead of an extra character between each "real" character when using Unicode and Unicode big endian.
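As a sketch of what "matching the encoding" looks like in practice: Java's UTF-16 charset uses the byte-order mark to pick the byte order and consumes it during decoding, so the round trip is clean (a temp file stands in for the real one, and the class name is illustrative):

```java
import java.io.BufferedReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class MatchingCharset {
    static String readFirstLine(Path path) throws Exception {
        try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_16)) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws Exception {
        Path path = Files.createTempFile("enc", ".txt");
        // Written as UTF-16 with a BOM, like Notepad's Unicode modes (big-endian here).
        Files.write(path, "Hello to every one".getBytes(StandardCharsets.UTF_16));
        // Read back with the matching charset: no stray characters, BOM handled.
        System.out.println(readFirstLine(path)); // Hello to every one
        Files.delete(path);
    }
}
```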


4 Comments

It seems your answer is right; could you please show the correct way of using Files.newBufferedReader?
@ElyasHadizadeh: Well, have you looked at the documentation and tried using it yourself? It's very important to be able to do your own research.
Yes, you are completely right, thank you for your advice and answer ;-)
Jon Skeet: Thank you again very much, I found the correct way, and this line of code: line.replace("\uFEFF", "") was very helpful
