
I have a slight problem trying to save a file in Java. For some reason, the content I get after saving my file is different from what I had when I read it.

I guess this is related to file encoding, but I am not sure.

Here is the test code I put together. The idea is basically to read a file and save it again. When I open both files, they are different.

package workspaceFun;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

import org.apache.commons.codec.DecoderException;

public class FileSaveTest {

    public static void main(String[] args) throws IOException, DecoderException{

        String location = "test.location";
        File locationFile = new File(location);

        FileInputStream fis = new FileInputStream(locationFile);

        InputStreamReader r = new InputStreamReader(fis, Charset.forName("UTF-8"));
        System.out.println(r.getEncoding());


        StringBuilder builder = new StringBuilder();
        int ch;
        while((ch = fis.read()) != -1){
            builder.append((char)ch);
        }

        String fullLocationString = builder.toString();             

        //Now we want to save back
        FileOutputStream fos = new FileOutputStream("C:/Users/me/Desktop/test");
        byte[] b = fullLocationString.getBytes();
        fos.write(b);
        fos.close();
        r.close();
    }
}

An extract from the input file (opened as plain text using Sublime 2):

40b1 8b81 23bc 0014 1a25 96e7 a393 be1e

and from the output file :

40c2 b1c2 8bc2 8123 c2bc 0014 1a25 c296

The getEncoding method returns "UTF8". Trying to save the output file using the same charset does not seem to solve the issue.

What puzzles me is that when I try to read the input file using Hex from apache.commons.codec like this :

String hexLocationString2 = Hex.encodeHexString(fullLocationString.getBytes("UTF-8"));

The String already looks like my output file, not the input.
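The hex dumps above actually point at the cause. A minimal sketch (using the first bytes from the question's own dump) of what the question's loop does: `FileInputStream.read()` returns each raw byte as an int 0–255, the cast to `char` turns a byte like `0xB1` into the code point U+00B1, and `getBytes("UTF-8")` then encodes every code point above U+007F as two bytes — which is exactly the `40 b1` → `40 c2 b1` expansion seen in the output file:

```java
import java.nio.charset.StandardCharsets;

public class DoubleEncodeDemo {
    public static void main(String[] args) {
        // Bytes taken from the question's input dump: 40 b1 8b 81
        byte[] input = {(byte) 0x40, (byte) 0xB1, (byte) 0x8B, (byte) 0x81};

        // Simulate the question's loop: read each byte and cast it to char.
        StringBuilder builder = new StringBuilder();
        for (byte b : input) {
            int ch = b & 0xFF;          // FileInputStream.read() returns 0..255
            builder.append((char) ch);  // 0xB1 becomes the char U+00B1
        }

        // Re-encoding as UTF-8 turns every char above U+007F into two bytes.
        byte[] output = builder.toString().getBytes(StandardCharsets.UTF_8);
        for (byte b : output) {
            System.out.printf("%02x ", b);
        }
        System.out.println();
        // Prints: 40 c2 b1 c2 8b c2 81 -- matching the question's output dump
    }
}
```

So the file is not being decoded with the wrong charset on the way in; it is being *encoded* to UTF-8 on the way out, even though the input was never UTF-8 text to begin with.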

Would you have any idea what could be going wrong? Thanks

Extra info for those interested: I am trying to read an Eclipse .location file.

EDIT: I placed the file online so that you can test the code

  • I believe if no Charset is set it defaults to your default charset (in your case UTF-8); try adding a Charset as the second parameter of your InputStreamReader Commented Sep 18, 2014 at 19:14
  • The InputStreamReader is only used to see the encoding. It does not do any processing. Commented Sep 18, 2014 at 19:15
  • Ok. Well, I have tried with UTF-8 too :). No change in the issue, sadly Commented Sep 18, 2014 at 19:18
  • Probably need an OutputStreamWriter to set the Charset of the FileOutputStream Commented Sep 18, 2014 at 19:21
  • Just tried, doesn't change anything either :S Commented Sep 18, 2014 at 19:36

2 Answers


I believe it is the way you are reading the stream.

You are using the FileInputStream directly to read the content instead of wrapping it in the InputStreamReader.

By using the InputStreamReader you can determine which Charset to use.

Take into consideration that the Charset passed to the InputStreamReader must match the file's actual encoding: a Reader doesn't detect charsets, it just decodes the bytes in the format you specify.

Try the following changes:

InputStreamReader r = new InputStreamReader(new FileInputStream(locationFile), StandardCharsets.UTF_8);

then instead of fis.read() use r.read()

Finally when writing the String get the bytes in the same Charset as your Reader

FileOutputStream fos = new FileOutputStream("C:/Users/me/Desktop/test");        
fos.write(fullLocationString.getBytes(StandardCharsets.UTF_8));
fos.close();

6 Comments

Hum, there is some change indeed when I use the InputStreamReader instead of the FileInputStream directly. But the outcome is still not the same though :S 40ef bfbd efbf bdef bfbd 23ef bfbd 0014
Can you detect which encoding the original file is in?
The InputStreamReader tells me it is UTF8. Which is why I am confused. I added a link in the post so you can download the file if you want. Thanks for the help
Here lies the misconception... InputStreamReader doesn't "DETECT" the encoding, it just reads the bytes in the encoding you SPECIFY... try changing the encoding from UTF_8 to ISO_8859_1
btw... using ISO_8859_1 and then doing a file compare between your file and my output (fc command in the Windows Command Line) yields: FC: no differences encountered.
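The ISO_8859_1 observation in the last comment can be sketched as follows (the temp file here stands in for the question's .location file, with bytes taken from its hex dump). ISO-8859-1 maps every byte 0x00–0xFF to the code point with the same value, so a byte → String → byte round trip through it is lossless, unlike UTF-8:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class Iso88591RoundTrip {
    public static void main(String[] args) throws IOException {
        // Stand-in for the question's binary .location file: bytes from its hex dump.
        byte[] original = {(byte) 0x40, (byte) 0xB1, (byte) 0x8B, (byte) 0x81,
                           (byte) 0x23, (byte) 0xBC, (byte) 0x00, (byte) 0x14};
        Path in = Files.createTempFile("test", ".location");
        Files.write(in, original);

        // ISO-8859-1 maps every byte 0x00-0xFF to the code point with the
        // same value, so decode + re-encode preserves the bytes exactly.
        String asString = new String(Files.readAllBytes(in), StandardCharsets.ISO_8859_1);
        byte[] roundTripped = asString.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(Arrays.equals(original, roundTripped)); // true
    }
}
```

That said, for a binary file like Eclipse's .location, copying the bytes directly (Files.copy, or readAllBytes/write) avoids the String detour entirely.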

Try to read and write back as below:

import java.io.BufferedReader;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.IOException;

public class FileSaveTest {

    public static void main(String[] args) throws IOException {

        String location = "D:\\test.txt";

        BufferedReader br = new BufferedReader(new FileReader(location));
        StringBuilder sb = new StringBuilder();

        try {
            String line = br.readLine();

            while (line != null) {
                sb.append(line);
                line = br.readLine();

                if (line != null)
                    sb.append(System.lineSeparator());
            }

        } finally {
            br.close();
        }

        FileOutputStream fos = new FileOutputStream("D:\\text_created.txt");
        byte[] b = sb.toString().getBytes();
        fos.write(b);
        fos.close();

    }
}

The test file contains both Cyrillic and Latin characters.

SDFASDF
XXFsd1
12312
іва

2 Comments

Please, please, never use FileReader and .getBytes() without encoding!
Sure, encoding is very important. I did this as an example (first with an encoding), then I deleted the encoding and it also worked. (With the approach used by the creator of this question my file couldn't be read, and as you may see he used an encoding.)
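The point raised in the first comment can be sketched like this, assuming a UTF-8 text file (the temp files here stand in for the answer's D:\ paths): pass the charset explicitly at both ends instead of relying on the platform default that FileReader and no-argument getBytes() use.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExplicitCharsetCopy {
    public static void main(String[] args) throws IOException {
        // Hypothetical sample file standing in for D:\test.txt.
        Path in = Files.createTempFile("test", ".txt");
        Files.write(in, "SDFASDF\nіва".getBytes(StandardCharsets.UTF_8));

        // Decode with an explicit charset instead of the platform default.
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = Files.newBufferedReader(in, StandardCharsets.UTF_8)) {
            String line;
            while ((line = br.readLine()) != null) {
                if (sb.length() > 0) sb.append(System.lineSeparator());
                sb.append(line);
            }
        }

        // Encode with the same charset when writing back.
        Path out = Files.createTempFile("text_created", ".txt");
        Files.write(out, sb.toString().getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(Files.readAllBytes(out), StandardCharsets.UTF_8));
    }
}
```

With the charset pinned on both the read and the write, the Cyrillic characters survive regardless of the machine's default encoding.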
