
I have a slight problem trying to save a file in Java. For some reason, the content I get after saving my file is different from what I had when I read it.

I guess this is related to file encoding, but I am not sure.

Here is the test code I put together. The idea is basically to read a file and save it again. When I open both files, they are different.

package workspaceFun;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

import org.apache.commons.codec.DecoderException;

public class FileSaveTest {

    public static void main(String[] args) throws IOException, DecoderException{

        String location = "test.location";
        File locationFile = new File(location);

        FileInputStream fis = new FileInputStream(locationFile);

        InputStreamReader r = new InputStreamReader(fis, Charset.forName("UTF-8"));
        System.out.println(r.getEncoding());


        StringBuilder builder = new StringBuilder();
        int ch;
        while((ch = fis.read()) != -1){
            builder.append((char)ch);
        }

        String fullLocationString = builder.toString();             

        //Now we want to save back
        FileOutputStream fos = new FileOutputStream("C:/Users/me/Desktop/test");
        byte[] b = fullLocationString.getBytes();
        fos.write(b);
        fos.close();
        r.close();
    }
}

An extract from the input file (opened as plain text using Sublime 2):

40b1 8b81 23bc 0014 1a25 96e7 a393 be1e

and from the output file :

40c2 b1c2 8bc2 8123 c2bc 0014 1a25 c296

The getEncoding method returns "UTF8". Trying to save the output file using the same charset does not seem to solve the issue.

What puzzles me is that when I try to read the input file using Hex from apache.commons.codec like this :

String hexLocationString2 = Hex.encodeHexString(fullLocationString.getBytes("UTF-8"));

The String already looks like my output file, not the input.
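The hex dumps above actually point at the cause. A minimal sketch (using the first bytes from the question's own dump) of what the question's loop does: `FileInputStream.read()` returns each raw byte as an int 0–255, the cast to `char` turns a byte like `0xB1` into the code point U+00B1, and `getBytes("UTF-8")` then encodes every code point above U+007F as two bytes — which is exactly the `40 b1` → `40 c2 b1` expansion seen in the output file:

```java
import java.nio.charset.StandardCharsets;

public class DoubleEncodeDemo {
    public static void main(String[] args) {
        // Bytes taken from the question's input dump: 40 b1 8b 81
        byte[] input = {(byte) 0x40, (byte) 0xB1, (byte) 0x8B, (byte) 0x81};

        // Simulate the question's loop: read each byte and cast it to char.
        StringBuilder builder = new StringBuilder();
        for (byte b : input) {
            int ch = b & 0xFF;          // FileInputStream.read() returns 0..255
            builder.append((char) ch);  // 0xB1 becomes the char U+00B1
        }

        // Re-encoding as UTF-8 turns every char above U+007F into two bytes.
        byte[] output = builder.toString().getBytes(StandardCharsets.UTF_8);
        for (byte b : output) {
            System.out.printf("%02x ", b);
        }
        System.out.println();
        // Prints: 40 c2 b1 c2 8b c2 81 -- matching the question's output dump
    }
}
```

So the file is not being decoded with the wrong charset on the way in; it is being *encoded* to UTF-8 on the way out, even though the input was never UTF-8 text to begin with.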

Would you have any idea what could be going wrong? Thanks

Extra info for those interested: I am trying to read an Eclipse .location file.

EDIT: I placed the file online so that you can test the code

  • I believe if no Charset is set it defaults to your default charset (in your case UTF-8); try adding a Charset as the second parameter of your InputStreamReader Commented Sep 18, 2014 at 19:14
  • The InputStreamReader is only used to see the encoding. It does not do any processing. Commented Sep 18, 2014 at 19:15
  • Ok. Well, I have tried with UTF-8 too :). No change in the issue, sadly Commented Sep 18, 2014 at 19:18
  • Probably need an OutputStreamWriter to set the Charset of the FileOutputStream Commented Sep 18, 2014 at 19:21
  • Just tried, doesn't change anything either :S Commented Sep 18, 2014 at 19:36

2 Answers


I believe it is the way you are reading the stream.

You are using the FileInputStream directly to read the content instead of wrapping it in the InputStreamReader.

By using the InputStreamReader you can determine which Charset to use.

Take into consideration that the Charset passed to the InputStreamReader must match the file's actual encoding: a Reader doesn't detect charsets, it just decodes the bytes in the format you specify.

Try the following changes:

InputStreamReader r = new InputStreamReader(new FileInputStream(locationFile), StandardCharsets.UTF_8);

then instead of fis.read() use r.read()

Finally when writing the String get the bytes in the same Charset as your Reader

FileOutputStream fos = new FileOutputStream("C:/Users/me/Desktop/test");        
fos.write(fullLocationString.getBytes(StandardCharsets.UTF_8));
fos.close();

6 Comments

Hum, there is some change indeed when I use the InputStreamReader instead of the FileInputStream directly. But the outcome is still not the same though :S 40ef bfbd efbf bdef bfbd 23ef bfbd 0014
Can you detect which encoding the original file is in?
The InputStreamReader tells me it is UTF8. Which is why I am confused. I added a link in the post so you can download the file if you want. Thanks for the help
Here lies the misconception... InputStreamReader doesn't "DETECT" the encoding, it just reads the bytes in the encoding you SPECIFY... try changing the encoding from UTF_8 to ISO_8859_1
btw... using ISO_8859_1 and then doing a file compare between your file and my output (fc command in the Windows Command Line) yields: FC: no differences encountered.
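The ISO_8859_1 observation in the last comment can be sketched as follows (the temp file here stands in for the question's .location file, with bytes taken from its hex dump). ISO-8859-1 maps every byte 0x00–0xFF to the code point with the same value, so a byte → String → byte round trip through it is lossless, unlike UTF-8:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class Iso88591RoundTrip {
    public static void main(String[] args) throws IOException {
        // Stand-in for the question's binary .location file: bytes from its hex dump.
        byte[] original = {(byte) 0x40, (byte) 0xB1, (byte) 0x8B, (byte) 0x81,
                           (byte) 0x23, (byte) 0xBC, (byte) 0x00, (byte) 0x14};
        Path in = Files.createTempFile("test", ".location");
        Files.write(in, original);

        // ISO-8859-1 maps every byte 0x00-0xFF to the code point with the
        // same value, so decode + re-encode preserves the bytes exactly.
        String asString = new String(Files.readAllBytes(in), StandardCharsets.ISO_8859_1);
        byte[] roundTripped = asString.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(Arrays.equals(original, roundTripped)); // true
    }
}
```

That said, for a binary file like Eclipse's .location, copying the bytes directly (Files.copy, or readAllBytes/write) avoids the String detour entirely.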

Try to read and write back as below:

import java.io.BufferedReader;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.IOException;

public class FileSaveTest {

    public static void main(String[] args) throws IOException {

        String location = "D:\\test.txt";

        BufferedReader br = new BufferedReader(new FileReader(location));
        StringBuilder sb = new StringBuilder();

        try {
            String line = br.readLine();

            while (line != null) {
                sb.append(line);
                line = br.readLine();

                if (line != null)
                    sb.append(System.lineSeparator());
            }

        } finally {
            br.close();
        }

        FileOutputStream fos = new FileOutputStream("D:\\text_created.txt");
        byte[] b = sb.toString().getBytes();
        fos.write(b);
        fos.close();

    }
}

The test file contains both Cyrillic and Latin characters.

SDFASDF
XXFsd1
12312
іва

2 Comments

Please, please, never use FileReader and .getBytes() without encoding!
Sure, encoding is very important. I did this as an example (first with an encoding), then I deleted the encoding and it also worked. (With the approach used by the creator of this question my file couldn't be read, and as you may see he used an encoding.)
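The point raised in the first comment can be sketched like this, assuming a UTF-8 text file (the temp files here stand in for the answer's D:\ paths): pass the charset explicitly at both ends instead of relying on the platform default that FileReader and no-argument getBytes() use.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExplicitCharsetCopy {
    public static void main(String[] args) throws IOException {
        // Hypothetical sample file standing in for D:\test.txt.
        Path in = Files.createTempFile("test", ".txt");
        Files.write(in, "SDFASDF\nіва".getBytes(StandardCharsets.UTF_8));

        // Decode with an explicit charset instead of the platform default.
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = Files.newBufferedReader(in, StandardCharsets.UTF_8)) {
            String line;
            while ((line = br.readLine()) != null) {
                if (sb.length() > 0) sb.append(System.lineSeparator());
                sb.append(line);
            }
        }

        // Encode with the same charset when writing back.
        Path out = Files.createTempFile("text_created", ".txt");
        Files.write(out, sb.toString().getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(Files.readAllBytes(out), StandardCharsets.UTF_8));
    }
}
```

With the charset pinned on both the read and the write, the Cyrillic characters survive regardless of the machine's default encoding.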
