I'd like to download the sources of many web pages, write each one to a file, and print it in the NetBeans console. I have a problem with encoding. First, check out my code:

public static final void foo(URL url, Charset encoding, String file) {
    String readLine;
    // try-with-resources closes the reader and the writer even if copying fails
    try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream(), encoding));
         BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file), encoding)))
    {
        while ((readLine = in.readLine()) != null) {
            System.out.println(readLine+"\n");
            out.write(readLine+"\n");
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

I am testing this on two foreign websites (e.g. Czech and Thai).

I tried Charset.forName("UTF-8"), which seems to work correctly for the Thai web page but not for the Czech one: both the console and the file contain the replacement character �.

I have also tried ISO-8859-2, which saves the file correctly, but the console shows small rectangles instead of letters such as ž and š.

Is there a universal solution for multilingual websites (Czech, Japanese, Thai and more) that lets me save the page to a file correctly as well as print it to the console or store it in a variable?

1 Answer

The problem is that there is no such thing as an ultimate encoding. The state-of-the-art encoding at the moment is probably UTF-8, but each site can decide on its own which encoding it uses. Here is a pretty decent article worth reading that describes the basic problem of finding a character encoding that works worldwide.
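To make the point concrete, here is a minimal sketch of what happens when bytes written in one encoding are read back in another. The class name EncodingMismatchDemo and the helper decode are made up for illustration, and the sketch assumes the JVM ships the ISO-8859-2 charset (common, but not one of the charsets Java guarantees).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the question's code.
public class EncodingMismatchDemo {

    // Decodes raw bytes with the given charset. The String constructor
    // replaces malformed input with the charset's replacement string,
    // which for UTF-8 is U+FFFD (the "question mark" character).
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Assumption: ISO-8859-2 is available on this JVM.
        Charset latin2 = Charset.forName("ISO-8859-2");

        // Unicode escapes for the Czech letters z-caron and s-caron, so the
        // example does not depend on the encoding of the source file.
        String original = "\u017E\u0161";
        byte[] bytes = original.getBytes(latin2); // one byte per letter

        // Same charset on both sides: the text round-trips.
        System.out.println(decode(bytes, latin2).equals(original)); // true

        // Wrong charset on the reading side: the bytes are not valid UTF-8.
        System.out.println(decode(bytes, StandardCharsets.UTF_8).equals(original)); // false
    }
}
```

This is likely what happens with the Czech page in the question: its bytes are not UTF-8, so decoding them as UTF-8 yields replacement characters.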

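Therefore, the best solution would be to ask the reader which encoding it is using with InputStreamReader.getEncoding(). Be aware that a reader created without an explicit charset uses the platform default, so this returns the JVM's default encoding rather than one detected from the HTML page: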

public static final void foo(URL url, String file){
  String readLine;
  try{
    InputStreamReader isr = new InputStreamReader(url.openStream());
    // getEncoding() returns the name of the encoding this reader is using
    String encoding = isr.getEncoding();
    // try-with-resources closes the reader and the writer even if copying fails
    try (BufferedReader in = new BufferedReader(isr);
         BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file), encoding))) {
      while ((readLine = in.readLine()) != null) {
        System.out.println(readLine+"\n");
        out.write(readLine+"\n");
      }
    }
  } catch (IOException e) {
    e.printStackTrace();
  }
}

This should work as intended.
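If it does not, a different approach (not the method above) is to read the charset the server declares in its Content-Type response header and fall back to UTF-8 when none is given. The following is only a sketch under that assumption: the names CharsetSniffer, charsetFromContentType and savePage are hypothetical, and pages that declare their encoding only in an HTML meta tag are not handled.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
public class CharsetSniffer {

    // Matches the charset parameter of a header value such as
    // "text/html; charset=ISO-8859-2" (optionally quoted, any case).
    private static final Pattern CHARSET_PARAM =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in the header value, or the fallback if the
    // header is missing, has no charset parameter, or names an unknown charset.
    static Charset charsetFromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET_PARAM.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return fallback;
        }
    }

    // Reads the page with the declared charset and always writes UTF-8,
    // so every saved file has the same, known encoding.
    static void savePage(URL url, Path file) throws IOException {
        URLConnection conn = url.openConnection();
        Charset source = charsetFromContentType(conn.getContentType(), StandardCharsets.UTF_8);
        try (BufferedReader in = new BufferedReader(
                     new InputStreamReader(conn.getInputStream(), source));
             BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                out.write(line);
                out.newLine();
            }
        }
    }
}
```

Writing every file as UTF-8 means the Czech and the Thai page end up in the same encoding regardless of how they were served. Whether the console then shows the letters correctly still depends on the console's own encoding and font.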

1 Comment

Hmm, OK, then I don't really know what to do. Can you give me the URL of the website where it failed?
