How to handle a non-UTF-8 HTML page in Java?

Question

My task is to retrieve HTML strings from URLs using Java.

I know how to use HttpURLConnection and InputStream to get the string.

However, I have an encoding problem with some pages.

If a page uses an encoding other than UTF-8 (e.g., GB2312), the string I get is just arbitrary characters or question marks.

Can anyone please tell me how to solve this problem?

Thanks

Below is my code to download the HTML from a URL.

private String downloadHtml(String urlString) {
    URL url = null;
    InputStream inStr = null;
    StringBuffer buffer = new StringBuffer();

    try {
        url = new URL(urlString);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection(); // Cast shouldn't fail
        HttpURLConnection.setFollowRedirects(true);
        // allow both GZip and Deflate (ZLib) encodings
        //conn.setRequestProperty("Accept-Encoding", "gzip, deflate"); 
        String encoding = conn.getContentEncoding();
        inStr = null;

        // create the appropriate stream wrapper based on
        // the encoding type
        if (encoding != null && encoding.equalsIgnoreCase("gzip")) {
            inStr = new GZIPInputStream(conn.getInputStream());
        } else if (encoding != null && encoding.equalsIgnoreCase("deflate")) {
            inStr = new InflaterInputStream(conn.getInputStream(),
              new Inflater(true));
        } else {
            inStr = conn.getInputStream();
        }
        int ptr = 0;


        InputStreamReader inStrReader = new InputStreamReader(inStr, Charset.forName("GB2312"));

        while ((ptr = inStrReader.read()) != -1) {
            buffer.append((char)ptr);
        }
        inStrReader.close();

        conn.disconnect();
    }
    catch(Exception e) {

        e.printStackTrace();
    }
    finally {
        if (inStr != null)
            try {
                inStr.close();
            } catch (IOException e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
            }
    }

    return buffer.toString();
}

chrisdotcode · Accepted Answer · 2012-01-25 17:45:50Z

4

Wrap the stream in an InputStreamReader and specify the page's charset, like so:

InputStreamReader inStrReader = new InputStreamReader(inStr, Charset.forName("GB2312"));

The following code worked for me:

import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.Charset;
import java.util.zip.GZIPInputStream;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;

public class Foo {

public static void main(String[] args) {
    System.out.println(downloadHtml("http://baike.baidu.com/view/6000001.htm"));
}


private static String downloadHtml(String urlString) {
    URL url = null;
    InputStream inStr = null;
    StringBuffer buffer = new StringBuffer();

    try {
        url = new URL(urlString);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection(); // Cast shouldn't fail
        HttpURLConnection.setFollowRedirects(true);
        // allow both GZip and Deflate (ZLib) encodings
        //conn.setRequestProperty("Accept-Encoding", "gzip, deflate"); 
        String encoding = conn.getContentEncoding();
        inStr = null;

        // create the appropriate stream wrapper based on
        // the encoding type
        if (encoding != null && encoding.equalsIgnoreCase("gzip")) {
            inStr = new GZIPInputStream(conn.getInputStream());
        } else if (encoding != null && encoding.equalsIgnoreCase("deflate")) {
            inStr = new InflaterInputStream(conn.getInputStream(),
              new Inflater(true));
        } else {
            inStr = conn.getInputStream();
        }
        int ptr = 0;


        InputStreamReader inStrReader = new InputStreamReader(inStr, Charset.forName("GB2312"));

        while ((ptr = inStrReader.read()) != -1) {
            buffer.append((char)ptr);
        }
        inStrReader.close();

        conn.disconnect();
    }
    catch(Exception e) {

        e.printStackTrace();
    }
    finally {
        if (inStr != null)
            try {
                inStr.close();
            } catch (IOException e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
            }
    }

    return buffer.toString();
}

}
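
As a side note, the read loop above could be written more compactly. The following is only a sketch, not part of the original answer: the class name ReadWithCharset and the helper readAll are hypothetical, and a ByteArrayInputStream with ISO-8859-1 bytes stands in for the HTTP response so the example runs without a network connection. It reads in chunks rather than one character at a time and lets try-with-resources close the reader.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReadWithCharset {

    // Hypothetical helper: decodes the whole stream using the given charset.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] chunk = new char[4096];
        // try-with-resources closes the reader (and the wrapped stream), even on failure
        try (Reader reader = new InputStreamReader(in, charset)) {
            int n;
            while ((n = reader.read(chunk)) != -1) {
                sb.append(chunk, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // 0xE9 is "é" in ISO-8859-1; these bytes stand in for a non-UTF-8 page body
        byte[] body = {'c', 'a', 'f', (byte) 0xE9};
        System.out.println(readAll(new ByteArrayInputStream(body), StandardCharsets.ISO_8859_1));
    }
}
```

In downloadHtml, the same helper could presumably be called with the (possibly decompressed) connection stream and Charset.forName("GB2312").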

edited Jan 25, 2012 at 17:45

answered Jan 5, 2012 at 17:40

chrisdotcode


6 Comments

Jack Over a year ago

I tried your suggestion, return new String(buffer.toString().getBytes(), "GB2312"); but it doesn't work. The encoding is still wrong for the URL baike.baidu.com/view/6000001.htm

chrisdotcode Over a year ago

My apologies, I have since confirmed that it does not work. However, the first method, using an InputStreamReader, will work. Unfortunately, this may mean that you have to refactor a bit of code...

Jack Over a year ago

If I get a string encoded in GB2312 and then pass it directly to System.out.println, it should display correctly, right?

Jack Over a year ago

I changed the source code to use InputStreamReader, but the result is still wrong. Could you please have a look at the new source code in my question? I modified it and added the InputStreamReader.

chrisdotcode Over a year ago

I revised the solution with code that worked for me. If the problem persists, could this be a problem with how your machine renders the charset?
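
The new String(buffer.toString().getBytes(), ...) approach discussed in these comments fails because the damage happens at decode time: once bytes have been decoded with the wrong charset, the original bytes are generally gone, and re-encoding the String cannot bring them back. The sketch below illustrates this under stated assumptions: the class RedecodeDemo and its methods are hypothetical, and UTF-8 and US-ASCII stand in for GB2312 and the wrong charset because both are guaranteed to exist on every JVM.

```java
import java.nio.charset.StandardCharsets;

public class RedecodeDemo {

    // Decodes UTF-8 bytes with the wrong charset, then attempts the getBytes() "repair".
    static String wrongThenRepair(String original) {
        byte[] raw = original.getBytes(StandardCharsets.UTF_8);
        // Wrong charset: every non-ASCII byte is replaced with U+FFFD, losing the data
        String wrong = new String(raw, StandardCharsets.US_ASCII);
        // Re-encoding cannot restore the lost bytes
        return new String(wrong.getBytes(StandardCharsets.US_ASCII), StandardCharsets.UTF_8);
    }

    // Decodes the raw bytes with the charset they were actually written in.
    static String decodeCorrectly(String original) {
        byte[] raw = original.getBytes(StandardCharsets.UTF_8);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9";
        System.out.println(original.equals(wrongThenRepair(original)));  // false
        System.out.println(original.equals(decodeCorrectly(original)));  // true
    }
}
```

This is why the charset has to be supplied where the bytes are first turned into characters, which is what the InputStreamReader constructor does.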


Philippe · Accepted Answer · 2012-01-05 17:36:47Z

1

Read your InputStream with an InputStreamReader, using the constructor InputStreamReader(InputStream in, Charset cs).
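
Since the question mentions that different pages use different encodings, the Charset passed to that constructor could be taken from the response's Content-Type header instead of being hard-coded. This is a minimal sketch under assumptions, not something either answer proposes: the class CharsetFromHeader and the helper charsetFrom are hypothetical names, and the header value in main is made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetFromHeader {

    // Hypothetical helper: extracts the charset parameter from a Content-Type value
    // such as "text/html; charset=GB2312", returning the fallback if none is usable.
    static Charset charsetFrom(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive check for a "charset=" prefix
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // Illustrative header value; a real one would come from conn.getContentType()
        System.out.println(charsetFrom("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
        System.out.println(charsetFrom("text/html", StandardCharsets.UTF_8));
    }
}
```

With an HttpURLConnection, the header value is available from conn.getContentType(). Some servers omit the charset parameter and declare the encoding only in an HTML meta tag, which this sketch does not handle.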

answered Jan 5, 2012 at 17:36

Philippe
