How to retrieve an HTML page in the proper encoding using Java?

Question

How can I read an HTTP stream containing an HTML page using the page's own encoding?

Here is the code fragment I use to get the HTTP stream. InputStreamReader takes an optional encoding argument, but I have no idea how to obtain the encoding.

URLConnection conn = url.openConnection();
InputStream is = conn.getInputStream();
BufferedReader d = new BufferedReader(new InputStreamReader(is));

cletus · Accepted Answer · 2009-08-10 16:01:56Z

4

Retrieving a web page is a reasonably complicated process. That's why libraries such as HttpClient exist. My advice is that unless you have a really compelling reason otherwise, use HttpClient.

1 Comment

informatik01 Over a year ago

Update. HttpClient has been replaced by the Apache HttpComponents project in its HttpClient and HttpCore modules, which offer better performance and more flexibility.

Niger · Accepted Answer · 2009-08-10 16:14:16Z

3

When the connection is established through

URLConnection conn = url.openConnection();

you can get the name of the encoding method through conn.getContentEncoding(), so pass this String to InputStreamReader(). The code then looks like

BufferedReader d = new BufferedReader(new InputStreamReader(is, conn.getContentEncoding()));
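One caveat worth checking against the Javadoc: getContentEncoding() returns the value of the Content-Encoding header, which usually names a compression scheme such as gzip (and is often null), while the character set normally travels in the charset parameter of the Content-Type header. A minimal sketch of reading it from there follows. The class ContentTypeCharset and its method fromContentType are hypothetical names made up for illustration, and the fallback charset is an assumption you would choose yourself.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the JDK or of the answer above.
final class ContentTypeCharset {
    // Matches the charset parameter, e.g. in "text/html; charset=UTF-8".
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in a Content-Type value, or the fallback if none is usable. */
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback; // the server sent no Content-Type header
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return fallback; // header present, but without a charset parameter
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            return fallback; // malformed or unsupported charset name
        }
    }

    public static void main(String[] args) {
        // Offline demonstration with a literal header value.
        System.out.println(fromContentType("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
    }
}
```

With the question's variables this would be used as new InputStreamReader(is, ContentTypeCharset.fromContentType(conn.getContentType(), StandardCharsets.UTF_8)). It only helps when the server actually declares a charset in the header; pages that declare it only inside the HTML still need the content inspected.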

4 Comments

pheasant Over a year ago

there is no url.getContentEncoding() method :-(

Yishai Over a year ago

Sure there is. java.sun.com/j2se/1.5.0/docs/api/java/net/…

Niger Over a year ago

which version of java are you using pal?

pheasant Over a year ago

sorry, you are right, i've tried class URL instead of URLConnection

Yishai · Accepted Answer · 2009-08-10 16:23:21Z

1

The short answer is URLConnection.getContentEncoding(). The right answer is what cletus suggests: use an appropriate third-party library unless you have a compelling reason not to.

1 Comment

Niger Over a year ago

There is no self-satisfaction unless the code is written by our own hand, rather than reaching for a third-party library.

Sebi · Accepted Answer · 2013-02-12 21:34:03Z

I had a very similar problem to solve recently. Like the other answers, I also started playing around with HttpClient et al. However, those libraries require that you know the encoding of the file you want to download up front. Otherwise, conversion of the retrieved HTML file will yield unreadable characters.

This approach won't work, because the encoding of the HTML file is specified only in the HTML file itself. Depending on the HTML version, the encoding is specified in many different ways, such as an XML header or two different head meta tag elements. If you follow this approach, you would need to:

1. Download the file and parse its HTML content to figure out the encoding.
2. Download the file a second time, specifying the proper encoding.
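To make the two steps above concrete, here is a rough sketch using only the JDK. It assumes the page bytes have already been read into memory once, so the second step becomes a re-decode of the same bytes instead of a second download. HtmlCharsetSniffer, sniff and decode are hypothetical names, and the regex is a crude stand-in for real HTML parsing: it only looks at the first 4 KB, it can be fooled by attributes such as accept-charset, and it does not handle encodings like UTF-16 that are not ASCII-compatible.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; a crude approximation, not a full HTML encoding detector.
final class HtmlCharsetSniffer {
    // Covers <meta charset="...">, <meta http-equiv="Content-Type" content="...; charset=...">
    // and the encoding="..." attribute of an XML declaration.
    private static final Pattern DECLARED = Pattern.compile(
            "(?:charset|encoding)\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    /** Step 1: look at the start of the raw bytes for a declared encoding. */
    static Charset sniff(byte[] page, Charset fallback) {
        int n = Math.min(page.length, 4096);
        // ISO-8859-1 maps every byte to exactly one char, so ASCII markup
        // stays readable whatever the real encoding turns out to be.
        String head = new String(page, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = DECLARED.matcher(head);
        if (!m.find()) {
            return fallback; // nothing declared in the inspected prefix
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            return fallback; // malformed or unsupported charset name
        }
    }

    /** Step 2: decode the same bytes again with the encoding found in step 1. */
    static String decode(byte[] page, Charset fallback) {
        return new String(page, sniff(page, fallback));
    }
}
```

Even in this small form the sketch shows why hand-rolled detection is fragile: every declaration style needs its own handling, and a wrong guess silently produces garbled text.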

Parsing HTML content for the proper encoding strings is especially error-prone. Instead, I suggest you rely on a library like JSoup, which will do the job for you. So instead of downloading the file via HttpClient, use JSoup to retrieve the file for you. In addition, JSoup provides a nice API to access different parts of the HTML page directly (e.g. the page title).
