I had a very similar problem to solve recently. Like the other answers, I also started by playing around with HttpClient et al. However, those libraries require that you know the encoding of the file you want to download up front. Otherwise, converting the retrieved HTML file to text will yield unreadable characters.
This approach won't work reliably, because the encoding of an HTML file is often specified only in the file itself (the HTTP Content-Type header can carry a charset, but many servers omit it). Depending on the HTML version, the encoding can be declared in several different ways: an XML declaration, two different forms of meta tag in the head, etc. If you follow this approach, you would need to:
- Download the file and parse its HTML content to figure out the encoding.
- Download the file a second time, this time specifying the proper encoding.
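To make those two steps concrete, here is a rough sketch of what the manual route might look like using only the JDK. The class name `CharsetSniffer`, its methods, and the regex are made up for illustration. It assumes the raw bytes are already in memory (so it decodes twice rather than downloading twice), and it only looks for the two meta tag forms. It ignores XML declarations, byte-order marks, and HTTP headers, which is exactly the kind of gap that makes this approach fragile.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetSniffer {
    // Hypothetical pattern: matches <meta charset="x"> as well as
    // <meta http-equiv="Content-Type" content="text/html; charset=x">.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)",
            Pattern.CASE_INSENSITIVE);

    // Step 1: look at the start of the content to guess the encoding.
    public static Charset sniff(byte[] raw, Charset fallback) {
        // ISO-8859-1 maps every byte to one char, so the ASCII markup
        // stays readable regardless of the real encoding.
        int probeLength = Math.min(raw.length, 1024);
        String probe = new String(raw, 0, probeLength, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(probe);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Malformed or unsupported charset name: use the fallback.
            }
        }
        return fallback;
    }

    // Step 2: decode again, this time with the detected encoding.
    public static String decode(byte[] raw) {
        return new String(raw, sniff(raw, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] raw = ("<html><head><meta charset=\"ISO-8859-1\">"
                + "<title>Caf\u00e9</title></head></html>")
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(decode(raw));
    }
}
```

Even this small version needs a fallback charset, a probe limit, and error handling for bogus charset names, and it still misses several legitimate ways of declaring the encoding.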
Parsing HTML content for the encoding declaration is especially error-prone. Instead, I suggest you rely on a library like JSoup, which will do the job for you. So instead of downloading the file via HttpClient, use JSoup to retrieve it. In addition, JSoup provides a nice API to access different parts of the HTML page directly (e.g. the page title).
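As a minimal sketch of the JSoup route, assuming the jsoup library is on your classpath: the class name `TitleFetcher`, its method names, and the example URL are placeholders of mine. `fetchTitle` lets JSoup download and decode the page. `titleOf` shows the same charset detection on bytes you already have, by passing `null` as the charset name so that JSoup works it out from the document.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TitleFetcher {
    // Let JSoup fetch the page and work out the encoding on its own.
    public static String fetchTitle(String url) throws IOException {
        Document doc = Jsoup.connect(url).get();
        return doc.title();
    }

    // Same idea for bytes already in memory: a null charset name asks
    // JSoup to detect the encoding from the document itself.
    public static String titleOf(byte[] raw) throws IOException {
        Document doc = Jsoup.parse(new ByteArrayInputStream(raw), null, "");
        return doc.title();
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL; substitute the page you actually want.
        System.out.println(fetchTitle("https://example.com/"));
    }
}
```

You never name an encoding yourself in either method, which is the whole point of delegating this to the library.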