I extract some information from the HTML source code of different pages with jsoup. Most of them are UTF-8 encoded. One of them is encoded with ISO-8859-1, which leads to a strange error (in my opinion).

The page that contains the error is: http://www.gudi.ch/armbanduhr-metall-wasserdicht-1280x960-megapixels-p-560.html

I read the needed String with the following piece of code:

Document doc = Jsoup.connect("http://www.gudi.ch/armbanduhr-metall-wasserdicht-1280x960-megapixels-p-560.html").userAgent("Mozilla").get();
String title = doc.getElementsByClass("products_name").first().text();

The problem is the hyphen in the String "HD Armbanduhr aus Metall 4GB Wasserdicht 1280X960 – 5 Megapixels". Normal umlauts like öäü are read correctly. Only this single character, which is not output as "&#45;", causes the problem.

I tried to override the (correctly set) page-encoding with out.outputSettings().charset("ISO-8859-1") but that didn't help either.

Next, I tried to change the encoding of the string manually with the Charset class, from and to UTF-8 and ISO-8859-1. Also no luck.

Does anyone have a tip on what I can try to get the correct character after parsing the HTML document with jsoup?

Thanks

  • Hmm, it's an – (e2 80 93), which under UTF-8 should be a valid character (I think). Is it possible that once it's read in as 8859-1 it's not possible to convert it back? Can you force-read it in as UTF-8? Commented Oct 10, 2011 at 15:35
  • Yes, I can force it with out.outputSettings().charset("UTF-8"), but that doesn't really help. When I print the character codes, the result is char code 150, which should be valid according to this page: web-source.net/symbols.htm. That made me realize that the char is not a hyphen or dash, which would be 45. Char code 150 is within the extended ASCII charset. Commented Oct 10, 2011 at 15:55

1 Answer

This is a mistake of the website itself. There are actually three mistakes:

  1. The page is served without any charset in the HTTP Content-Type response header. There's ISO-8859-1 in the HTML meta tag, but this is ignored when the page is served over HTTP! The average web browser will either try smart detection or use the platform default encoding to decode the webpage, which is CP1252 on Windows machines.

  2. The <meta> tag pretends that the content is ISO-8859-1 encoded, but the actual character (U+2013 EN DASH) is not covered by that charset at all. It is however covered by the CP1252 charset as 0x0096.

  3. According to the webpage source code, the product name uses the literal character instead of the HTML entity &ndash; as spotted elsewhere on the same webpage.
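
The mismatch in point 2 can be reproduced with the JDK alone. The following is a minimal sketch, not part of the original answer: the class name `DashDecodingDemo` and its two helper methods are made up for illustration. It decodes the single byte 0x96 (decimal 150, the char code reported in the comments) under both charsets. It also shows a possible repair for a string that has already been decoded as ISO-8859-1, which assumes that every byte of the page was mapped one-to-one to a char, as ISO-8859-1 decoding in Java does.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DashDecodingDemo {
    // Hypothetical helper: decodes one byte with the given charset
    // and returns the resulting Unicode code point.
    public static int decodeByte(int b, Charset charset) {
        String s = new String(new byte[] { (byte) b }, charset);
        return s.codePointAt(0);
    }

    // Hypothetical helper: repairs text that was decoded as ISO-8859-1
    // although the bytes were really CP1252 (windows-1252).
    // ISO-8859-1 maps each byte to the char with the same value,
    // so the original bytes can be recovered without loss.
    public static String repair(String misdecoded) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        // prints "ISO-8859-1: U+0096" (an invisible C1 control character)
        System.out.printf("ISO-8859-1: U+%04X%n",
                decodeByte(0x96, StandardCharsets.ISO_8859_1));
        // prints "CP1252: U+2013" (EN DASH)
        System.out.printf("CP1252: U+%04X%n", decodeByte(0x96, cp1252));
    }
}
```

If re-reading the page is not an option, something like `DashDecodingDemo.repair(title)` could be applied to the extracted title, but only when the text really went through ISO-8859-1 decoding. Feeding the raw bytes to Jsoup as CP1252 in the first place is the more robust route.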

Jsoup can transparently fix many badly developed webpages, but this one really goes beyond what Jsoup can handle. You need to read it in manually and then feed it to Jsoup as CP1252.

String url = "http://www.gudi.ch/armbanduhr-metall-wasserdicht-1280x960-megapixels-p-560.html";
// Open the raw byte stream ourselves so that Jsoup decodes it as CP1252.
try (InputStream input = new URL(url).openStream()) {
    Document doc = Jsoup.parse(input, "CP1252", url);
    String title = doc.select(".products_name").first().text();
    // ...
}
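
The question's original code sent a user agent via `.userAgent("Mozilla")`, which a plain `openStream()` call does not do. One way to keep it might be to go through `URLConnection` and set the header before the stream is opened. This is only a sketch under that assumption: the class `UserAgentStream` and the helper `prepare` are hypothetical names, and the Jsoup call is left as a comment so that the block compiles with the JDK alone.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

public class UserAgentStream {
    // Hypothetical helper: creates a connection object and sets the
    // User-Agent request header. No request is sent at this point;
    // the request goes out when the input stream is opened.
    public static URLConnection prepare(String url, String userAgent) {
        try {
            URLConnection connection = URI.create(url).toURL().openConnection();
            connection.setRequestProperty("User-Agent", userAgent);
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        String url = "http://www.gudi.ch/armbanduhr-metall-wasserdicht-1280x960-megapixels-p-560.html";
        URLConnection connection = prepare(url, "Mozilla");
        try (InputStream input = connection.getInputStream()) {
            // With jsoup on the classpath, the stream would be parsed here:
            // Document doc = Jsoup.parse(input, "CP1252", url);
            System.out.println("First byte of response: " + input.read());
        }
    }
}
```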
6 Comments

It looks like browsers tend to show 0x96 as en-dash even if ISO-8859-1 is specified in Content-Type header.
@axtavt: there's no charset in the content type header. The platform default charset will be used, which is CP1252 in Windows. See also point 1.
Thanks for the clear explanation of this problem! With the manual decoding (which I tried the same way yesterday, but with ISO-8859-1), the content is decoded correctly. I will contact the website operator about this, hoping he can fix it by either serving the page as UTF-8 or setting the Content-Type header to ISO-8859-1.
Not only that, the offending character must also be fixed. Depending on the source of the problem, it should be fixed by using UTF-8 to store data in the DB or by using htmlentities() to redisplay titles in HTML. It's a CP1252-specific character. Changing the content type charset to ISO-8859-1 or UTF-8 alone will fail, as this character then won't be displayed as such at all (which is exactly the problem you encountered yourself).
What about the user agent? How can it be set in this case?