
I'm writing a crawler in Java to crawl some websites, which may contain Unicode characters such as "£". When I store the content (the source HTML) in a Java String, these characters get lost and are replaced by the question mark "?". I'd like to know how to keep them intact. The relevant code is as follows:

protected String readWebPage(String weburl) throws IOException {
    HttpClient httpclient = new DefaultHttpClient();

    HttpGet httpget = new HttpGet(weburl);
    ResponseHandler<String> responseHandler = new BasicResponseHandler();
    String responseBody = httpclient.execute(httpget, responseHandler);
    // responseBody now contains the contents of the page
    httpclient.getConnectionManager().shutdown();
    return responseBody;
}

// function call
String res = readWebPage(url);
PrintWriter out = new PrintWriter(outDir + name + ".html");
out.println(res);
out.close();

And later when doing character matches, I also want to be able to do something like:

if(text.indexOf("£")>=0)

I don't know if Java will recognize that character and do as what I want it to do.

Any input will be greatly appreciated. Thanks in advance.

4 Answers


Your non-ASCII characters are either getting lost on input to Java or on output.

Java works with Unicode strings internally so you have to tell it how to decode input and encode output.

Let's assume that HttpClient is correctly interpreting the response from the remote server and is decoding the response correctly.

Next up, you have to ensure that you encode the contents correctly when you write them to disk. By default Java falls back to the platform's default charset (derived from the OS locale), which may not be suitable. To force the encoding, pass the encoding name to PrintWriter:

PrintWriter out = new PrintWriter(outDir+name+".html", "UTF-8");

Then check your output.html with a text editor, such as Notepad++, running in UTF-8 mode to ensure that you can still see non-ASCII chars.
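As a side note, the "?" substitution itself can be reproduced without any file I/O: encoding a string with a charset that cannot represent a character silently replaces it. A minimal sketch:

```java
import java.nio.charset.StandardCharsets;

public class ReplacementDemo {
    public static void main(String[] args) {
        String pound = "\u00A3"; // "£"

        // US-ASCII cannot represent U+00A3, so getBytes() silently
        // substitutes the replacement byte '?'
        String ascii = new String(pound.getBytes(StandardCharsets.US_ASCII),
                StandardCharsets.US_ASCII);
        System.out.println(ascii); // prints "?"

        // UTF-8 round-trips the character intact
        String utf8 = new String(pound.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.UTF_8);
        System.out.println(utf8.equals(pound)); // prints "true"
    }
}
```

The same silent substitution happens anywhere an encode step uses the wrong charset, which is why the symptom shows up only after the string hits bytes.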

If you can't, then you'll need to turn your attention to the input side - HttpClient. See this answer: Set response encoding with HttpClient 3.1 for clues if the remote server is lying about its character encoding.

In answer to your sub-question: you can use non-ASCII characters such as "£" in your source code if you tell Java what character encoding the source file is in. This is a parameter to javac, but since you're likely using an IDE, you can simply set the character encoding of the file in its properties and the IDE will do the rest. The most portable choice is to set your IDE's character encoding to "UTF-8". Eclipse lets you set the character encoding for the whole project or for individual files.
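If you'd rather not depend on the source file's encoding at all, you can also write the character as a Unicode escape, which javac resolves before parsing regardless of how the file is encoded. A small sketch:

```java
public class PoundSearch {
    public static void main(String[] args) {
        // \u00A3 is "£"; the escape survives any source-file encoding
        String text = "Price: \u00A35";
        System.out.println(text.indexOf("\u00A3") >= 0); // prints "true"
    }
}
```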




Use the following code (StandardCharsets requires Java 7+):

import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

FileOutputStream fileStream = new FileOutputStream(outDir + name + ".html");
OutputStreamWriter outputStreamWriter = new OutputStreamWriter(fileStream, StandardCharsets.UTF_8);
PrintWriter out = new PrintWriter(outputStreamWriter);

From Charset

A character-encoding scheme is a mapping between one or more coded character sets and a set of octet (eight-bit byte) sequences. UTF-8, UTF-16, ISO 2022, and EUC are examples of character-encoding schemes. Encoding schemes are often associated with a particular coded character set; UTF-8, for example, is used only to encode Unicode. Some schemes, however, are associated with multiple coded character sets; EUC, for example, can be used to encode characters in a variety of Asian coded character sets.
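A sketch of the same writer chain wrapped in try-with-resources (Java 7+) so the streams are closed even if writing fails; the file name and helper name here are just examples:

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class SavePage {
    static void save(Path file, String html) throws IOException {
        // OutputStreamWriter pins the encoding to UTF-8 regardless of platform default
        try (PrintWriter out = new PrintWriter(
                new OutputStreamWriter(Files.newOutputStream(file), StandardCharsets.UTF_8))) {
            out.print(html);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("page", ".html");
        save(tmp, "<p>\u00A3 preserved</p>");
        // Read back with the same charset to confirm the round trip
        String back = new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8);
        System.out.println(back.contains("\u00A3")); // prints "true"
        Files.delete(tmp);
    }
}
```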

1 Comment

Thank you for your answer. It worked for me. But can you give some explanation? So the Unicode chars are actually kept intact in the String res when reading the page from the server, but are lost when writing it to a file on the hard drive? And do you know why the code in the second block works? I'm copying the Unicode char into the source code directly to do a regex match. Just want to know how it works behind the scenes. Thanks a lot.

There are two steps. First you save the loaded String (in Java, always Unicode) as UTF-8. But the browser also needs to know the encoding, and when the file is opened from the file system it only has the HTML meta tags to go on. So you need to make sure there is something like

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

1. Patch the HTML charset declaration of the original page to UTF-8 first (note the backreference must be escaped as \\1 in a Java string literal):

String res2 = res.replaceFirst("charset=([-\\w]+)", "charset=UTF-8")
         .replaceFirst("charset=([\"'])([-\\w]+)\\1", "charset=$1UTF-8$1");
if (res2 == res) { // no charset given; replaceFirst returns the same instance when nothing matches
      res2 = res.replaceFirst("(?i)</head>",
              "<meta http-equiv='Content-Type' content='text/html; charset=UTF-8' />$0");
}
res = res2;

2. Then write the HTML with UTF-8:

PrintWriter out = new PrintWriter(outDir + name + ".html", "UTF-8");

This handles meta tags that declare the charset either via http-equiv Content-Type or via the (HTML5) charset attribute.
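The patching step can also be folded into one helper; the method name forceUtf8 is my own, and this sketch uses a plain contains check for the "no charset declared" case instead of the reference comparison:

```java
public class CharsetPatch {
    // Rewrite any declared charset (quoted or unquoted) to UTF-8;
    // inject a meta tag before </head> when none is declared.
    static String forceUtf8(String html) {
        String patched = html
                .replaceFirst("charset=([-\\w]+)", "charset=UTF-8")
                .replaceFirst("charset=([\"'])([-\\w]+)\\1", "charset=$1UTF-8$1");
        if (!patched.toLowerCase().contains("charset=")) { // no charset declared
            patched = patched.replaceFirst("(?i)</head>",
                    "<meta http-equiv='Content-Type' content='text/html; charset=UTF-8' />$0");
        }
        return patched;
    }

    public static void main(String[] args) {
        // Quoted declaration gets rewritten in place
        String quoted = forceUtf8("<head><meta charset=\"iso-8859-1\"></head>");
        System.out.println(quoted.contains("charset=\"UTF-8\"")); // prints "true"

        // Missing declaration gets a meta tag injected before </head>
        String none = forceUtf8("<head><title>t</title></head>");
        System.out.println(none.contains("charset=UTF-8")); // prints "true"
    }
}
```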

5 Comments

The 2nd solution seems easier... And do you know if I can write Unicode chars directly in Java source code to do a regex match? Thanks.
I have added an explanation to make the cryptic code's purpose more clear.
@tchrist yes, char is UTF-16 (two bytes), and a String consists of chars. A string constant in a .class file is stored as UTF-8. When Java was introduced it was decided that String stores full Unicode; I paid tribute to that wise decision.
That isn’t quite right. UTF-16 is a variable-width encoding just like UTF-8 is. It is not “two bites”. All three of UTF-8, UTF-16, and UTF-32 can store “full Unicode”, but too many Java programmers get this wrong because they think a Java char is Unicode, or that it is UTF-16, and these are wrong. They treat UTF-16 as UCS-2 and this causes no end of bugs. So do C# programmers. So do Windows programmers. It’s a mess. Furthermore, a Java string can hold non-Unicode because it has not separated logical characters from physical bytes or chars.
@tchrist "two bytes" referred to char. I appreciate the rant; same feelings about Unicode out there. Though String should never contain binary data. There is a conceptual difference between (Unicode) text (String, char, Reader/Writer) and binary data (byte[], Input- and OutputStream). `char` is awkward; Java 8 has more code point support (UTF-32), but even so a valid Unicode sign/character might be composed of several code points. But yeah, out there the abuse is tremendous.

This took me about a week to solve; I tried all sorts of things. I'm running Java 1.8 and trying to grab an API response with Unicode characters, and I only needed to handle the emojis (in the range \uaaaa - \uffff) that were causing me problems and turning into "?".

mediaResp = mediaResp.replaceAll("\\\\u(?=[a-fA-F][0-9a-fA-F]{3})", "\\\\\\\\u");
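A quick sketch of what that replacement does: it doubles the backslash of literal \uXXXX escape sequences whose first hex digit is a-f (roughly \uA000 and above), so they survive later unescaping as plain text. The sample string here is my own:

```java
public class EscapeDouble {
    public static void main(String[] args) {
        // API response containing literal backslash-u escape sequences as text
        String mediaResp = "emoji: \\ud83d\\ude00 pound: \\u00a3";

        // Double the backslash only where the first hex digit is a-f
        String out = mediaResp.replaceAll("\\\\u(?=[a-fA-F][0-9a-fA-F]{3})", "\\\\\\\\u");

        // The emoji escapes are doubled; \u00a3 (first digit '0') is untouched
        System.out.println(out);
    }
}
```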
