I am requesting a web page that sends a Content-Encoding: gzip header, but I am stuck on how to read it.

My code:

    try {
        URLConnection connection = new URL("http://jquery.org").openConnection();
        String html = "";
        BufferedReader in = null;
        connection.setReadTimeout(10000);
        in = new BufferedReader(new InputStreamReader(connection.getInputStream()));
        String inputLine;
        while ((inputLine = in.readLine()) != null) {
            html += inputLine + "\n";
        }
        in.close();
        System.out.println(html);
        System.exit(0);
    } catch (IOException ex) {
        Logger.getLogger(Crawler.class.getName()).log(Level.SEVERE, null, ex);
    }

The output looks very messy (I was unable to paste it here; it is a jumble of symbols).

I believe this is compressed content. How do I parse it?

Note:
If I change jquery.org to jquery.com (which doesn't send that header), my code works well.

3 Answers

Actually, this is pb2q's answer, but I am posting the full code for future readers:

    // Needs imports from java.io, java.net, java.util.logging, and java.util.zip.GZIPInputStream
    try {
        URLConnection connection = new URL("http://jquery.org").openConnection();
        String html = "";
        BufferedReader in = null;
        connection.setReadTimeout(10000);
        // The changed part: decompress only when the server says the body is gzip-encoded
        String encoding = connection.getContentEncoding(); // null if the header is absent
        if ("gzip".equalsIgnoreCase(encoding)) {
            in = new BufferedReader(new InputStreamReader(new GZIPInputStream(connection.getInputStream())));
        } else {
            in = new BufferedReader(new InputStreamReader(connection.getInputStream()));
        }
        // End
        String inputLine;
        while ((inputLine = in.readLine()) != null) {
            html += inputLine + "\n";
        }
        in.close();
        System.out.println(html);
        System.exit(0);
    } catch (IOException ex) {
        Logger.getLogger(Crawler.class.getName()).log(Level.SEVERE, null, ex);
    }

1 Comment

Worked for me. Just to add to this, the Content-Encoding value can be x-gzip as well. But thanks a lot.
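
For anyone who wants to cover the x-gzip value mentioned in the comment, here is a minimal sketch of a helper that accepts both spellings. The class name ContentDecoding and its two methods are made up for illustration; only GZIPInputStream comes from the JDK.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of any library.
final class ContentDecoding {

    /** Returns true when the header value says the body is gzip-compressed. */
    static boolean isGzip(String contentEncoding) {
        if (contentEncoding == null) {
            return false; // header absent: treat the body as plain
        }
        String value = contentEncoding.trim();
        return value.equalsIgnoreCase("gzip") || value.equalsIgnoreCase("x-gzip");
    }

    /** Wraps the raw body in a GZIPInputStream only when the header calls for it. */
    static InputStream decode(InputStream body, String contentEncoding) throws IOException {
        return isGzip(contentEncoding) ? new GZIPInputStream(body) : body;
    }
}
```

With the code from the answer, this would be called as ContentDecoding.decode(connection.getInputStream(), connection.getContentEncoding()), and the result passed to the InputStreamReader. Other encodings such as deflate are not handled here.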

There is a class for this: GZIPInputStream. It is an InputStream, so it is very transparent to use.
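
To show what "transparent" means in practice, here is a small sketch that compresses a string in memory and reads it back through the same BufferedReader loop as in the question. The class GzipRoundTrip and the sample text are invented for the demo, and UTF-8 is an assumption; a real response may declare a different charset.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class: stands in for a server that sends gzip-compressed text.
final class GzipRoundTrip {

    /** Compresses text into gzip bytes, as a compressing server would. */
    static byte[] compress(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    /** Reads gzip bytes line by line; GZIPInputStream is just another InputStream in the chain. */
    static String decompress(byte[] compressed) throws IOException {
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(compressed)), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.print(decompress(compress("<html>hello</html>")));
    }
}
```

The only difference from reading an uncompressed stream is the extra GZIPInputStream wrapper around the source stream.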

1 Comment

To get it to work in both cases, you need to look at the "Content-Encoding" header that is returned. If its value is "gzip", then you should use the GZIPInputStream; otherwise, do not.

There are two cases to consider with a Content-Encoding: gzip header:

  1. If the application has already compressed the data and the server's HTTP compression compresses it again, the data ends up double-compressed.

  2. If the application has not compressed the data, HTTP compression on the server compresses it (usually with gzip), and the client is expected to decompress it. Most web browsers do this automatically when they find Content-Encoding: gzip in the response. A plain URLConnection, as in the question, does not, so you have to decompress the stream yourself.
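
If you suspect the double-compressed case, one way to check is to look for the two-byte gzip magic number (0x1f 0x8b) after decompressing once. This is only a sketch: the class GzipLayers and its methods are hypothetical, and the check is a heuristic, since uncompressed data could in principle start with the same two bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for spotting and unwrapping nested gzip layers.
final class GzipLayers {

    /** True if the data starts with the gzip magic number 0x1f 0x8b. */
    static boolean looksGzipped(byte[] data) {
        return data.length >= 2 && (data[0] & 0xff) == 0x1f && (data[1] & 0xff) == 0x8b;
    }

    /** Compresses the bytes once (used here only to build test input). */
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(data);
        }
        return out.toByteArray();
    }

    /** Removes exactly one gzip layer. */
    static byte[] gunzip(byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        }
        return out.toByteArray();
    }

    /** Removes gzip layers while the data still looks gzipped, up to maxLayers times. */
    static byte[] unwrap(byte[] data, int maxLayers) throws IOException {
        byte[] current = data;
        for (int i = 0; i < maxLayers && looksGzipped(current); i++) {
            current = gunzip(current);
        }
        return current;
    }
}
```

Capping the number of layers keeps a malformed or hostile response from sending the loop through an arbitrary number of decompression passes.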
