I am trying to download web content from Google over HTTPS, as in the link below.

link to download

With the code below, I first disable certificate validation for testing purposes by trusting all certificates, and then download the page the same way as over regular HTTP, but for some reason it does not succeed:

public static void downloadWeb() {
        // Create a new trust manager that trust all certificates
        TrustManager[] trustAllCerts = new TrustManager[] { new X509TrustManager() {
            public java.security.cert.X509Certificate[] getAcceptedIssuers() {
                return null;
            }

            public void checkClientTrusted(
                    java.security.cert.X509Certificate[] certs, String authType) {
            }

            public void checkServerTrusted(
                    java.security.cert.X509Certificate[] certs, String authType) {
            }
        } };

        // Activate the new trust manager
        try {
            SSLContext sc = SSLContext.getInstance("SSL");
            sc.init(null, trustAllCerts, new java.security.SecureRandom());
            HttpsURLConnection
                    .setDefaultSSLSocketFactory(sc.getSocketFactory());
        } catch (Exception e) {
            e.printStackTrace(); // report SSL setup failures instead of swallowing them
        }

        // Begin download, the same way as over regular HTTP
        try {
            String wordAddress = "https://www.google.com/webhp?hl=en&tab=ww#hl=en&tbs=dfn:1&sa=X&ei=obxCUKm7Ic3GqAGvoYGIBQ&ved=0CDAQBSgA&q=pronunciation&spell=1&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.&fp=c5bfe0fbd78a3271&biw=1024&bih=759";
            URLConnection yc = new URL(wordAddress).openConnection();
            BufferedReader in = new BufferedReader(new InputStreamReader(
                    yc.getInputStream()));
            String inputLine = "";
            while ((inputLine = in.readLine()) != null) {
                System.out.println(inputLine); // print the downloaded line, not the URL
            }

        } catch (IOException e) {
            e.printStackTrace(); // an empty catch hides the reason the download failed
        }

    }
  • Do you have to use Java? Commented Sep 2, 2012 at 0:55
  • Yes, but do you have a recommendation for another language? Commented Sep 2, 2012 at 0:57
  • If I didn't have to use Java, I'd use wget or cURL and make a shell script (or batch file). Commented Sep 2, 2012 at 0:58
  • That is completely new to me. I will think about it, thanks a lot. Commented Sep 2, 2012 at 1:00
  • On another note, Google's Terms of Service forbid automated queries. Commented Sep 2, 2012 at 1:05

1 Answer

You need to set HTTP headers, in particular User-Agent, so that Google treats the request as coming from a web browser. Here is sample code using Apache HttpClient:

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;

public class App1 {

    public static void main(String[] args) throws IOException {
        HttpClient httpclient = new DefaultHttpClient();
        HttpGet httpget = new HttpGet("http://_google_url_");
        httpget.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 5.1; rv:8.0) Gecko/20100101 Firefox/8.0");
        HttpResponse execute = httpclient.execute(httpget);
        File file = new File("google.html");
        FileOutputStream fout = null;
        try {
            fout = new FileOutputStream(file);
            execute.getEntity().writeTo(fout);
        } finally {
            if (fout != null) {
                fout.close();
            }
        }
    }
}

Warning: I am not responsible if you use this code and violate Google's terms of service.
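
If you would rather not add the Apache libraries, the same header can probably be set with the JDK's own HttpURLConnection. The following is only a minimal sketch: the class name UserAgentDownload, the helper names open and download, the default address in main, and the assumption that the response is UTF-8 are all mine, not part of the question. Certificates are validated normally here, with no trust-all manager.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical class name, used for illustration only.
class UserAgentDownload {

    // Same browser-like User-Agent string as in the HttpClient example above.
    static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 5.1; rv:8.0) Gecko/20100101 Firefox/8.0";

    // Prepares the connection and sets the header.
    // No request is sent until the response is actually read.
    static HttpURLConnection open(String address) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        conn.setRequestProperty("User-Agent", USER_AGENT);
        return conn;
    }

    // Reads the whole response body as text (assumes UTF-8).
    static String download(String address) throws IOException {
        HttpURLConnection conn = open(address);
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        } finally {
            conn.disconnect();
        }
        return body.toString();
    }

    public static void main(String[] args) throws IOException {
        // The default address is an assumption; pass your own URL as the first argument.
        String address = args.length > 0 ? args[0] : "https://www.google.com/";
        System.out.println(download(address));
    }
}
```

The header is set before the response is read, so it is part of the first request. Any IOException is propagated to the caller rather than caught and ignored, which makes a rejected request visible.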

8 Comments

Thanks, I am only making a few queries. Do I need to install something from Apache?
You need HttpClient and HttpCore, from the download link on the page linked above.
I am able to get content from Google now, but what I want is the dictionary page, for example the dictionary page for the word "pronunciation": google.com/…
Google prevents you from doing that. The dictionary content is probably loaded dynamically by an Ajax request. You would need to read the page you downloaded, work out how Google makes that Ajax request, and use that URL to download the dictionary content. Basically, it is fairly hard. What you want is probably the Google Dictionary API: googlesystem.blogspot.co.uk/2009/12/…
That should be easier than downloading the Google web page yourself, and it doesn't violate the terms and conditions.
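
A related point that may explain the missing dictionary content: in the URL from the question, everything after the # is a fragment. Fragments are handled by the browser (here, presumably by Google's JavaScript) and are not sent in the HTTP request, so the server only sees /webhp?hl=en&tab=ww. The sketch below uses java.net.URI to split a shortened version of that URL; the class name FragmentCheck and its helper names are hypothetical.

```java
import java.net.URI;

// Hypothetical class name, used for illustration only.
class FragmentCheck {

    // Path plus query: the part of the URL that goes into the HTTP request line.
    static String sentToServer(String address) {
        URI uri = URI.create(address);
        String path = uri.getRawPath();
        String query = uri.getRawQuery();
        String requestPath = (path == null || path.isEmpty()) ? "/" : path;
        return query == null ? requestPath : requestPath + "?" + query;
    }

    // Everything after '#': kept on the client side, or null if there is none.
    static String fragment(String address) {
        return URI.create(address).getRawFragment();
    }

    public static void main(String[] args) {
        // Shortened version of the URL from the question.
        String address =
                "https://www.google.com/webhp?hl=en&tab=ww#hl=en&tbs=dfn:1&q=pronunciation";
        System.out.println("Sent to server: " + sentToServer(address));
        System.out.println("Fragment only:  " + fragment(address));
    }
}
```

Note that the q=pronunciation parameter sits in the fragment, so a plain download of this URL never tells the server which word was requested.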