I am trying to read content from a web page with the code below. I want to extract the links, the author names below each link, and the PDF or HTML links on the right side, and save them to my database or a document file using Java.

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class HTMLParserExample1 {

   public static void main(String[] args) {

      Document doc;
      try {
         // need http protocol
         doc = Jsoup.connect("http://scholar.google.com/scholar?  l=en&q=visualization&btnG=&as_sdt=1%2C4&as_sdtp=").userAgent("Chrome").get();

         Element content = doc.getElementById("content");
         Elements links = content.getElementsByTag("a");
         for (Element link : links) {
            String linkHref = link.attr("href");
            String linkText = link.text();
            System.out.println("\nLinHREF: "+linkHref);
            System.out.println("linktext: "+linkText);
         }


      } catch (IOException e) {
         e.printStackTrace();
      }
   }
}

Above is my code. Earlier it gave me a 403 error, but after I set userAgent("Mozilla") it throws a NullPointerException.

Exception in thread "main" java.lang.NullPointerException
        at HTMLParserExample1.main(HTMLParserExample1.java:20)
Java Result: 1
BUILD SUCCESSFUL (total time: 1 second)

Please help.

  • I guess your link is wrong; it doesn't even work in my browser :) Commented Oct 30, 2013 at 8:22
  • The link scholar.google.com/scholar? l=en&q=visualization&btnG=&as_sdt=1%2C4&as_sdtp= is wrong (note the space after the "?"), hence the problem. Commented Oct 30, 2013 at 8:38

1 Answer

It works for me if I remove the spaces from the URL: http://scholar.google.com/scholar?l=en&q=visualization&btnG=&as_sdt=1%2C4&as_sdtp= works just fine. I strongly suggest using the Google API for web searches instead of parsing Google pages directly. Here is some info about the GData API.
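To illustrate why the spaces matter, here is a minimal sketch that uses only the standard library and checks the URL string before it is handed to Jsoup.connect. The class name UrlCheck and the helpers stripWhitespace and isValidUri are hypothetical names made up for this example; they are not part of jsoup.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UrlCheck {

    // Hypothetical helper: removes any whitespace that crept into the URL string.
    public static String stripWhitespace(String url) {
        return url.replaceAll("\\s+", "");
    }

    // Returns true if java.net.URI accepts the string as syntactically valid.
    // A raw space in the query part is rejected with a URISyntaxException.
    public static boolean isValidUri(String url) {
        try {
            new URI(url);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The URL from the question, including the two spaces after "?".
        String broken = "http://scholar.google.com/scholar?  l=en&q=visualization&btnG=&as_sdt=1%2C4&as_sdtp=";
        String fixed = stripWhitespace(broken);

        System.out.println("broken is valid: " + isValidUri(broken));
        System.out.println("fixed is valid:  " + isValidUri(fixed));
        System.out.println("fixed URL:       " + fixed);
    }
}
```

The cleaned string can then be passed to Jsoup.connect(...) in place of the original. Separately, doc.getElementById("content") returns null when the fetched page has no element with that id, so checking content for null before calling getElementsByTag would avoid the NullPointerException even if the page layout differs from what the code expects.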

1 Comment

The link here has an 'h' in place of the space, and I still can't get it to work. Is that the same code you are running, or do you have your own?
