6

I am looking for a way to extract the text from a web page (initially HTML) using the JDK or another library. Please help.

thanks

1
  • 1
    The best way is to use jsoup; with Gradle, add the dependency "compile 'org.jsoup:jsoup:1.9.2'". Commented Sep 26, 2016 at 18:58

3 Answers

14

Use jsoup. This is currently the most elegant library for screen scraping.

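import java.net.URL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
URL url = new URL("http://example.com/");
Document doc = Jsoup.parse(url, 3 * 1000); // fetch and parse, 3-second timeout
String title = doc.title();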

I just love its CSS selector syntax.


1 Comment

Love jsoup, but it doesn't execute the page's JavaScript. For JavaScript-rendered pages I use Selenium.
13

Use an HTML parser if at all possible; there are many available for Java.
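As one illustration of the parser route, here is a minimal sketch that uses the HTML parser bundled with the JDK (javax.swing.text.html). The class name HtmlText and the method extract are made up for this example, and the bundled parser is old and lenient, so treat this as a starting point under those assumptions rather than a robust solution.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, named for this example only.
class HtmlText {

    // Collects the text chunks the parser reports, joined by single spaces.
    static String extract(String html) {
        StringBuilder sb = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(data);
            }
        };
        try {
            // 'true' tells the parser to ignore any charset declared in the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>";
        System.out.println(extract(html));
    }
}
```

The callback only sees text nodes, so markup never reaches the output; exact whitespace handling is up to the parser.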

Or you can use a regex, as many people do. This is generally not advisable, however, unless you're doing very simple processing.
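For the very simple case, a regex-based tag stripper might look like the sketch below. The class name TagStripper and the pattern are assumptions made for illustration. It does not decode entities such as &amp;amp;, it keeps the contents of script and style elements, and it can be fooled by a '>' inside an attribute value or comment, which is why a real parser is usually the safer choice.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, named for this example only.
class TagStripper {

    // Anything from '<' up to the next '>' is treated as a tag.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String strip(String html) {
        // Replace tags with a space so adjacent blocks do not run together,
        // then collapse the resulting whitespace runs.
        String noTags = TAG.matcher(html).replaceAll(" ");
        return WHITESPACE.matcher(noTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(strip("<p>Hello <b>world</b></p>")); // prints "Hello world"
    }
}
```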

Related questions

Text extraction:

Tag stripping:


3

Here's a short method, based on java.util.Scanner, that downloads a page's raw HTML as a string:

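import java.net.URL;
import java.util.Scanner;
public static String get(String url) throws Exception {
   StringBuilder sb = new StringBuilder();
   // try-with-resources closes the Scanner and its stream; UTF-8 is an assumption
   try (Scanner sc = new Scanner(new URL(url).openStream(), "UTF-8")) {
      while (sc.hasNextLine())   // pair hasNextLine() with nextLine()
         sb.append(sc.nextLine()).append('\n');
   }
   return sb.toString();
}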

And this is how it is used:

public static void main(String[] args) throws Exception {
   System.out.println(get("http://www.yahoo.com"));
}
