How to extract web page textual content in Java? [closed]

Question

Closed. This question needs to be more focused. It is not currently accepting answers.

Closed 10 years ago.

I am looking for a way to extract the text from a web page (initially HTML), using the JDK or another library. Please help.

Thanks.

The best way is to use jsoup. In Gradle: compile 'org.jsoup:jsoup:1.9.2'

– VahidHoseini, Commented Sep 26, 2016 at 18:58

Pascal Thivent · Accepted Answer · 2010-06-14 11:12:07Z

14

Use jsoup. This is currently the most elegant library for screen scraping.

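// Needs java.net.URL, org.jsoup.Jsoup and org.jsoup.nodes.Document
URL url = new URL("http://example.com/");
// Fetch and parse the page; the second argument is the timeout in milliseconds
Document doc = Jsoup.parse(url, 3 * 1000);
String title = doc.title();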

I just love its CSS selector syntax.
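To get the page text rather than just the title, something like the sketch below may help. It parses an HTML string instead of fetching a URL so it can be run offline, and it assumes jsoup is on the classpath. The class name, method names and sample markup are made up for illustration; only `Jsoup.parse`, `select`, `first` and `text` are jsoup calls.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, not part of jsoup.
class JsoupTextSketch {

    // Returns the text of the <body>, with all markup removed.
    static String visibleText(String html) {
        Document doc = Jsoup.parse(html);
        return doc.body().text();
    }

    // Returns the text of the first element matching a CSS selector, or "" if none matches.
    static String firstMatchText(String html, String cssQuery) {
        Element el = Jsoup.parse(html).select(cssQuery).first();
        return el == null ? "" : el.text();
    }

    public static void main(String[] args) {
        // Sample markup, invented for this example
        String html = "<html><head><title>Demo</title></head>"
                + "<body><h1>Hello</h1><p class=\"intro\">Some <b>bold</b> text.</p></body></html>";
        System.out.println(visibleText(html));
        System.out.println(firstMatchText(html, "p.intro"));
    }
}
```

For a live page, the `Document` returned by `Jsoup.parse(url, timeout)` in the snippet above can be used the same way.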


1 Comment

Angsuman Chakraborty Over a year ago

Love jsoup, but it doesn't execute the page's JavaScript. For JavaScript-rendered pages I use Selenium.

Community · Accepted Answer · 2017-05-23 12:00:14Z

13

Use an HTML parser if at all possible; there are many available for Java.
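If adding a library is not an option, the JDK itself ships a lenient HTML parser in the Swing packages. The following is only a rough sketch of that route: the class and method names are invented, and the parser targets fairly old HTML, so results on modern pages may be imperfect.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class; only the javax.swing.text.html types come from the JDK.
class JdkHtmlText {

    // Collects the character data reported by the parser and ignores the tags.
    static String extractText(String html) {
        final StringBuilder sb = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                sb.append(data).append(' ');
            }
        };
        try {
            // Third argument true: ignore charset directives inside the document
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Collapse runs of whitespace left over from joining the text chunks
        return sb.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Sample markup, invented for this example
        System.out.println(extractText(
                "<html><body><h1>Hello</h1><p>Some <b>bold</b> text.</p></body></html>"));
    }
}
```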

Or you can use regex like many people do. This is generally not advisable, however, unless you're doing very simplistic processing.
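For what "very simplistic processing" might look like, here is a crude tag stripper. It is a sketch with known gaps: it does not decode entities such as `&amp;`, and it can be fooled by a `>` inside attribute values or comments. The class and method names are made up.

```java
import java.util.regex.Pattern;

// Hypothetical helper class showing the regex approach and its limits.
class RegexTagStripper {

    // Whole <script>...</script> and <style>...</style> blocks, case-insensitive, across lines
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");

    // Any remaining tag: '<' followed by everything up to the next '>'
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]+>");

    static String strip(String html) {
        String s = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        s = TAG.matcher(s).replaceAll(" ");
        // Collapse the whitespace left behind by the removed tags
        return s.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(strip("<p>Hello <b>world</b></p><script>var x = 1;</script>"));
    }
}
```

Anything beyond flat, well-behaved markup is better handled by a real parser.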

Itay Maman · Accepted Answer · 2010-06-14 11:13:25Z

3

Here's a short method that nicely wraps these details (based on java.util.Scanner):

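// Needs java.net.URL and java.util.Scanner. Returns the raw HTML source of the page.
public static String get(String url) throws Exception {
   StringBuilder sb = new StringBuilder();
   // try-with-resources closes the scanner and the underlying stream; UTF-8 is assumed
   try (Scanner sc = new Scanner(new URL(url).openStream(), "UTF-8")) {
      while (sc.hasNextLine())
         sb.append(sc.nextLine()).append('\n');
   }
   return sb.toString();
}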

And this is how it is used:

public static void main(String[] args) throws Exception {
   System.out.println(get("http://www.yahoo.com"));
}
