How to read a text from a web page with Java?

Question

I want to read the text from a web page. I don't want to get the web page's HTML code. I found this code:

    try {
        // Create a URL for the desired page
        URL url = new URL("http://www.uefa.com/uefa/aboutuefa/organisation/congress/news/newsid=1772321.html#uefa+moving+with+tide+history");       

        // Read all the text returned by the server
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
        String str;
        while ((str = in.readLine()) != null) {
            str = in.readLine().toString();
            System.out.println(str);
            // str is one line of text; readLine() strips the newline character(s)
        }
        in.close();
    } catch (MalformedURLException e) {
    } catch (IOException e) {
    }

but this code gives me the HTML code of the web page. I want to get the whole text inside this page. How can I do this with Java?

Just parse the text from the HTML tags. From there you can find the info you want and extract it.
– user1181445, Commented Mar 22, 2012 at 15:49
If you are looking for HTML to DOM, stackoverflow.com/questions/457684/… can help you.
– Jaydeep Patel, Commented Mar 22, 2012 at 16:06
FYI: you are calling in.readLine() twice per iteration, so you are actually skipping every odd line. (Just thought I should point out the bug in this code, since it is one of the first results for a Google search on reading web pages with Java.)
– JPProgrammer, Commented Nov 7, 2013 at 4:54

Fabian Barney · Accepted Answer · 2012-03-22 15:59:55Z

18

You may want to have a look at jsoup for this. It parses the HTML and text() returns only the text content, with the tags stripped:

    String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
    Document doc = Jsoup.parse(html);
    String text = doc.body().text(); // "An example link."

This example is excerpted from one on the jsoup site.

Nitzan Volman · Accepted Answer · 2012-03-22 15:59:22Z

4

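Use jsoup.

You can pick out parts of the parsed content using CSS-style selectors.

For the page in the question, you can try:

    Document doc = Jsoup.connect("http://www.uefa.com/uefa/aboutuefa/organisation/congress/news/newsid=1772321.html#uefa+moving+with+tide+history").get();
    String textContents = doc.select(".newsText").first().text();

Note that first() returns null when nothing matches the selector, so check the result before calling text() if the page layout may change.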

Paaske · Accepted Answer · 2012-03-22 15:51:55Z

0

You would have to take the content you get with your current code, then parse it and look for the tags that contain the text you want. A SAX parser is well suited to this job, provided the page is well-formed markup.

Or, if it is not a particular piece of text you want, simply remove all tags so that you are left with only the text. I guess you could use a regular expression for that.
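
The tag-removal idea could be sketched with java.util.regex roughly as follows. This is a minimal sketch, not a robust HTML parser: the class name TagStripper and the method stripTags are made up for illustration, it does not decode entities such as &amp;, and it can be fooled by a ">" inside an attribute value. Tags are replaced with a space so that text from adjacent elements does not run together, at the cost of splitting a word that has inline markup in the middle of it.

```java
import java.util.regex.Pattern;

public class TagStripper {
    // <script> and <style> elements are dropped together with their contents,
    // since what they contain is code, not page text.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    // Anything else that looks like a tag (or a simple comment).
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Hypothetical helper: returns the text of the page with markup removed.
    static String stripTags(String html) {
        String withoutScripts = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        String withoutTags = TAG.matcher(withoutScripts).replaceAll(" ");
        // Collapse the runs of whitespace left behind by the removed tags.
        return WHITESPACE.matcher(withoutTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
        System.out.println(stripTags(html)); // prints "An example link."
    }
}
```

For anything beyond a quick extraction, a real HTML parser is usually the safer choice.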

user2988879 · Accepted Answer · 2015-04-30 13:55:08Z

0

You can also use the HtmlCleaner library. Below is the code, where url is a java.net.URL pointing at the page:

    HtmlCleaner cleaner = new HtmlCleaner();
    TagNode node = cleaner.clean(url);

    System.out.println(node.getText().toString());

Answered May 7, 2013 at 8:59 by Prabuddha; edited Apr 30, 2015 at 13:55 by user2988879.

Lukasz Ronikier · Accepted Answer · 2022-01-11 07:52:08Z

0

    } catch (MalformedURLException e) {
    } catch (IOException e) {
    }

Add at least e.printStackTrace() to these empty catch blocks. It will save you many days of your life.
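
As an illustration, the reading loop from the question could be reworked along these lines, so that failures are reported and readLine() is called only once per iteration (the double call was pointed out in the comments on the question). This is a sketch under assumptions: the class name PageReader and the methods readAll and fetch are made up for the example, and UTF-8 is assumed as the page encoding. It still returns the raw HTML, so the text has to be extracted in a separate step.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class PageReader {
    // Reads every line from the source, calling readLine() exactly once per
    // iteration. try-with-resources closes the reader even if reading fails.
    static String readAll(Reader source) throws IOException {
        StringBuilder content = new StringBuilder();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                content.append(line).append('\n');
            }
        }
        return content.toString();
    }

    // Hypothetical helper: returns the page source, or null if it could not be read.
    static String fetch(String address) {
        try {
            URL url = new URL(address);
            // Assumption: the page is UTF-8 encoded.
            return readAll(new InputStreamReader(url.openStream(), StandardCharsets.UTF_8));
        } catch (MalformedURLException e) {
            e.printStackTrace(); // the address is not a valid URL
        } catch (IOException e) {
            e.printStackTrace(); // the connection or the read failed
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        // Demonstrates the loop on an in-memory source, so no network is needed.
        System.out.print(readAll(new StringReader("first\nsecond\nthird")));
    }
}
```

Keeping the loop in a method that takes a Reader makes it easy to try without a network connection.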
