Parsing HTML content using Jsoup

Question

This is my HTML source:

             <li>
                 <a href="/info/some1">Item 1<br>
                    <span class="deets">111</span>
                 </a>
             </li>

             <li>
                 <a href="/info/some2">Item 2<br>
                    <span class="deets">222</span>
                 </a>
             </li>

             <li>
                 <a href="/info/some3">Item 3<br>
                    <span class="deets">333</span>
                 </a>
             </li>

This is my Java program to fetch the content; it strips the HTML tags with a regex:

    // Needs java.net.URL, java.net.HttpURLConnection, java.io.*, android.widget.TextView
    // Declared outside the try block so the catch block can use it too
    TextView tv = (TextView) findViewById(R.id.textView1);
    try {
        URL myurl = new URL("http://www.somewebsite.com");
        HttpURLConnection con = (HttpURLConnection) myurl.openConnection();

        InputStream result = con.getInputStream();
        BufferedReader reader = new BufferedReader(new InputStreamReader(result));
        StringBuilder sb = new StringBuilder();

        // Append every line of the response, separated by the platform line separator
        for (String line; (line = reader.readLine()) != null;) {
            sb.append(line).append(System.getProperty("line.separator"));
        }
        reader.close();

        // Strip anything that looks like an HTML tag
        String final_result = sb.toString().replaceAll("\\<.*?\\>", "");
        tv.setText(final_result);
    } catch (Exception e) {
        e.printStackTrace();
        tv.setText("not working");
    }

Is there an easier way to parse the HTML content in Java using Jsoup instead of a regex?

Is there a way to get only the required content? Here I just want "Item 2 - 222" from this element:

         <li>
             <a href="/info/some2">Item 2<br>
                <span class="deets">222</span>
             </a>
         </li>

stackoverflow.com/questions/21336845/…

– ZaoTaoBao, Commented Jun 14, 2014 at 22:32

Jeeshu Mittal · Accepted Answer · 2014-06-14 23:17:03Z

Try this for easy parsing using jsoup:

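    // Imports: org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element, org.jsoup.select.Elements

    // Fetch and parse a page from a URL
    Document doc = Jsoup.connect("http://www.website.com").get();

    // Alternatively, parse HTML that is already held in a String
    Document doc1 = Jsoup.parse("<html><head><title>First parse</title></head>"
            + "<body> <p>Parsed HTML into a doc.</p></body></html>");

    // Text of the whole body with all tags removed
    String content = doc.body().text();

    // Select specific elements, such as every link that has an href attribute
    // (select() returns an Elements collection, not a single Element)
    Elements links = doc.select("a[href]");
    for (Element e : links) {
        // "abs:href" resolves the link against the document's base URI
        System.out.println("link: " + e.attr("abs:href"));
    }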

To learn more, visit Jsoup Docs

8 Comments

Vinay Potluri Over a year ago

Thank you for the help. I was able to implement the same logic more easily using Jsoup, but I want output like "Item 2 222" from the whole HTML source. How do I reference "Item 2" and "222" among the other values?

Jeeshu Mittal Over a year ago

Could you describe the desired result more specifically?

Vinay Potluri Over a year ago

Thank you again. The output I wanted was "Item 2 222" from the HTML source in the question. I got it working anyway, using: Elements href = doc.select("div.featured-resource > div.module > ul > li > a[href=/info/some2]"); System.out.println(href.text());

Jeeshu Mittal Over a year ago

You can use doc.getElementById("Item2") and then the respective method to get the desired result.

Vinay Potluri Over a year ago

OK, so here is my HTML source again: pastebin.com/8bMHbWCh. All I want is "BBC Channel 555" as the Java output from the HTML. There is no ID on any element. Can I still extract that info using Jsoup or a regex?
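
On the "or regex" part of that last comment: below is a minimal sketch, using only the Java standard library, of pulling one link's text out of HTML shaped like the snippet in the question. The class and method names (AnchorTextExtractor, anchorText) are hypothetical, and the sketch assumes every href is properly quoted and that anchors are not nested. Regex matching of HTML is brittle, so the Jsoup attribute selector shown in the comments above is likely the more robust choice whenever the library is available.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchorTextExtractor {

    /**
     * Returns the visible text of the first <a> element whose href equals
     * the given value, or null when no such anchor is found.
     * Hypothetical helper for illustration; not part of Jsoup.
     */
    public static String anchorText(String html, String href) {
        // Match the opening tag with the exact href, then capture lazily up to </a>
        Pattern anchor = Pattern.compile(
                "<a\\s[^>]*href=\"" + Pattern.quote(href) + "\"[^>]*>(.*?)</a>",
                Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
        Matcher m = anchor.matcher(html);
        if (!m.find()) {
            return null;
        }
        return m.group(1)
                .replaceAll("<[^>]*>", " ") // turn nested tags such as <br> and <span> into spaces
                .replaceAll("\\s+", " ")    // collapse runs of whitespace, including newlines
                .trim();
    }

    public static void main(String[] args) {
        // Same shape as the question's markup, with the href quotes closed
        String html =
                "<li><a href=\"/info/some1\">Item 1<br>\n<span class=\"deets\">111</span>\n</a></li>\n"
              + "<li><a href=\"/info/some2\">Item 2<br>\n<span class=\"deets\">222</span>\n</a></li>";
        System.out.println(anchorText(html, "/info/some2")); // prints: Item 2 222
    }
}
```

The match is keyed on the href value rather than on an id, since the markup in the question has none; the same idea is what the a[href=/info/some2] selector expresses in Jsoup.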
