This is my HTML source

    <li>
        <a href="/info/some1">Item 1<br>
            <span class="deets">111</span>
        </a>
    </li>

    <li>
        <a href="/info/some2">Item 2<br>
            <span class="deets">222</span>
        </a>
    </li>

    <li>
        <a href="/info/some3">Item 3<br>
            <span class="deets">333</span>
        </a>
    </li>

This is my Java program, which fetches the page content and strips the HTML tags:

    // Declared before the try block so the catch block can use it too.
    TextView tv = (TextView) findViewById(R.id.textView1);

    try {
        URL myurl = new URL("http://www.somewebsite.com");
        HttpURLConnection con = (HttpURLConnection) myurl.openConnection();

        InputStream result = con.getInputStream();
        BufferedReader reader = new BufferedReader(new InputStreamReader(result));
        StringBuilder sb = new StringBuilder();

        // Append every line, separated by the platform line separator.
        for (String line; (line = reader.readLine()) != null;) {
            sb.append(line).append(System.getProperty("line.separator"));
        }
        reader.close();

        // Strip anything that looks like an HTML tag.
        String final_result = sb.toString().replaceAll("\\<.*?\\>", "");

        tv.setText(final_result);
    } catch (Exception e) {
        e.printStackTrace();
        tv.setText("not working");
    }
  1. Is there an easier way to parse the HTML content in Java using Jsoup instead of a regex?

  2. Is there a way to get only the required content? Here I just want "Item 2 - 222" from this element:

    <li>
        <a href="/info/some2">Item 2<br>
            <span class="deets">222</span>
        </a>
    </li>
    
1 Answer

Try this for easy parsing using jsoup:

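    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    // Fetch and parse a page over HTTP (get() throws IOException)
    Document doc = Jsoup.connect("http://www.website.com").get();

    // Or parse HTML that is already held in a String
    Document doc1 = Jsoup.parse("<html><head><title>First parse</title></head>"
            + "<body> <p>Parsed HTML into a doc.</p></body></html>");

    // All visible text of the page, with the tags removed
    String content = doc.body().text();

    // To get specific elements such as links (select() returns Elements, not Element)
    Elements links = doc.select("a[href]");
    for (Element e : links) {
        System.out.println("link: " + e.attr("abs:href"));
    }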
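
For the second part of the question (only the "Item 2" entry), one option is an attribute selector on the link's href. This is a minimal sketch, not a definitive implementation. It assumes jsoup is on the classpath and that the href values in the question are quoted normally. The class name `SelectByHref`, the constant `HTML`, and the helper `firstText` are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectByHref {
    // The list from the question, embedded as a String so the sketch runs offline.
    static final String HTML =
          "<ul>"
        + "<li><a href=\"/info/some1\">Item 1<br><span class=\"deets\">111</span></a></li>"
        + "<li><a href=\"/info/some2\">Item 2<br><span class=\"deets\">222</span></a></li>"
        + "<li><a href=\"/info/some3\">Item 3<br><span class=\"deets\">333</span></a></li>"
        + "</ul>";

    // Returns the combined text of the first element matching cssQuery,
    // or an empty string when nothing matches.
    static String firstText(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        Element match = doc.select(cssQuery).first();
        return match == null ? "" : match.text();
    }

    public static void main(String[] args) {
        // Match only the link whose href is exactly /info/some2.
        System.out.println(firstText(HTML, "a[href=/info/some2]"));
    }
}
```

The same `select` call should work on a `Document` returned by `Jsoup.connect(...).get()`. `text()` includes the text of child elements, so the link text and the `span` content come back together.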

To learn more, see the jsoup docs.

8 Comments

Thank you for the help. I was able to implement the same logic more easily using Jsoup, but I want output like "Item 2 222" from the whole HTML source. How do I reference "Item 2" and "222" among the other values?
Could you explain the desired result in a bit more detail?
Thank you again. The output I wanted was "Item 2 222" from the HTML source in the question. I got it working anyway, using: Elements href = doc.select("div.featured-resource > div.module > ul > li > a[href=/info/some2]"); System.out.println(href.text());
You can use doc.getElementById("Item2") and then the respective function to get the desired results.
OK, so here is my HTML source again: pastebin.com/8bMHbWCh. All I want is "BBC Channel 555" as the output in Java from that HTML. There is no ID on any element. Can I still extract that info using Jsoup or a regex?
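
Elements without an ID can still be matched by the text they contain. This is a hedged sketch using jsoup's `:contains(text)` pseudo-selector on the list from the question. The pastebin markup is not reproduced here, so the real selector may need adjusting. `SelectByText`, `HTML`, and `linkTextContaining` are illustrative names, and jsoup is assumed to be on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class SelectByText {
    // Stand-in markup: the list from the question, with no id attributes anywhere.
    static final String HTML =
          "<ul>"
        + "<li><a href=\"/info/some1\">Item 1<br><span class=\"deets\">111</span></a></li>"
        + "<li><a href=\"/info/some2\">Item 2<br><span class=\"deets\">222</span></a></li>"
        + "<li><a href=\"/info/some3\">Item 3<br><span class=\"deets\">333</span></a></li>"
        + "</ul>";

    // Returns the text of the first <a> directly inside an <li> whose text
    // contains the given label (case-insensitive), or "" when none matches.
    static String linkTextContaining(String html, String label) {
        Document doc = Jsoup.parse(html);
        Elements links = doc.select("li > a:contains(" + label + ")");
        return links.isEmpty() ? "" : links.first().text();
    }

    public static void main(String[] args) {
        System.out.println(linkTextContaining(HTML, "Item 2"));
    }
}
```

Note that `:contains` is a substring match, so a label such as "Item 2" would also match a link reading "Item 20". Combining it with a structural selector, as in the comment above, narrows the result.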