I have an HTML page, let's say http://www.crisil.com/Ratings/RatingList/RatingDocs/_G_Telecom_Infra_India_Private_Limited_August_28_2015_RR.html

I want to parse the "About the Company" paragraph and the table below it in Java, without using any kind of selector or XPath.

I know I can use XPath, but I have many different pages from different domains, and the XPath might change from one to the next.

The "About the Company" string is constant, but its position may vary from page to page. Please suggest a solution. I have tried Jsoup, HtmlUnit, DocumentBuilder and some other libraries, but most of them seem to rely on tags.

  • Why is the requirement not to use XPath? You could search for something like <b>About CRISIL LIMITED</b>. Commented Nov 10, 2015 at 7:41
  • You could use XPath contains() to select by text, see this (you will still have to use tags in some fashion, since that's how HTML is structured, but this approach may help you avoid classes and other things that can change). Commented Nov 10, 2015 at 8:03
  • Because I have n different sources. I am now using a general XPath with Java's XPathFactory to get the table, but iterating over it is now a big problem. Commented Nov 11, 2015 at 11:55

3 Answers

You can use BeautifulSoup, which is a Python library: http://www.crummy.com/software/BeautifulSoup/

However, you should have shown us what you have tried, so that we could help you with your existing code. I could show you some code; in BeautifulSoup it is trivial to look for the next table element after a given piece of text, such as the "About the Company" heading you are reading. Write some code with it, and if it doesn't work for you, we'll help.

1 Comment

The solution should be in Java.

XPath does have the ability to select elements by innertext.

Check here: XPath selection by innertext
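
As a rough illustration of selecting by text rather than by position, the sketch below uses only the JDK's `javax.xml.xpath` classes. It assumes the page has already been turned into well-formed XHTML (real-world HTML usually needs a tidying step first, which is not shown). The class name, method names and sample markup are made up for the example, and the heading is assumed not to contain a single quote.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class AboutCompanyExtractor {
    // Hypothetical, well-formed sample markup (not the real page).
    static final String SAMPLE = "<html><body>"
        + "<table><tr><td>Unrelated</td></tr></table>"
        + "<div><b>About the Company</b></div>"
        + "<div><p>Incorporated in 2009.</p></div>"
        + "<table><tr><td>Revenue</td><td>10</td></tr>"
        + "<tr><td>Profit</td><td>2</td></tr></table>"
        + "</body></html>";

    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    // Matches any element that directly contains the heading text,
    // regardless of its tag name, class or position in the page.
    static String anchor(String heading) {
        return "//*[text()[contains(normalize-space(.), '" + heading + "')]]";
    }

    // Text of the first <p> that follows the heading, or null if none.
    static String paragraphAfter(Document doc, String heading) {
        try {
            XPath xp = XPathFactory.newInstance().newXPath();
            Node p = (Node) xp.evaluate("(" + anchor(heading) + "/following::p)[1]",
                doc, XPathConstants.NODE);
            return p == null ? null : p.getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Rows of the first <table> that follows the heading, as lists of cell text.
    static List<List<String>> tableAfter(Document doc, String heading) {
        try {
            XPath xp = XPathFactory.newInstance().newXPath();
            NodeList rows = (NodeList) xp.evaluate(
                "(" + anchor(heading) + "/following::table)[1]//tr",
                doc, XPathConstants.NODESET);
            List<List<String>> result = new ArrayList<>();
            for (int i = 0; i < rows.getLength(); i++) {
                NodeList cells = (NodeList) xp.evaluate("td|th", rows.item(i),
                    XPathConstants.NODESET);
                List<String> row = new ArrayList<>();
                for (int j = 0; j < cells.getLength(); j++) {
                    row.add(cells.item(j).getTextContent().trim());
                }
                result.add(row);
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        System.out.println(paragraphAfter(doc, "About the Company"));
        System.out.println(tableAfter(doc, "About the Company"));
    }
}
```

The expressions still mention tags (`p`, `table`, `tr`), but the only page-specific part is the heading text, so the same code can be tried on pages whose layout differs.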

I would use HtmlUnit and then look up the element with id="AboutCompanySecDivEdit":

page.getElementById("AboutCompanySecDivEdit");

which will return:

<div style="TEXT-ALIGN: justify; WIDTH: 100%; FONT-FAMILY: verdana, 'ms sans serif', arial; FONT-SIZE: 12px" id="AboutCompanySecDivEdit" jquery171011939482107256965="3">
    <p>
        <span style="FONT-FAMILY: verdana, 'ms sans serif', arial; FONT-SIZE: 12px">Incorporated in 2009, Hyderabad-based 3GTI, is an infrastructure provider of fiber optic in Andhra Pradesh. 3GTI owns a robust fiber network across Andhra Pradesh. 3GT) offers solutions for Enterprise Businesses
            &amp; service Providers. The company is promoted by Mrs.Yarla Geetha, Mrs. M Ratna Kumari &amp; Mrs. Nusrat Moinuddin.</span>
    </p>
</div>

This will only work if all your web sites have this id set, like the one you gave as an example.
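
One way to soften that limitation is to try the id first and fall back to the constant heading text when the id is missing. The sketch below is not HtmlUnit code: it swaps in the JDK's `javax.xml.xpath` so that it runs without extra libraries, and it assumes the input is already well-formed XHTML. The class and method names are hypothetical, and neither the id nor the heading may contain a single quote.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class AboutSectionLocator {
    // Returns the section text, or null if neither the id nor the heading is found.
    static String locate(String xhtml, String id, String heading) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
            XPath xp = XPathFactory.newInstance().newXPath();

            // First choice: the element carrying the known id.
            Node node = (Node) xp.evaluate("//*[@id='" + id + "']",
                doc, XPathConstants.NODE);

            // Fallback: the first <p> after any element containing the heading text.
            if (node == null) {
                node = (Node) xp.evaluate(
                    "(//*[text()[contains(normalize-space(.), '" + heading
                        + "')]]/following::p)[1]",
                    doc, XPathConstants.NODE);
            }
            // Collapse runs of whitespace left over from the markup's indentation.
            return node == null
                ? null
                : node.getTextContent().trim().replaceAll("\\s+", " ");
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse or query the page", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical page without the id, so the heading fallback is used.
        String page = "<html><body><h3>About the Company</h3>"
            + "<p>Founded in 2001.</p></body></html>";
        System.out.println(locate(page, "AboutCompanySecDivEdit", "About the Company"));
    }
}
```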
