Parsing HTML doc without using tag or any other selector in Java

Question

I have an HTML page, for example http://www.crisil.com/Ratings/RatingList/RatingDocs/_G_Telecom_Infra_India_Private_Limited_August_28_2015_RR.html

I want to parse the "About the Company" paragraph and the table below it without using any kind of selector or XPath in Java.

I know I can use XPath, but I have many different pages from different domains, and the XPath might change.

The "About the Company" string will be constant, but its position might vary from page to page. Please suggest a solution. I have tried Jsoup, HtmlUnit, DocumentBuilder, and some other libraries, but it looks like most of them rely on tags.

Why is the requirement not to use XPath? You could search for something like <b>About CRISIL LIMITED</b>.
– Ahmed Ashour, Commented Nov 10, 2015 at 7:41
You could use XPath contains() to select by text (you will still have to use tags in some fashion, since that's how HTML is structured, but this approach may help you avoid classes and other things that can change).
– halfer, Commented Nov 10, 2015 at 8:03
Because I have n different sources. I am now using a general XPath via Java's XPathFactory to get the table, but iterating over it is now a big problem.
– spondon majumdar, Commented Nov 11, 2015 at 11:55

Brij Raj Singh - MSFT · Accepted Answer · 2015-11-10 06:01:56Z

You can use BeautifulSoup; it's a Python library: http://www.crummy.com/software/BeautifulSoup/

However, you should have shown us what you have tried, so we could help you with your existing code. I could show you some code; in BeautifulSoup it's trivial to look for the next table element after a given piece of text, such as the "About the Company" string you are reading. Write some code with it, and if it doesn't work for you, we'll help.

The solution should be in Java.
– spondon majumdar, Commented over a year ago
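
For readers who need this in Java, the "next table after a given piece of text" idea can be sketched with the JDK's built-in XPath support. This is only an illustration under assumptions: the class and method names (TableAfterText, rowsAfter) are hypothetical, the sample page is made up, and the input must already be well-formed XML without a default namespace. Real-world HTML usually needs to be cleaned up first (for example by a lenient parser such as Jsoup or HtmlUnit).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of any library.
final class TableAfterText {

    /** Returns the cell texts of the first table that follows an element containing anchorText. */
    static List<List<String>> rowsAfter(String xhtml, String anchorText) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Pass the anchor as an XPath variable so quotes in it need no escaping.
            xpath.setXPathVariableResolver(name -> anchorText);
            // Any element with a matching text node, then the first table after it.
            // Note: //tr also picks up rows of nested tables, if there are any.
            NodeList rows = (NodeList) xpath.evaluate(
                    "(//*[text()[contains(normalize-space(.), $anchor)]]/following::table)[1]//tr",
                    doc, XPathConstants.NODESET);
            List<List<String>> result = new ArrayList<>();
            for (int i = 0; i < rows.getLength(); i++) {
                NodeList cells = (NodeList) xpath.evaluate(
                        "th | td", rows.item(i), XPathConstants.NODESET);
                List<String> row = new ArrayList<>();
                for (int j = 0; j < cells.getLength(); j++) {
                    row.add(cells.item(j).getTextContent().trim().replaceAll("\\s+", " "));
                }
                result.add(row);
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse or query the page", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample page; the real pages will look different.
        String page = "<html><body>"
                + "<p><b>About the Company</b></p>"
                + "<p>Sample paragraph.</p>"
                + "<table><tr><th>Item</th><th>Value</th></tr>"
                + "<tr><td>Founded</td><td>2009</td></tr></table>"
                + "</body></html>";
        for (List<String> row : rowsAfter(page, "About the Company")) {
            System.out.println(row);
        }
    }
}
```

The expression never names the tag that holds the heading text, so it should keep working if the heading moves from a b to a span or a td. It still relies on the table being a table element, which is hard to avoid.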

Community · Accepted Answer · 2017-05-23 11:59:20Z

XPath does have the ability to select elements by innertext.

Check here: XPath selection by innertext
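
As a rough illustration of selecting by inner text, the sketch below uses only the JDK's javax.xml.xpath package. The names TextAnchor and paragraphAfter are hypothetical, the sample markup is invented, and the code assumes the page is well-formed XML with no default namespace and that the wanted text sits in the first p element after the heading.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of any library.
final class TextAnchor {

    /** Returns the text of the first p element after the element whose text contains anchorText. */
    static String paragraphAfter(String xhtml, String anchorText) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // The anchor is supplied as the XPath variable $anchor.
        xpath.setXPathVariableResolver(name -> anchorText);
        String expr = "normalize-space((//*[text()[contains(normalize-space(.), $anchor)]]"
                + "/following::p)[1])";
        try {
            // Evaluates to the empty string when nothing matches.
            return xpath.evaluate(expr, new InputSource(new StringReader(xhtml)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Could not parse or query the page", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample page.
        String page = "<html><body>"
                + "<div><b>About the Company</b></div>"
                + "<p>Incorporated in 2009.</p>"
                + "</body></html>";
        System.out.println(paragraphAfter(page, "About the Company"));
    }
}
```

Only the text "About the Company" and the p element are fixed here; the element that wraps the heading can be anything.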

answered Nov 10, 2015 at 22:22

N K

MrSmith42 · Accepted Answer · 2015-11-11 14:55:01Z

I would use HtmlUnit and then look up the element with id="AboutCompanySecDivEdit":

page.getElementById("AboutCompanySecDivEdit");

which will return:

<div style="TEXT-ALIGN: justify; WIDTH: 100%; FONT-FAMILY: verdana, 'ms sans serif', arial; FONT-SIZE: 12px" id="AboutCompanySecDivEdit" jquery171011939482107256965="3">
    <p>
        <span style="FONT-FAMILY: verdana, 'ms sans serif', arial; FONT-SIZE: 12px">Incorporated in 2009, Hyderabad-based 3GTI, is an infrastructure provider of fiber optic in Andhra Pradesh. 3GTI owns a robust fiber network across Andhra Pradesh. 3GT) offers solutions for Enterprise Businesses
            &amp; service Providers. The company is promoted by Mrs.Yarla Geetha, Mrs. M Ratna Kumari &amp; Mrs. Nusrat Moinuddin.</span>
    </p>
</div>

This will only work if all your web sites have this id set, like the one you gave as an example.
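
Because the id may be missing on other sites, one possible approach is to try the id first and fall back to the constant heading text. The sketch below shows that idea with the JDK's XPath classes rather than HtmlUnit. AboutSectionLookup and aboutSection are hypothetical names, the sample pages are invented, and the input is assumed to be well-formed XML without a default namespace.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of HtmlUnit or any other library.
final class AboutSectionLookup {

    /** Tries the element with the given id first, then the first p after the anchor text. */
    static String aboutSection(String xhtml, String id, String anchorText) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Two XPath variables: $id and $anchor.
            xpath.setXPathVariableResolver(
                    name -> "id".equals(name.getLocalPart()) ? id : anchorText);
            String byId = xpath.evaluate("normalize-space(//*[@id = $id])", doc);
            if (!byId.isEmpty()) {
                return byId;
            }
            // Fallback: locate the section by its constant heading text.
            return xpath.evaluate(
                    "normalize-space((//*[text()[contains(normalize-space(.), $anchor)]]"
                            + "/following::p)[1])", doc);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse or query the page", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample pages: one with the id, one without.
        String withId = "<html><body><div id=\"AboutCompanySecDivEdit\">"
                + "<p><span>Incorporated in 2009.</span></p></div></body></html>";
        String withoutId = "<html><body><b>About the Company</b>"
                + "<p>Founded in 2001.</p></body></html>";
        System.out.println(aboutSection(withId, "AboutCompanySecDivEdit", "About the Company"));
        System.out.println(aboutSection(withoutId, "AboutCompanySecDivEdit", "About the Company"));
    }
}
```

The fallback is a heuristic: it assumes the description is the first p element after the heading, which may not hold on every site.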
