How to use XPath with LibXml 2

Question

At this address I am trying to scrape a tag (the large price, shown in bold red).

I use LibXml 2.2.

When I try to extract the tag with this XPath

//*[@class='priceLarge']

it works!

But to make writing queries easier, I would like to use Firebug in Firefox.

Firebug gives me this XPath:

/html/body/div[2]/form/table[3]/tbody/tr/td/div/table/tbody/tr[2]/td[2]/span/b

This XPath does not work; it seems it does not match anything in the document my program parses. How can I modify this XPath to scrape the item?

Cameron S · Accepted Answer · 2012-01-03 06:41:25Z


Firefox and other browsers insert tbody elements into the DOM when they parse a table, even when the HTML source does not contain them.

The tbody is probably not in the HTML your application receives, so you can remove it from your XPath:

/html/body/div[2]/form/table[3]/tr/td/div/table/tr[2]/td[2]/span/b

You can test this by saving the HTML as downloaded by your application and viewing it in a text editor.
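To illustrate the tbody point, here is a minimal sketch. It uses Python's standard xml.etree.ElementTree as a stand-in for LibXml 2 (an assumption made so the example runs without extra libraries), and the markup is a made-up, well-formed fragment rather than the real page. ElementTree only accepts relative paths, so the expressions start at the html root with "./".

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the markup the server sends: no <tbody> in it.
RAW_HTML = (
    "<html><body>"
    "<table>"
    "<tr><td>label</td>"
    "<td><span><b class='priceLarge'>$19.99</b></span></td></tr>"
    "</table>"
    "</body></html>"
)

root = ET.fromstring(RAW_HTML)  # root is the <html> element

# Path as a browser inspector would report it, including the tbody step.
with_tbody = root.findall("./body/table/tbody/tr/td[2]/span/b")

# Same path with the tbody step removed.
without_tbody = root.findall("./body/table/tr/td[2]/span/b")

print(len(with_tbody))        # prints 0
print(without_tbody[0].text)  # prints $19.99
```

The first query finds nothing because the parsed tree has no tbody element; the second matches once that step is dropped.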

Since the intent seems to be pulling information from a web page, however, your application will probably be more resistant to changes in the page if you use an XPath that depends less on the tree structure (e.g. //b[@class='priceLarge']).
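As a rough sketch of why the class-based query is more robust, the example below runs the same expression against two different made-up layouts. It again uses Python's xml.etree.ElementTree in place of LibXml 2; the markup strings and the find_price helper are hypothetical names for illustration only.

```python
import xml.etree.ElementTree as ET

# Two hypothetical layouts that place the same <b class="priceLarge"> differently.
LAYOUT_A = (
    "<html><body><div><form><table><tr><td>"
    "<b class='priceLarge'>$19.99</b>"
    "</td></tr></table></form></div></body></html>"
)
LAYOUT_B = (
    "<html><body><form><p><span>"
    "<b class='priceLarge'>$19.99</b>"
    "</span></p></form></body></html>"
)


def find_price(markup):
    """Return the text of the first <b class='priceLarge'>, or None if absent."""
    root = ET.fromstring(markup)
    # ".//" searches all descendants, so the surrounding structure does not matter.
    node = root.find(".//b[@class='priceLarge']")
    return None if node is None else node.text


print(find_price(LAYOUT_A))  # prints $19.99
print(find_price(LAYOUT_B))  # prints $19.99
```

A positional path written for the first layout would stop matching on the second, while the attribute-based query keeps working on both.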

EDIT: It seems that in addition to the tbody problem, Firefox is rendering the div (ID: divsinglecolumnminwidth) element as containing the form element (ID: handleBuy).

Looking at the HTML with an XML editor shows that the form element is a sibling of that div element, not a child of it, so the expression should start with /html/body/form/table[3].

One tool, among many others, to test your XPath expressions is HAP Testbed.

edited Jan 3, 2012 at 6:41

answered Jan 3, 2012 at 4:26

Cameron S


3 Comments

user1095058 Over a year ago

I tried it without tbody, but it still does not work! Any idea?

user1095058 Over a year ago

/html/body/div[2]/form/table[3]/tbody/tr/td/div/table/tbody/tr[2]/td[2]/span[1]/b
I tried this in an evaluator, and there it gave me the right value. I only changed the last span to span[1]. But when I use it with LibXml 2 it still does not work, and I don't know why!

Cameron S Over a year ago

Don't copy the HTML into the evaluator from Firefox view source if you are not copying it into your program. If you are downloading it in your program, input the URL into the evaluator (if possible) or save the HTML downloaded by your program and copy that. We need the HTML downloaded by the application, not processed by a browser. Aside from that, slowly build the XPath in LibXml2 until you do not hit a value and that will narrow down the differences.
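One way to do that step-by-step narrowing is sketched below, again with Python's xml.etree.ElementTree standing in for LibXml 2. The first_failing_step helper and the sample markup are hypothetical; the markup only mimics the sibling relationship between the div and the form described in the answer.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the form is a sibling of the second div, not its child.
RAW_HTML = (
    "<html><body>"
    "<div/>"
    "<div id='divsinglecolumnminwidth'/>"
    "<form id='handleBuy'><table><tr><td>"
    "<b class='priceLarge'>$19.99</b>"
    "</td></tr></table></form>"
    "</body></html>"
)


def first_failing_step(root, steps):
    """Return the index of the first step where the path stops matching, or None."""
    for count in range(1, len(steps) + 1):
        # Grow the path one step at a time and test each prefix.
        path = "./" + "/".join(steps[:count])
        if root.find(path) is None:
            return count - 1
    return None


root = ET.fromstring(RAW_HTML)
browser_steps = ["body", "div[2]", "form", "table", "tr", "td", "b"]
actual_steps = ["body", "form", "table", "tr", "td", "b"]

bad = first_failing_step(root, browser_steps)
print(browser_steps[bad])                        # prints form
print(first_failing_step(root, actual_steps))    # prints None
```

The step reported as failing is where the downloaded document and the browser's view of it diverge.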
