Scraping HTML Table with XML in R

Question

I am trying to scrape text values from a website. I have been able to parse the URL, but I am new to XPath in R, so I am not sure how to pull out all the text values that are inside tags like this one:

'<p class="MsoNormal" align="justify"> text </p>.'

How do I specify the path to that specific tag and get its text value? This is what I am trying right now:

pizzaraw<-xpathSApply(pizzadoc, "//p[@class='MsoNormal']", xmlValue)

Is this the right approach? R does not seem to respond to the code.

Quick summary of XPath: //p will give you all p elements (ignoring nesting). //p[1] will return the first p. //p[1]/text() will return the text contents. //p[1]/@class will return the contents of the class attribute, and so on.
– helderdarocha, Commented Apr 17, 2014 at 20:43
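The expressions in the comment above can be tried out with a small sketch like the following. The sample HTML is invented for illustration and is not the asker's page. One detail worth keeping in mind: in XPath 1.0, //p[1] selects the first p child of each parent element, so it can return more than one node; (//p)[1] selects the first p in the whole document.

```r
library(XML)

# Hypothetical sample page, made up for this illustration
html <- '<html><body>
<p class="MsoNormal" align="justify"> first </p>
<div><p class="other"> second </p><p class="MsoNormal"> third </p></div>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# All p elements, at any nesting depth
all_p <- xpathSApply(doc, "//p", xmlValue, trim = TRUE)
# "first" "second" "third"

# First p child of each parent (one under body, one under div)
first_per_parent <- xpathSApply(doc, "//p[1]", xmlValue, trim = TRUE)
# "first" "second"

# First p in the whole document
first_in_doc <- xpathSApply(doc, "(//p)[1]", xmlValue, trim = TRUE)
# "first"

# Text nodes directly under those p elements
text_nodes <- xpathSApply(doc, "//p[1]/text()", xmlValue, trim = TRUE)

# Values of the class attribute of every p
classes <- as.character(xpathSApply(doc, "//p/@class"))
```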
It might be helpful to look at the selectr package also. This allows you to use CSS selectors rather than XPath expressions in tandem with the XML package. It also allows you to easily handle namespaces, which may be the problem you are having here.
– jdharrison, Commented Apr 17, 2014 at 20:59
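A minimal sketch of the selectr route suggested above, assuming the selectr package is installed. The sample HTML is again invented for illustration; querySelectorAll() and css_to_xpath() are functions exported by selectr.

```r
library(XML)
library(selectr)

# Hypothetical sample page, made up for this illustration
html <- '<html><body>
<p class="MsoNormal" align="justify"> text </p>
<p class="other"> skipped </p>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# The CSS selector "p.MsoNormal" matches p elements whose class list contains MsoNormal
nodes <- querySelectorAll(doc, "p.MsoNormal")
values <- sapply(nodes, xmlValue, trim = TRUE)
# "text"

# selectr can also show the XPath expression it generates for a selector
generated_xpath <- css_to_xpath("p.MsoNormal")
print(generated_xpath)
```

Unlike the exact comparison @class='MsoNormal', a CSS class selector also matches elements that carry several classes, such as class="MsoNormal first".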

G. Grothendieck · Accepted Answer · 2014-04-17 20:58:00Z

It's difficult to know what is wrong given that your example is not self-contained, but here is a self-contained one that works:

Lines <- '<html>
<p class="MsoNormal" align="justify"> text </p>
</html>
'

library(XML)
root <- htmlTreeParse(Lines, asText = TRUE, useInternalNodes = TRUE)
doc <- xmlRoot(root)
xpathSApply(doc, '//p[@class="MsoNormal"]', xmlValue, trim = TRUE)
## [1] "text"
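As a variation on the example above, the predicate can test both attributes from the question, and counting the matched nodes is a quick way to see whether the XPath finds anything at all. This is only a sketch: the extra p elements are invented for illustration, and htmlParse() is used as a shorthand for htmlTreeParse() with useInternalNodes = TRUE.

```r
library(XML)

# Hypothetical input: the original line plus two invented paragraphs
Lines <- '<html>
<p class="MsoNormal" align="justify"> text </p>
<p class="MsoNormal" align="left"> other </p>
<p class="Title" align="justify"> heading </p>
</html>'

doc <- htmlParse(Lines, asText = TRUE)

# Require both the class and the align attribute to match
justified <- xpathSApply(doc,
                         '//p[@class="MsoNormal" and @align="justify"]',
                         xmlValue, trim = TRUE)
# "text"

# Number of nodes matched by the class-only expression
n_matches <- length(getNodeSet(doc, '//p[@class="MsoNormal"]'))
# 2

# An expression that matches nothing yields a zero-length result
none <- xpathSApply(doc, '//p[@class="NoSuchClass"]', xmlValue)
length(none)
# 0
```

If the count is zero on a real page, the class value or the element name in the expression probably differs from what the parsed document contains.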
