Hello, I'm quite new to R and I'm trying to scrape a website for some data. The problem is that the markup around the data is inconsistent.
Sometimes I see:
<div class = "text"> The text I want </div>
And other times I see:
<div class = "text"><div class = "text"> The text I want </div></div>
So far I'm using the XML package and the following R code:
library(XML)
doc <- htmlTreeParse(url, useInternalNodes = TRUE)
text <- xpathSApply(doc, "//*/div[@class='text']", xmlValue)
The problem is that this code returns "The text I want" twice when it comes across the second example, because the XPath matches both the outer and the inner <div class = "text"> element. I only want to count it once, because it only appears once.
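Here is a small self-contained reproduction of what I'm seeing. The inline HTML string is a made-up stand-in for the real page (the name html is just for illustration), and I'm assuming htmlParse(..., asText = TRUE) parses it the same way my htmlTreeParse call does for the URL:

```r
library(XML)

# Hypothetical inline HTML standing in for the real page:
# one plain div and one nested div, as in the two examples above.
html <- '<html><body>
<div class="text"> The text I want </div>
<div class="text"><div class="text"> The text I want </div></div>
</body></html>'

# Parse the string instead of fetching a URL.
doc <- htmlParse(html, asText = TRUE)

# Same XPath as in my real code.
text <- xpathSApply(doc, "//*/div[@class='text']", xmlValue)

# Three matches, although the text only appears twice:
# the nested example is matched by both its outer and inner div.
length(text)
```

So what I'm after is a length of 2 here rather than 3.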
Any pointers are greatly appreciated!