Using R to scrape nested XML data

Question

Hello, I'm quite new to R and I'm trying to scrape a website for some data. The problem is that the data is stored inconsistently.

Sometimes I see:

<div class = "text">   The text I want   </div>

And other times I see:

<div class = "text"><div class = "text">   The text I want   </div></div>

So far I'm using the XML package and the following R code:

library(XML)
doc <- htmlTreeParse(url, useInternalNodes = TRUE)
text <- xpathSApply(doc, "//*/div[@class='text']", xmlValue)

The problem is that this code will count "The text I want" twice when it comes across the second example, because the XPath matches both the outer and the inner <div class = "text"> elements. I only want to count it once because it only appears once.

Any pointers are greatly appreciated!

user1609452 · Accepted Answer · 2013-01-08 15:32:17Z

2

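library(XML)

xtext <- "<div class = \"text\">   The text I want   </div>
<div class = \"text\"><div class = \"text\">   The text I want   </div></div>"
doc <- htmlParse(xtext, asText = TRUE)
# text() selects only the text nodes that are direct children of each matching div
xpathSApply(doc, "//*/div[@class='text']/text()")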

#[[1]]
#   The text I want    

#[[2]]
#   The text I want
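A note on why this works: text() matches only the text nodes that are direct children of each matching div, so the outer wrapper in the nested case contributes nothing. The result is a list of nodes rather than strings. Here is a minimal sketch for turning it into a trimmed character vector, assuming the same two-case sample input. The name `texts` and the `nzchar` filter are illustrative additions, not part of the answer.

```r
library(XML)

# Sample input covering both cases: a plain div and a nested div
xtext <- "<div class = \"text\">   The text I want   </div>
<div class = \"text\"><div class = \"text\">   The text I want   </div></div>"
doc <- htmlParse(xtext, asText = TRUE)

# Apply xmlValue to each matched text node to get strings instead of nodes
texts <- xpathSApply(doc, "//div[@class='text']/text()", xmlValue)

# Trim surrounding whitespace and drop any whitespace-only matches
texts <- trimws(texts)
texts <- texts[nzchar(texts)]
```

The whitespace filter is there in case a real page has line breaks or indentation between the outer and inner div, which would otherwise show up as empty entries.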

Richie Cotton · Accepted Answer · 2013-01-08 13:58:12Z

2

If you just want to count occurrences, then you should be able to find all nodes

all_text <- xpathSApply(doc, "//*/div[@class='text']", xmlValue)

and doubled nodes

doubled_text <- xpathSApply(doc, "//*/div[@class='text']/div[@class='text']", xmlValue)

then subtract the length of one from the other to get a true reflection.
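Put together, the subtraction could look like the sketch below. It assumes a small two-case sample string in place of the real page, and the names `xtext` and `n_occurrences` are made up for illustration.

```r
library(XML)

# Assumed sample input: one plain div and one nested pair
xtext <- "<div class = \"text\">   The text I want   </div>
<div class = \"text\"><div class = \"text\">   The text I want   </div></div>"
doc <- htmlParse(xtext, asText = TRUE)

# Every div with class 'text', including wrappers and their inner divs
all_text <- xpathSApply(doc, "//*/div[@class='text']", xmlValue)

# Only the divs that sit directly inside another div with class 'text'
doubled_text <- xpathSApply(doc, "//*/div[@class='text']/div[@class='text']", xmlValue)

# Each nested pair was matched twice, so subtract the inner matches once
n_occurrences <- length(all_text) - length(doubled_text)
```

This gives a count only; it does not return the de-duplicated text itself.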

1 Comment

Rez99 Over a year ago

Hello, I don't actually want to count occurrences (apologies for the confusion). I want to return "The text I want" everywhere it is found on the web page, but without duplicating it when it is found within nested <div class = "text"> elements.
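
One possible way to get exactly that, sketched under the assumption that the wanted text always sits in the innermost div: select only the divs with class "text" that contain no further div of the same class. The sample string and the names `innermost` and `wanted` are illustrative, not from the thread.

```r
library(XML)

# Assumed sample input: one plain div and one nested pair
xtext <- "<div class = \"text\">   The text I want   </div>
<div class = \"text\"><div class = \"text\">   The text I want   </div></div>"
doc <- htmlParse(xtext, asText = TRUE)

# Keep a div only if none of its descendants is also a div with class 'text'
innermost <- "//div[@class='text'][not(.//div[@class='text'])]"

# Extract the text of each innermost div and trim surrounding whitespace
wanted <- trimws(xpathSApply(doc, innermost, xmlValue))
```

Because the wrapper divs are filtered out by the predicate, each piece of text is returned once regardless of how deep the nesting goes.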
