Hello, I'm quite new to R and I'm trying to scrape a website for some data. The problem is that the data is stored inconsistently.

Sometimes I see:

<div class = "text">   The text I want   </div>

And other times I see:

<div class = "text"><div class = "text">   The text I want   </div></div>

So far I'm using the XML package and the following R code:

doc <- htmlTreeParse(url, useInternalNodes = TRUE)
text <- xpathSApply(doc, "//*/div[@class='text']", xmlValue)

The problem is that this code returns "The text I want" twice for the second example, because both the outer and the inner <div class = "text"> match, and the outer div's value includes the inner div's text. I only want it once, because it only appears once.

Any pointers are greatly appreciated!

2 Answers

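library(XML)

# Sample input: one plain div and one nested pair holding the same text
xtext <- "<div class = \"text\">   The text I want   </div>
<div class = \"text\"><div class = \"text\">   The text I want   </div></div>"
doc <- htmlParse(xtext, asText = TRUE)
# text() selects only direct text children, so the outer nested div,
# which has no text of its own, contributes nothing
xpathSApply(doc, "//*/div[@class='text']/text()")

#[[1]]
#   The text I want    

#[[2]]
#   The text I want    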

If you just want to count occurrences, you should be able to find all matching nodes:

all_text <- xpathSApply(doc, "//*/div[@class='text']", xmlValue)

and the doubled (nested) nodes:

doubled_text <- xpathSApply(doc, "//*/div[@class='text']/div[@class='text']", xmlValue)

then subtract the length of the second from the length of the first to get the true count.
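
A minimal sketch of that subtraction, assuming the XML package is installed. The `sample_html` string and the `true_count` name are made up for illustration and stand in for the real page:

```r
library(XML)

# Hypothetical sample: one plain div and one nested pair with the same text
sample_html <- paste0(
  "<div class='text'>The text I want</div>",
  "<div class='text'><div class='text'>The text I want</div></div>"
)
doc <- htmlParse(sample_html, asText = TRUE)

# Every div with class 'text', including both levels of the nested pair
all_text <- xpathSApply(doc, "//div[@class='text']", xmlValue)

# Only divs that sit directly inside another div with class 'text'
doubled_text <- xpathSApply(doc, "//div[@class='text']/div[@class='text']", xmlValue)

# Each nested div was matched once too often, so subtract those matches
true_count <- length(all_text) - length(doubled_text)
```

This only gives a count, and it assumes the duplicate is a direct child of the outer div, as in the example from the question.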

1 Comment

Hello, I don't actually want to count occurrences (apologies for the confusion). I want to return "The text I want" everywhere it is found on the web page, but without duplicating it when it sits inside nested <div class = "text"> elements.
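
One possible way to get each text once, sketched under the assumption that the innermost div is always the one holding the text: restrict the XPath to divs with class "text" that contain no further div with that class. The `sample_html`, `innermost`, and `texts` names are hypothetical, chosen for this illustration only.

```r
library(XML)

# Hypothetical sample mirroring the two cases from the question
sample_html <- paste0(
  "<div class='text'>   The text I want   </div>",
  "<div class='text'><div class='text'>   The text I want   </div></div>"
)
doc <- htmlParse(sample_html, asText = TRUE)

# Match a div with class 'text' only if it has no descendant div with the
# same class, so the outer wrapper of a nested pair is skipped
innermost <- "//div[@class='text'][not(.//div[@class='text'])]"

# xmlValue extracts the text; trimws drops the surrounding spaces
texts <- trimws(xpathSApply(doc, innermost, xmlValue))
```

With this sample, `texts` holds two entries, one per place the text appears, regardless of how deeply the divs are nested.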
