Using R and the XML package, I have been trying to extract addresses from html files that have a structure similar to this:
<!DOCTYPE html>
<html>
<body>
<div class='entry'>
<span class='name'>Marcus Smith</span>
<span class='town'>New York</span>
<span class='phone'>123456789</span>
</div>
<div class='entry'>
<span class='name'>Henry Higgins</span>
<span class='town'>London</span>
</div>
<div class='entry'>
<span class='name'>Paul Miller</span>
<span class='town'>Boston</span>
<span class='phone'>987654321</span>
</div>
</body>
</html>
I first parse the file and get its root node:
library(XML)
html <- htmlTreeParse("test.html", useInternalNodes = TRUE)
root <- xmlRoot(html)
Now, I can get all the names with this:
xpathSApply(root, "//span[@class='name']", xmlValue)
## [1] "Marcus Smith" "Henry Higgins" "Paul Miller"
The issue is that some elements are not present for all the addresses. In the example, this is the phone number:
xpathSApply(root, "//span[@class='phone']", xmlValue)
## [1] "123456789" "987654321"
With this approach, I cannot tell which phone number belongs to which person. So I tried to first extract each address book entry as a whole:
divs <- getNodeSet(root, "//div[@class='entry']")
divs[[1]]
## <div class="entry">
## <span class="name">Marcus Smith</span>
## <span class="town">New York</span>
## <span class="phone">123456789</span>
## </div>
From the output I assumed that I had reached my goal and that I could get, e.g., the name corresponding to the first entry as follows:
xpathSApply(divs[[1]], "//span[@class='name']", xmlValue)
## [1] "Marcus Smith" "Henry Higgins" "Paul Miller"
But even though the output of divs[[1]] showed the data for Marcus Smith only, I get all three names back.
Why is this? And what do I have to do to extract the address data in such a way that I know which values for name, town and phone belong together?
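For reference, a likely explanation is that an XPath expression starting with `//` is absolute: it searches from the document root, even when it is evaluated on a single node such as `divs[[1]]`. Prefixing it with a dot (`.//`) makes it relative to that node. The sketch below illustrates this under a few assumptions: the page is inlined as a string so the example is self-contained, and `get_field` is a hypothetical helper (not part of the XML package) that returns `NA` when a span is missing.

```r
library(XML)

# Inline copy of the sample page (assumption: same structure as test.html)
page <- "<html><body>
<div class='entry'><span class='name'>Marcus Smith</span><span class='town'>New York</span><span class='phone'>123456789</span></div>
<div class='entry'><span class='name'>Henry Higgins</span><span class='town'>London</span></div>
<div class='entry'><span class='name'>Paul Miller</span><span class='town'>Boston</span><span class='phone'>987654321</span></div>
</body></html>"

doc  <- htmlParse(page, asText = TRUE)
divs <- getNodeSet(doc, "//div[@class='entry']")

# Hypothetical helper: text of the first matching span inside one entry,
# or NA if the entry has no such span. Note the leading dot in ".//".
get_field <- function(node, cls) {
  hits <- getNodeSet(node, sprintf(".//span[@class='%s']", cls))
  if (length(hits) == 0) NA_character_ else xmlValue(hits[[1]])
}

fields <- c("name", "town", "phone")

# One named character vector per entry, so the values stay aligned
rows <- lapply(divs, function(d) {
  vapply(fields, function(f) get_field(d, f), character(1))
})

address_book <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
print(address_book)
```

Because every entry is processed on its own, a missing phone number shows up as `NA` in that entry's row instead of shifting the remaining values.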