
Using R and the XML package, I have been trying to extract addresses from HTML files that have a structure similar to this:

<!DOCTYPE html>
<html>
  <body>
    <div class='entry'>
      <span class='name'>Marcus Smith</span>
      <span class='town'>New York</span>
      <span class='phone'>123456789</span>
    </div>
    <div class='entry'>
      <span class='name'>Henry Higgins</span>
      <span class='town'>London</span>
    </div>
    <div class='entry'>
      <span class='name'>Paul Miller</span>
      <span class='town'>Boston</span>
      <span class='phone'>987654321</span>
    </div>
  </body>
</html>

I first do the following:

library(XML)
html <- htmlTreeParse("test.html", useInternalNodes = TRUE)
root <- xmlRoot(html)

Now, I can get all the names with this:

xpathSApply(root, "//span[@class='name']", xmlValue)
## [1] "Marcus Smith"  "Henry Higgins" "Paul Miller"

The issue now is that some elements are not present for all the addresses. In the example, this is the phone number:

xpathSApply(root, "//span[@class='phone']", xmlValue)
## [1] "123456789" "987654321"

If I do things like this, there is no way for me to assign the phone numbers to the right person. So, I tried to first extract the entire address book entry as follows:

divs <- getNodeSet(root, "//div[@class='entry']")
divs[[1]]
## <div class="entry">
##   <span class="name">Marcus Smith</span>
##   <span class="town">New York</span>
##   <span class="phone">123456789</span>
## </div> 

From the output, I figured that I had reached my goal and that I could get, e.g., the name corresponding to the first entry as follows:

xpathSApply(divs[[1]], "//span[@class='name']", xmlValue)
## [1] "Marcus Smith"  "Henry Higgins" "Paul Miller" 

But even though the output of divs[[1]] showed the data for Marcus Smith only, I get all three names back.

Why is this? And what do I have to do to extract the address data in such a way that I know which values for name, town and phone belong together?

4 Answers


If you have an unknown number of items per entry you can leverage something like dplyr::bind_rows or data.table::rbindlist in combination with rvest as follows:

library(rvest)
library(dplyr)
# Little helper function to extract all children and name them by class
extract_info <- function(node){
  child <- html_children(node)
  as.list(setNames(child %>% html_text(), child %>% html_attr("class")))
}

# txt is assumed to hold the HTML from the question as a character string
doc <- read_html(txt)
doc %>% html_nodes(".entry") %>% lapply(extract_info) %>% bind_rows

Gives you:

           name     town     phone
          (chr)    (chr)     (chr)
1  Marcus Smith New York 123456789
2 Henry Higgins   London        NA
3   Paul Miller   Boston 987654321

Alternatively, use rbindlist(fill=TRUE) instead of bind_rows, which returns a data.table. Or, with purrr, use map_df(as.list) instead.
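
To illustrate the rbindlist route, here is a minimal sketch. The entries list is hypothetical hand-written data, shaped like what a helper such as extract_info would return for each div; it is not produced by parsing anything here.

```r
library(data.table)

# Hypothetical per-entry lists, shaped like the output of extract_info()
entries <- list(
  list(name = "Marcus Smith",  town = "New York", phone = "123456789"),
  list(name = "Henry Higgins", town = "London"),
  list(name = "Paul Miller",   town = "Boston",   phone = "987654321")
)

# fill = TRUE matches columns by name and pads missing ones with NA
address_book <- rbindlist(entries, fill = TRUE)
```

The entry without a phone number still ends up as its own row, with NA in the phone column, so values stay aligned per person.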


1 Comment

Thanks for mentioning rvest, which I didn't know. For my current little project, I won't rewrite everything to use rvest, since I managed to reach my goal by only changing minor parts of my code thanks to the answer by Karsten. But I'll give it a try next time.

purrr makes rvest much more useful by nesting nodes and hacking the resulting list into a data.frame:

library(rvest)
library(purrr)

# html is assumed to hold the HTML from the question as a character string
html %>% read_html() %>% 
    # select all entry divs
    html_nodes('div.entry') %>% 
    # for each entry div, select all spans, keeping results in a list element
    map(html_nodes, css = 'span') %>% 
    # for each list element, set the name of the text to the class attribute
    map(~setNames(html_text(.x), html_attr(.x, 'class'))) %>% 
    # convert named vectors to list elements; convert list to a data.frame
    map_df(as.list) %>% 
    # convert character vectors to appropriate types
    # (dmap() has moved from purrr to the purrrlyr package)
    purrrlyr::dmap(type.convert, as.is = TRUE)

## # A tibble: 3 x 3
##            name     town     phone
##           <chr>    <chr>     <int>
## 1  Marcus Smith New York 123456789
## 2 Henry Higgins   London        NA
## 3   Paul Miller   Boston 987654321

You could, of course, replace all the purrr functions with base R equivalents, though it will require a few more steps.

2 Comments

Possibly use purrr::set_names() rather than setNames(), since you're already deeply invested in purrr idioms.
I've never found it to be very necessary, as setNames is already easily pipable. Ultimately I guess I stick to setNames out of habit, as I don't always have purrr loaded.

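The problem is the XPath expression: a path that starts with "//" is absolute, so it always searches from the document root, no matter which node you pass in. Use a path relative to the entry node instead.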

This code worked on the test data:

one.entry <- function(x) {
    name <- getNodeSet(x, "span[@class='name']")
    phone <- getNodeSet(x, "span[@class='phone']")
    town <- getNodeSet(x, "span[@class='town']")

    name <- if(length(name)==1) xmlValue(name[[1]]) else NA
    phone <- if(length(phone)==1) xmlValue(phone[[1]]) else NA
    town <- if(length(town)==1) xmlValue(town[[1]]) else NA

    return(data.frame(name=name, phone=phone, town=town, stringsAsFactors=FALSE))
}

do.call(rbind, lapply(divs, one.entry))
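
To see the difference between an absolute and a relative path directly, here is a small sketch. It assumes the XML package and uses a trimmed-down, hypothetical copy of the sample document; the variable names are only for illustration.

```r
library(XML)

# Trimmed-down copy of the sample document (two entries only)
txt <- "<html><body>
  <div class='entry'><span class='name'>Marcus Smith</span><span class='phone'>123456789</span></div>
  <div class='entry'><span class='name'>Henry Higgins</span></div>
</body></html>"

doc  <- htmlParse(txt, asText = TRUE)
divs <- getNodeSet(doc, "//div[@class='entry']")

# "//span" is absolute: it searches the whole document, whatever node is passed in
abs_names <- xpathSApply(divs[[1]], "//span[@class='name']", xmlValue)

# ".//span" is relative: it searches only the descendants of the node passed in
rel_names <- xpathSApply(divs[[1]], ".//span[@class='name']", xmlValue)
```

Here abs_names holds both names while rel_names holds only the name from the first entry. Compared with the plain "span[...]" form used above, ".//" also finds spans that are nested more deeply inside the div, which may or may not be what you want.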

1 Comment

Thank you very much. It indeed seems that // goes to the root. This also works: xpathSApply(divs[[1]], "span[@class='name']", xmlValue). I was aware that you can search for nodes with //node and /node, but didn't know that just node also works.

Ugly base R + rvest solution (I cheated and used piping to avoid hellish nested parentheses or interim assignments), posted to show how much better @alistaire's solution is:

library(rvest)
library(magrittr)

read_html("<!DOCTYPE html>
<html>
  <body>
    <div class='entry'>
      <span class='name'>Marcus Smith</span>
      <span class='town'>New York</span>
      <span class='phone'>123456789</span>
    </div>
    <div class='entry'>
      <span class='name'>Henry Higgins</span>
      <span class='town'>London</span>
    </div>
    <div class='entry'>
      <span class='name'>Paul Miller</span>
      <span class='town'>Boston</span>
      <span class='phone'>987654321</span>
    </div>
  </body>
</html>") -> pg

pg %>% 
  html_nodes('div.entry') %>% 
  lapply(html_nodes, css='span') %>% 
  lapply(function(x) { 
    setNames(html_text(x), html_attr(x, 'class')) %>% 
      as.list() %>% 
      as.data.frame(stringsAsFactors=FALSE)
  }) %>% 
  lapply(., unlist) %>% 
  lapply("[", unique(unlist(c(sapply(., names))))) %>% 
  do.call(rbind, .) %>% 
  as.data.frame(stringsAsFactors=FALSE)
