How to parse an html file with a nested structure?

Question

Using R and the XML package, I have been trying to extract addresses from html files that have a structure similar to this:

<!DOCTYPE html>
<html>
  <body>
    <div class='entry'>
      <span class='name'>Marcus Smith</span>
      <span class='town'>New York</span>
      <span class='phone'>123456789</span>
    </div>
    <div class='entry'>
      <span class='name'>Henry Higgins</span>
      <span class='town'>London</span>
    </div>
    <div class='entry'>
      <span class='name'>Paul Miller</span>
      <span class='town'>Boston</span>
      <span class='phone'>987654321</span>
    </div>
  </body>
</html>

I first do the following:

library(XML)
html <- htmlTreeParse("test.html", useInternalNodes = TRUE)
root <- xmlRoot(html)

Now, I can get all the names with this:

xpathSApply(root, "//span[@class='name']", xmlValue)
## [1] "Marcus Smith"  "Henry Higgins" "Paul Miller"

The issue is that some elements are not present for all the addresses. In the example, this is the phone number:

xpathSApply(root, "//span[@class='phone']", xmlValue)
## [1] "123456789" "987654321"

If I do things like this, there is no way for me to assign the phone numbers to the right person. So, I tried to first extract the entire address book entry as follows:

divs <- getNodeSet(root, "//div[@class='entry']")
divs[[1]]
## <div class="entry">
##   <span class="name">Marcus Smith</span>
##   <span class="town">New York</span>
##   <span class="phone">123456789</span>
## </div>

From the output, I figured that I had reached my goal and that I could get, e.g., the name corresponding to the first entry as follows:

xpathSApply(divs[[1]], "//span[@class='name']", xmlValue)
## [1] "Marcus Smith"  "Henry Higgins" "Paul Miller"

But even though the output of divs[[1]] showed the data for Marcus Smith only, I get all three names back.

Why is this? And what do I have to do to extract the address data in such a way that I know which values for name, town and phone belong together?

Rentrop · Accepted Answer · 2016-08-19 22:05:53Z

2

If you have an unknown number of items per entry you can leverage something like dplyr::bind_rows or data.table::rbindlist in combination with rvest as follows:

require(rvest)
require(dplyr)
# Little helper-function to extract all children and set Names
extract_info <- function(node){
  child <- html_children(node)
  as.list(setNames(child %>% html_text(), child %>% html_attr("class")))
}

# txt holds the HTML from the question as a string; a file path such as "test.html" also works
doc <- read_html(txt)
doc %>% html_nodes(".entry") %>% lapply(extract_info) %>% bind_rows

Gives you:

           name     town     phone
          (chr)    (chr)     (chr)
1  Marcus Smith New York 123456789
2 Henry Higgins   London        NA
3   Paul Miller   Boston 987654321

Alternatively, use rbindlist(fill=TRUE) instead of bind_rows, which gives you a data.table. Or, with purrr, use map_df(as.list) instead.
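
As a rough illustration of the rbindlist variant, here is a minimal sketch. It assumes rvest and data.table are installed; the shortened two-entry `txt` string is a stand-in made up for this example, and `entries` and `result` are illustrative names.

```r
library(rvest)
library(data.table)

# Illustrative stand-in for the HTML in the question (assumption: passed as a string)
txt <- "<html><body>
  <div class='entry'><span class='name'>Marcus Smith</span><span class='town'>New York</span><span class='phone'>123456789</span></div>
  <div class='entry'><span class='name'>Henry Higgins</span><span class='town'>London</span></div>
</body></html>"

# Same helper as above: one named list per entry, names taken from the class attribute
extract_info <- function(node) {
  child <- html_children(node)
  as.list(setNames(html_text(child), html_attr(child, "class")))
}

# One named list per entry div
entries <- lapply(html_nodes(read_html(txt), ".entry"), extract_info)

# fill = TRUE pads entries that lack a column (here: phone) with NA
result <- rbindlist(entries, fill = TRUE)
```

Because columns are matched by name, an entry without a phone span simply gets NA in that column instead of shifting the other values.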

edited Aug 19, 2016 at 22:05

answered Aug 19, 2016 at 21:58

Rentrop

1 Comment

Stibu Over a year ago

Thanks for mentioning rvest, which I didn't know. For my current little project, I won't rewrite everything to use rvest, since I managed to reach my goal by only changing minor parts of my code thanks to the answer by Karsten. But I'll give it a try next time.

alistaire · Accepted Answer · 2016-08-19 22:07:18Z

2

purrr makes rvest much more useful by nesting nodes and hacking the resulting list into a data.frame:

library(rvest)
library(purrr)

html %>% read_html() %>% 
    # select all entry divs
    html_nodes('div.entry') %>% 
    # for each entry div, select all spans, keeping results in a list element
    map(html_nodes, css = 'span') %>% 
    # for each list element, set the name of the text to the class attribute
    map(~setNames(html_text(.x), html_attr(.x, 'class'))) %>% 
    # convert named vectors to list elements; convert list to a data.frame
    map_df(as.list) %>% 
    # convert character vectors to appropriate types
    # (dmap() has since moved from purrr to the purrrlyr package)
    dmap(type.convert, as.is = TRUE)

## # A tibble: 3 x 3
##            name     town     phone
##           <chr>    <chr>     <int>
## 1  Marcus Smith New York 123456789
## 2 Henry Higgins   London        NA
## 3   Paul Miller   Boston 987654321

You could, of course, replace all the purrr with base, though it will require a few more steps.

edited Aug 19, 2016 at 22:07

answered Aug 19, 2016 at 22:02

alistaire

2 Comments

hrbrmstr Over a year ago

Possibly use purrr::set_names() instead of setNames(), since you're already deeply invested in purrr idioms.

alistaire Over a year ago

I've never found it to be very necessary, as setNames is already easily pipable. Ultimately I guess I stick to setNames out of habit, as I don't always have purrr loaded.

Karsten W. · Accepted Answer · 2016-08-19 21:45:52Z

1

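The problem is the XPath expression: a path that starts with "//" is absolute and always searches from the document root, even when you apply it to a single node. Use a relative path such as "span[@class='name']" (or ".//span[@class='name']") to search only below the current node.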

This code worked on the test data:

one.entry <- function(x) {
    name <- getNodeSet(x, "span[@class='name']")
    phone <- getNodeSet(x, "span[@class='phone']")
    town <- getNodeSet(x, "span[@class='town']")

    name <- if(length(name)==1) xmlValue(name[[1]]) else NA
    phone <- if(length(phone)==1) xmlValue(phone[[1]]) else NA
    town <- if(length(town)==1) xmlValue(town[[1]]) else NA

    return(data.frame(name=name, phone=phone, town=town, stringsAsFactors=FALSE))
}

do.call(rbind, lapply(divs, one.entry))
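
The difference between the absolute and the relative form can be checked directly. The following is a minimal sketch, assuming the XML package is installed; the inline two-entry `txt` string is a stand-in made up for this example, and the variable names are illustrative.

```r
library(XML)

# Illustrative stand-in for test.html (assumption: two entries are enough to show the effect)
txt <- "<html><body>
  <div class='entry'><span class='name'>Marcus Smith</span><span class='phone'>123456789</span></div>
  <div class='entry'><span class='name'>Henry Higgins</span></div>
</body></html>"

doc  <- htmlParse(txt, asText = TRUE)
divs <- getNodeSet(doc, "//div[@class='entry']")

# Absolute path: "//" starts at the document root, whatever the context node is
abs_names <- xpathSApply(divs[[2]], "//span[@class='name']", xmlValue)

# Relative paths: evaluated from the context node only
rel_child <- xpathSApply(divs[[2]], "span[@class='name']", xmlValue)    # direct children
rel_desc  <- xpathSApply(divs[[2]], ".//span[@class='name']", xmlValue) # any depth below the node

# A missing element yields an empty result rather than a value from another entry
no_phone <- xpathSApply(divs[[2]], "span[@class='phone']", xmlValue)
```

Here `abs_names` contains both names, while the two relative queries return only the name inside the second div. The ".//" form is the one to reach for when the spans may sit deeper than direct children.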

answered Aug 19, 2016 at 21:45

Karsten W.

1 Comment

Stibu Over a year ago

Thank you very much. It indeed seems that // goes to the root. This also works: xpathSApply(divs[[1]], "span[@class='name']", xmlValue). I was aware that you can search for nodes with //node and /node, but didn't know that just node also works.

hrbrmstr · Accepted Answer · 2016-08-20 02:31:47Z

Ugly base R + rvest solution (I cheated and used piping to avoid hellish nested parens or interim assignments) to show how good @alistaire's solution is:

library(rvest)
library(magrittr)

read_html("<!DOCTYPE html>
  <body>
    <div class='entry'>
      <span class='name'>Marcus Smith</span>
      <span class='town'>New York</span>
      <span class='phone'>123456789</span>
    </div>
    <div class='entry'>
      <span class='name'>Henry Higgins</span>
      <span class='town'>London</span>
    </div>
    <div class='entry'>
      <span class='name'>Paul Miller</span>
      <span class='town'>Boston</span>
      <span class='phone'>987654321</span>
    </div>
  </body>
</html>") -> pg

pg %>% 
  html_nodes('div.entry') %>% 
  lapply(html_nodes, css='span') %>% 
  lapply(function(x) { 
    setNames(html_text(x), html_attr(x, 'class')) %>% 
      as.list() %>% 
      as.data.frame(stringsAsFactors=FALSE)
  }) %>% 
  lapply(., unlist) %>% 
  lapply("[", unique(unlist(c(sapply(., names))))) %>% 
  do.call(rbind, .) %>% 
  as.data.frame(stringsAsFactors=FALSE)
