Scrape nested html structure

Question

I would like to scrape the data from this site without losing the information in the nested structure. Consider the name benodanil, which belongs not only to benzanilide fungicides, but also to anilide fungicides and amide fungicides. The number of classes is not always three: each name has at least one class and possibly many. So, ideally, I'd want a data.frame that looks like this:

name	class1	class2	class3	...
benodanil	benzanilide fungicides	anilide fungicides	amide fungicides	NA
aureofungin	antibiotic fungicides	NA	NA	NA
...	...	...	...	...

I can scrape the data, but can't wrap my head around how to handle the information in the nested structure. What I tried so far:

require(rvest)

url = 'http://www.alanwood.net/pesticides/class_fungicides.html'

site = read_html(url)
# extract list items
li = html_nodes(site, 'li')
# extract unordered lists
ul = html_nodes(site, 'ul')

# loop idea
l = list()
for (i in seq_along(li)) {
  li1 = html_nodes(li[i], 'a')
  name = na.omit(unique(html_attr(li1, 'href')))
  clas = na.omit(unique(html_attr(li1, 'name')))
  
  l[[i]] = list(name = name,
                clas = clas)
}

An additional problem is that some names, such as bixafen, occur more than once. Hence, I guess the job has to be done iteratively.

Ronak Shah · Accepted Answer · 2021-06-05 12:52:51Z

library(dplyr)
library(tidyr)
library(rvest)

url = 'http://www.alanwood.net/pesticides/class_fungicides.html'

site = read_html(url)
a <- site %>% html_nodes('li ul a')

tibble(name = a %>% html_attr('href'), 
       class = a %>% html_attr('name')) %>%
  fill(class) %>%
  filter(!is.na(name)) %>%
  mutate(name = sub('\\.html', '', name)) %>%
  group_by(name) %>%
  mutate(col = paste0('class', row_number())) %>%
  pivot_wider(names_from = col, values_from = class) %>%
  ungroup()

# A tibble: 189 x 4
#   name         class1                  class2                class3                     
#   <chr>        <chr>                   <chr>                 <chr>                      
# 1 benalaxyl    acylamino_acid_fungici… anilide_fungicides    NA                         
# 2 benalaxyl-m  acylamino_acid_fungici… anilide_fungicides    NA                         
# 3 furalaxyl    acylamino_acid_fungici… furanilide_fungicides NA                         
# 4 metalaxyl    acylamino_acid_fungici… anilide_fungicides    NA                         
# 5 metalaxyl-m  acylamino_acid_fungici… anilide_fungicides    NA                         
# 6 pefurazoate  acylamino_acid_fungici… NA                    NA                         
# 7 valifenalate acylamino_acid_fungici… NA                    NA                         
# 8 bixafen      anilide_fungicides      picolinamide_fungici… pyrazolecarboxamide_fungic…
# 9 boscalid     anilide_fungicides      NA                    NA                         
#10 carboxin     anilide_fungicides      NA                    NA                         
# … with 179 more rows

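Extract the name (href) and class (name attribute) from each anchor on the page, fill the NA class values with the previous non-NA value, drop the rows where name is NA, and reshape the data to wide format so that each name gets one row with columns class1, class2, and so on.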
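
To see what each step does without hitting the website, here is a minimal sketch of the same fill/filter/pivot sequence on a made-up table. The `anchors` and `wide` objects and their rows are assumptions for illustration only, not the real page contents, and the `gsub()` call that turns underscores into spaces is an optional extra that is not part of the pipeline above.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the scraped anchors (made-up rows, not the real page):
# a heading anchor has a class but no name; a compound link has a name but no class.
anchors <- tibble(
  name  = c(NA, "benodanil.html", "bixafen.html", NA, "benodanil.html"),
  class = c("anilide_fungicides", NA, NA, "benzanilide_fungicides", NA)
)

wide <- anchors %>%
  fill(class) %>%                                  # carry each heading down to the links below it
  filter(!is.na(name)) %>%                         # drop the heading rows themselves
  mutate(name  = sub("\\.html$", "", name),        # strip the file extension
         class = gsub("_", " ", class)) %>%        # optional: underscores to spaces
  group_by(name) %>%
  mutate(col = paste0("class", row_number())) %>%  # number the classes within each name
  ungroup() %>%
  pivot_wider(names_from = col, values_from = class)

wide
```

Because `fill()` only carries forward the nearest preceding heading, a name picks up several classes only when it is listed under several headings, which is why repeated names such as bixafen end up with more than one class column.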
