I would like to scrape the data from this site, without losing the information from the nested structure. Consider the name benodanil, which not only belongs to benzanilide fungicides, but also to anilide fungicides and amide fungicides. It's not necessarily always 3 classes, but at least one and up to many. So, ideally, I'd want a data.frame that looks as such:
| name | class1 | class2 | class3 | ... |
|---|---|---|---|---|
| benodanil | benzanilide fungicides | anilide fungicides | amide fungicides | NA |
| aureofungin | antibiotic fungicides | NA | NA | NA |
| ... | ... | ... | ... |
I can scrape the data, but can't wrap my head around how to handle the information in the nested structure. What I tried so far:
require(rvest)
url = 'http://www.alanwood.net/pesticides/class_fungicides.html'
site = read_html(url)
# extract lists
li = html_nodes(site, 'li')
# extract unorder lists
ul = html_nodes(site, 'ul')
# loop idea
l = list()
for (i in seq_along(li)) {
li1 = html_nodes(li[i], 'a')
name = na.omit(unique(html_attr(li1, 'href')))
clas = na.omit(unique(html_attr(li1, 'name')))
l[[i]] = list(name = name,
clas = clas)
}
An additional problem is, that some names occur more than one time, such as bixafen. Hence, I guess the job has to be done iteratively.