
I have been trying to read and parse a bit of HTML to obtain a list of conditions for animals at an animal shelter. I'm sure my inexperience with HTML parsing isn't helping, but I seem to be getting nowhere fast.

Here's a snippet of the HTML:

<select multiple="true" name="asilomarCondition" id="asilomarCondition">

    <option value="101">
        Behavior- Aggression, Confrontational-Toward People (mild)
        -
        TM</option>
....
</select>

There's only one <select...> tag; the rest are all <option value=x>.

I've been using the XML library. I can remove the newlines and tabs, but haven't had any success removing the tags:

conditions.html <- paste(readLines("Data/evalconditions.txt"), collapse="\n")
conditions.text <- gsub('[\t\n]',"",conditions.html)
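
For the tag removal itself, I'm guessing the pattern would be something like this (only tested mentally against the snippet above):

# guess: strip anything that looks like a tag, e.g. <option value="101">
conditions.text <- gsub("<[^>]*>", "", conditions.text)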

As a final result, I'd like a list of all of the conditions that I can process further for later use as factor names:

Behavior- Aggression, Confrontational-Toward People (mild)-TM
Behavior- Aggression, Confrontational-Toward People (moderate/severe)-UU
...

I'm not sure if I need to use the XML library (or another library) or if gsub patterns would be sufficient (either way, I need to work out how to use it).
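
In case it clarifies what I mean, here is roughly what I imagine the XML-library version would look like; I haven't gotten anything like this working, so the htmlParse/xpathSApply usage is just my guess:

library(XML)

# parse the saved page as HTML, then pull the text of each <option> node
doc <- htmlParse("Data/evalconditions.txt")
conditions <- xpathSApply(doc, "//option", xmlValue)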

  • Can you point to the full URL with that select box, or expand the snippet a bit?
  • I find the rvest package easier to use. If you can provide a link to the website, someone could code up a solution for you.
  • It's HTML; it's a select list in a form. @alistaire
  • Oops, true. library(rvest) ; html %>% read_html() %>% html_nodes('option') %>% html_text(trim = TRUE)
  • Unfortunately, I can't provide a URL. It is an online DBMS with user access only. I'm a volunteer with the shelter trying to help with some data analysis. I could pull the whole page, but there's likely sensitive data in there; I just took one animal instance to get at the part I need. I can post the entire snippet I pulled if that would be useful. I'll look into the rvest library, though!

1 Answer


Here is a start using the rvest package:

library(rvest)

# read the HTML page
page <- read_html("test.html")
# get the text from the "option" nodes and then trim the surrounding whitespace
nodes <- trimws(html_text(html_nodes(page, "option")))

# nodes will need additional cleanup to remove the embedded
# newline characters and runs of spaces
nodes <- gsub("\n", "", nodes)
nodes <- gsub("  ", "", nodes)

The vector nodes should contain the result you requested. This example is based on the limited sample provided above, so the actual page may produce unexpected results.
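
For reference, the same steps can be written as the pipe chain suggested in the comments above; the whitespace regex below is an assumption about how the page indents each option's text:

library(rvest)

# pipe form of the same extraction ("test.html" as above)
nodes <- read_html("test.html") %>%
  html_nodes("option") %>%
  html_text(trim = TRUE)

# collapse each entry's internal line breaks and surrounding indentation
nodes <- gsub("[ \t]*\n[ \t]*", "", nodes)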


1 Comment

Thanks, @Dave2e! This worked perfectly! I had a few additional characters to clean up, but that was easy to do with your examples. On to the rest of the data cleaning! :o
