I've been trying to read and parse a bit of HTML to obtain a list of conditions for animals at an animal shelter. I'm sure my inexperience with HTML parsing isn't helping, but I seem to be getting nowhere fast.
Here's a snippet of the HTML:
<select multiple="true" name="asilomarCondition" id="asilomarCondition">
<option value="101">
Behavior- Aggression, Confrontational-Toward People (mild)
-
TM</option>
....
</select>
There's only one <select...> tag; the rest are all <option value=x> tags.
I've been using the XML package. I can remove the newlines and tabs, but I haven't had any success removing the tags:
conditions.html <- paste(readLines("Data/evalconditions.txt"), collapse="\n")
conditions.text <- gsub('[\t\n]',"",conditions.html)
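For this one known snippet, a plain gsub that strips anything between angle brackets may be enough; this is a sketch (with the HTML inlined as a string rather than read from your file), and worth noting that regexes are fragile for HTML in general:

```r
# Sketch: strip tags with a regex (workable for one fixed snippet,
# fragile for arbitrary HTML)
conditions.html <- '<select multiple="true" name="asilomarCondition" id="asilomarCondition">
<option value="101">
Behavior- Aggression, Confrontational-Toward People (mild)
-
TM</option>
</select>'

no.tags <- gsub("<[^>]+>", "", conditions.html)    # drop every <...> tag
pieces  <- trimws(strsplit(no.tags, "\n")[[1]])    # split on newlines, trim each piece
pieces  <- pieces[pieces != ""]                    # drop the now-empty lines
condition <- paste(pieces, collapse = "")          # rejoin into one string
```

The downside of the pure-regex route is that with many `<option>` elements you also have to work out where one condition ends and the next begins, which a real parser gives you for free.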
As a final result, I'd like a list of all of the conditions that I can process further for later use as factor names:
Behavior- Aggression, Confrontational-Toward People (mild)-TM
Behavior- Aggression, Confrontational-Toward People (moderate/severe)-UU
...
I'm not sure whether I need the XML package (or another library) or whether gsub patterns would be sufficient; either way, I need to work out how to use it.
library(rvest)

read_html("Data/evalconditions.txt") %>%
  html_nodes("option") %>%
  html_text(trim = TRUE)
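One caveat: trim = TRUE only strips leading and trailing whitespace, so the newlines inside each option survive. A final gsub gets you to the exact format in the question (a sketch, again with the snippet inlined instead of the file):

```r
library(rvest)

snippet <- '<option value="101">
Behavior- Aggression, Confrontational-Toward People (mild)
-
TM</option>'

conditions <- read_html(snippet) %>%
  html_nodes("option") %>%
  html_text(trim = TRUE)

# trim = TRUE only strips the ends; collapse the internal line breaks too
conditions <- gsub("\\s*\\n\\s*", "", conditions)
```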