
I have been trying to read and parse a bit of HTML to obtain a list of conditions for animals at an animal shelter. I'm sure my inexperience with HTML parsing isn't helping, but I seem to be getting nowhere fast.

Here's a snippet of the HTML:

<select multiple="true" name="asilomarCondition" id="asilomarCondition">

    <option value="101">
        Behavior- Aggression, Confrontational-Toward People (mild)
        -
        TM</option>
....
</select>

There's only one <select...> tag; the rest are all <option value=x>.

I've been using the XML library. I can remove the newlines and tabs, but haven't had any success removing the tags:

conditions.html <- paste(readLines("Data/evalconditions.txt"), collapse="\n")
conditions.text <- gsub('[\t\n]',"",conditions.html)
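
For the tag removal itself, I'm guessing the pattern would be something like this (only tested mentally against the snippet above):

# guess: strip anything that looks like a tag, e.g. <option value="101">
conditions.text <- gsub("<[^>]*>", "", conditions.text)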

As a final result, I'd like a list of all of the conditions that I can process further for later use as factor names:

Behavior- Aggression, Confrontational-Toward People (mild)-TM
Behavior- Aggression, Confrontational-Toward People (moderate/severe)-UU
...

I'm not sure if I need to use the XML library (or another library) or if gsub patterns would be sufficient (either way, I need to work out how to use it).
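
In case it clarifies what I mean, here is roughly what I imagine the XML-library version would look like; I haven't gotten anything like this working, so the htmlParse/xpathSApply usage is just my guess:

library(XML)

# parse the saved page as HTML, then pull the text of each <option> node
doc <- htmlParse("Data/evalconditions.txt")
conditions <- xpathSApply(doc, "//option", xmlValue)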

  • Can you point to the full URL with that select box, or expand the snippet a bit?
  • I find the rvest package easier to use. If you can provide a link to the website, someone could code up a solution for you.
  • It's HTML; it's a select list in a form. @alistaire
  • Oops, true. library(rvest) ; html %>% read_html() %>% html_nodes('option') %>% html_text(trim = TRUE)
  • Unfortunately, I can't provide a URL. It is an online DBMS with user access only. I'm a volunteer with the shelter trying to help with some data analysis. I could pull the whole page, but there's likely sensitive data in there; I just took one animal instance to get at the part I need. I can post the entire snippet I pulled if that would be useful. I'll look into the rvest library, though!

1 Answer


Here is a start using the rvest package:

library(rvest)

# read the HTML page
page <- read_html("test.html")
# get the text from the "option" nodes and then trim the surrounding whitespace
nodes <- trimws(html_text(html_nodes(page, "option")))

# nodes will need additional cleanup to remove the embedded
# newline characters and runs of spaces
nodes <- gsub("\n", "", nodes)
nodes <- gsub("  ", "", nodes)

The vector nodes should contain the result you requested. This example is based on the limited sample provided above, so the actual page may produce unexpected results.
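
For reference, the same steps can be written as the pipe chain suggested in the comments above; the whitespace regex below is an assumption about how the page indents each option's text:

library(rvest)

# pipe form of the same extraction ("test.html" as above)
nodes <- read_html("test.html") %>%
  html_nodes("option") %>%
  html_text(trim = TRUE)

# collapse each entry's internal line breaks and surrounding indentation
nodes <- gsub("[ \t]*\n[ \t]*", "", nodes)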


1 Comment

Thanks, @Dave2e! This worked perfectly! I had a few additional characters to clean up, but that was easy to do with your examples. On to the rest of the data cleaning! :o
