I have an HTML data set, shown below, that I want to parse and convert into a tabular format I can work with.

<!DOCTYPE html>
<html>

<head>
    <title>Page Title</title>
</head>

<body>
    <div class="brewery" id="brewery">
        <ul class="vcard simple">
            <li class="name"> Bradley Farm / RB Brew, LLC</li>
            <li class="address">317 Springtown Rd </li>
            <li class="address_2">New Paltz, NY 12561-3020 | <a href='http://www.google.com/maps/place/317 Springtown Rd++New Paltz+NY+United States' target='_blank'>Map</a> </li>
            <li class="telephone">Phone: (845) 255-8769</li>
            <li class="brewery_type">Type: Micro</li>
            <li class="url"><a href="http://www.raybradleyfarm.com" target="_blank">www.raybradleyfarm.com</a> </li>
        </ul>
        <ul class="vcard simple col2"></ul>
    </div>
    <div class="brewery">
        <ul class="vcard simple">
            <li class="name">(405) Brewing Co</li>
            <li class="address">1716 Topeka St </li>
            <li class="address_2">Norman, OK 73069-8224 | <a href='http://www.google.com/maps/place/1716 Topeka St++Norman+OK+United States' target='_blank'>Map</a> </li>
            <li class="telephone">Phone: (405) 816-0490</li>
            <li class="brewery_type">Type: Micro</li>
            <li class="url"><a href="http://www.405brewing.com" target="_blank">www.405brewing.com</a> </li>
        </ul>
        <ul class="vcard simple col2"></ul>
    </div>
</body>

Below is the code I have used. The issue is that rvest gives me the text of each div as a single string, and I can't get it into any useful format.

library(dplyr)
library(rvest)

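# read_html() replaces html(), which is deprecated in rvest
page <- read_html("beer.html")
selector_name <- ".brewery"
# Each .brewery div comes back as one undivided block of text
fnames <- html_nodes(x = page, css = selector_name) %>%
  html_text()
head(fnames)
fnames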

Is this the correct approach, or should I use some other package to go through each div and its inner elements?

The output I would like to see is:

No.  Name  Address Type Website

Thank you.

2 Answers

library(rvest)
library(dplyr)

html_file <- '<!DOCTYPE html>
<html>

<head>
    <title>Page Title</title>
</head>

<body>
    <div class="brewery" id="brewery">
        <ul class="vcard simple">
            <li class="name"> Bradley Farm / RB Brew, LLC</li>
            <li class="address">317 Springtown Rd </li>
            <li class="address_2">New Paltz, NY 12561-3020 | <a href="http://www.google.com/maps/place/317 Springtown Rd++New Paltz+NY+United States" target="_blank">Map</a> </li>
            <li class="telephone">Phone: (845) 255-8769</li>
            <li class="brewery_type">Type: Micro</li>
            <li class="url"><a href="http://www.raybradleyfarm.com" target="_blank">www.raybradleyfarm.com</a> </li>
        </ul>
        <ul class="vcard simple col2"></ul>
    </div>
    <div class="brewery">
        <ul class="vcard simple">
            <li class="name">(405) Brewing Co</li>
            <li class="address">1716 Topeka St </li>
            <li class="address_2">Norman, OK 73069-8224 | <a href="http://www.google.com/maps/place/1716 Topeka St++Norman+OK+United States" target="_blank">Map</a> </li>
            <li class="telephone">Phone: (405) 816-0490</li>
            <li class="brewery_type">Type: Micro</li>
            <li class="url"><a href="http://www.405brewing.com" target="_blank">www.405brewing.com</a> </li>
        </ul>
        <ul class="vcard simple col2"></ul>
    </div>
</body>'

page <- read_html(html_file) 

tibble(
  name = page %>% html_nodes(".vcard .name") %>% html_text(),
  address = page %>% html_nodes(".vcard .address") %>% html_text(),
  type = page %>% html_nodes(".vcard .brewery_type") %>% html_text() %>% stringr::str_replace_all("^Type: ", ""),
  website = page %>% html_nodes(".vcard .url a") %>% html_attr("href")
)

#> # A tibble: 2 x 4
#>                           name            address  type                       website
#>                          <chr>              <chr> <chr>                         <chr>
#> 1  Bradley Farm / RB Brew, LLC 317 Springtown Rd  Micro http://www.raybradleyfarm.com
#> 2             (405) Brewing Co    1716 Topeka St  Micro     http://www.405brewing.com
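
The question also asks for a running "No." column, which the tibble above does not include. A minimal sketch of one way to add it, assuming the tibble package is installed; the `breweries` object here is a hypothetical stand-in for the result above, not part of the original answer:

```r
library(tibble)

# Hypothetical stand-in for the tibble built above (two columns only, for brevity)
breweries <- tibble(
  name = c("Bradley Farm / RB Brew, LLC", "(405) Brewing Co"),
  type = c("Micro", "Micro")
)

# rowid_to_column() prepends a column of sequential row numbers under the given name
numbered <- rowid_to_column(breweries, var = "No.")
numbered
```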

2 Comments

Thanks a lot @austensen. The only error I get is for the type column when I run this on the whole file. It probably has something to do with blank type values: ` Error: Column type must be length 1 or 7263, not 7147 `
Oh, it sounds like, unlike in your example, some breweries in your real data are missing the type field, so that column ends up a different length from the others. I'd have to think a bit more about how to solve that.
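
One possible way around the length mismatch described in these comments is to select the brewery divs first and then call `html_node()` (singular) on that set. It returns exactly one result per div, so a missing field becomes `NA` instead of shortening the column. This is a sketch under assumptions, not the answerer's own fix: the trimmed-down HTML and the second brewery's missing type field are made up for illustration.

```r
library(rvest)
library(dplyr)

# Hypothetical input: the second brewery deliberately has no brewery_type item
html_string <- '
<div class="brewery">
  <ul class="vcard simple">
    <li class="name">Bradley Farm / RB Brew, LLC</li>
    <li class="address">317 Springtown Rd</li>
    <li class="brewery_type">Type: Micro</li>
    <li class="url"><a href="http://www.raybradleyfarm.com">www.raybradleyfarm.com</a></li>
  </ul>
</div>
<div class="brewery">
  <ul class="vcard simple">
    <li class="name">(405) Brewing Co</li>
    <li class="address">1716 Topeka St</li>
    <li class="url"><a href="http://www.405brewing.com">www.405brewing.com</a></li>
  </ul>
</div>'

# One node per brewery; every column below is derived from this same set
breweries <- read_html(html_string) %>% html_nodes("div.brewery")

result <- tibble(
  name    = breweries %>% html_node(".name") %>% html_text(trim = TRUE),
  address = breweries %>% html_node(".address") %>% html_text(trim = TRUE),
  # html_node() yields a missing node where the field is absent, which becomes NA
  type    = breweries %>% html_node(".brewery_type") %>% html_text(trim = TRUE) %>%
    sub("^Type: ", "", .),
  website = breweries %>% html_node(".url a") %>% html_attr("href")
)
result
```

Because every column is computed from the same `breweries` node set, all columns have one entry per div regardless of which fields are present.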

The problem is that the data is not an HTML table, so it's not straightforward to parse. It's just two lists, which the code below concatenates into one long list of class/value pairs. Also, FYI, try looking into the xml2 package for parsing HTML/XML.

library(dplyr)
library(rvest)
library(xml2)

vcard <- 
  '<!DOCTYPE html>
  <html>

  <head>
  <title>Page Title</title>
  </head>

  <body>
  <div class="brewery" id="brewery">
  <ul class="vcard simple">
  <li class="name"> Bradley Farm / RB Brew, LLC</li>
  <li class="address">317 Springtown Rd </li>
  <li class="address_2">New Paltz, NY 12561-3020 | <a href=\'http://www.google.com/maps/place/317 Springtown Rd++New Paltz+NY+United States\' target=\'_blank\'>Map</a> </li>
  <li class="telephone">Phone: (845) 255-8769</li>
  <li class="brewery_type">Type: Micro</li>
  <li class="url"><a href="http://www.raybradleyfarm.com" target="_blank">www.raybradleyfarm.com</a> </li>
  </ul>
  <ul class="vcard simple col2"></ul>
  </div>
  <div class="brewery">
  <ul class="vcard simple">
  <li class="name">(405) Brewing Co</li>
  <li class="address">1716 Topeka St </li>
  <li class="address_2">Norman, OK 73069-8224 | <a href=\'http://www.google.com/maps/place/1716 Topeka St++Norman+OK+United States\' target=\'_blank\'>Map</a> </li>
  <li class="telephone">Phone: (405) 816-0490</li>
  <li class="brewery_type">Type: Micro</li>
  <li class="url"><a href="http://www.405brewing.com" target="_blank">www.405brewing.com</a> </li>
  </ul>
  <ul class="vcard simple col2"></ul>
  </div>
  </body>' %>% 
    read_html() %>% 
    xml_find_all("//ul[@class = 'vcard simple']")

# xml_children() on a node set returns the children of every <ul> as one node set
items <- xml2::xml_children(vcard)

# One row per <li>: its class attribute and its text content
data.frame(
  class = xml2::xml_attr(items, "class"),
  value = xml2::xml_text(items),
  stringsAsFactors = FALSE
)
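
The long class/value layout above still has to be turned into one row per brewery to match the table the question asks for. A minimal sketch of that step, assuming xml2 plus base R's `reshape()`; the shortened HTML and the `id` column are illustrative additions, not part of the original answer:

```r
library(xml2)

# Hypothetical, shortened input with the same structure as the question's data
html_string <- '
<div class="brewery"><ul class="vcard simple">
  <li class="name">Bradley Farm / RB Brew, LLC</li>
  <li class="address">317 Springtown Rd</li>
  <li class="brewery_type">Type: Micro</li>
</ul></div>
<div class="brewery"><ul class="vcard simple">
  <li class="name">(405) Brewing Co</li>
  <li class="address">1716 Topeka St</li>
  <li class="brewery_type">Type: Micro</li>
</ul></div>'

cards <- xml_find_all(read_html(html_string), "//ul[@class = 'vcard simple']")

# Build the long table, tagging each <li> with the index of its parent <ul>
rows <- lapply(seq_along(cards), function(i) {
  items <- xml_children(cards[[i]])
  data.frame(
    id    = rep(i, length(items)),
    class = xml_attr(items, "class"),
    value = xml_text(items, trim = TRUE),
    stringsAsFactors = FALSE
  )
})
long <- do.call(rbind, rows)

# Spread the class values into columns: one row per brewery
wide <- reshape(long, idvar = "id", timevar = "class", direction = "wide")
names(wide) <- sub("^value\\.", "", names(wide))
wide$brewery_type <- sub("^Type: ", "", wide$brewery_type)
wide
```

Fields absent from a given brewery come out as `NA` in the wide table, since `reshape()` fills missing id/class combinations.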
