How to scrape text from an HTML body

Question

I've never scraped. Would it be straightforward to scrape the text in the main, big gray box only from the link below (starting with header SRUS43 KMSR 271039, ending with .END)? My end goal is to basically have three tidy columns of data from all that text: the five digit codes, the values in inches, and the basin elevation descriptions, so any pointers with processing the text format are welcome, too.

https://www.nohrsc.noaa.gov/shef_archive/index.html?rfc=cnrfc&product=swe&year=2019&month=3&day=27&hour=6

Thank you for any help.

Possible duplicate of "Is there a simple way in R to extract only the text elements of an HTML page?"
– divibisan, commented Mar 27, 2019 at 18:54

JasonAizkalns · Accepted Answer · 2019-03-27 19:32:10Z

2

Reading in the text is fairly easy (see @DiceBoyT's answer). Cleaning up the format into three columns is a bit more involved. The code below could use some clean-up (especially the regex), but it gets the job done:

library(tidyverse)
library(rvest)

text <- read_html("https://www.nohrsc.noaa.gov/shef_archive/index.html?rfc=cnrfc&product=swe&year=2019&month=3&day=27&hour=6") %>% 
  html_node(".notes") %>% 
  html_text() 

df <- tibble(txt = read_lines(text))

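df %>%
  mutate(
    row = row_number(),
    # Lines that start with a 5-character code followed by a value
    with_code = str_extract(txt, "^[A-Za-z0-9]{5}\\s+\\d+(\\.)?\\d"),
    # Lines that carry only a value (optionally after a leading ":")
    wo_code = str_extract(txt, "^:?\\s+\\d+(\\.)?\\d") %>% str_extract("[:digit:]+\\.?[:digit:]"),
    # Description: the line just above each coded line, minus its first character
    basin_desc = if_else(!is.na(with_code), lag(txt, 1), NA_character_) %>% str_sub(start = 2)
  ) %>% 
  # Split "code value" into separate columns
  separate(with_code, c("code", "val"), sep = "\\s+") %>% 
  mutate(
    # Prefer the value found next to a code, otherwise the value-only match
    combined_val = case_when(
      !is.na(val) ~ val,
      !is.na(wo_code) ~ wo_code,
      TRUE ~ NA_character_
    ) %>% as.numeric
  ) %>%
  # Keep only rows that hold a value
  filter(!is.na(combined_val)) %>%
  mutate(
    # Carry the last seen code and description forward onto value-only rows
    code = zoo::na.locf(code),
    basin_desc = zoo::na.locf(basin_desc)
  ) %>%
  select(
    code, combined_val, basin_desc
  )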
#> # A tibble: 643 x 3
#>    code  combined_val basin_desc               
#>    <chr>        <dbl> <chr>                    
#>  1 ACSC1          0   San Antonio Ck - Sunol   
#>  2 ADLC1          0   Arroyo De La Laguna      
#>  3 ADOC1          0   Santa Ana R - Prado Dam  
#>  4 AHOC1          0   Arroyo Honda nr San Jose 
#>  5 AKYC1         41   SF American nr Kyburz    
#>  6 AKYC1          3.2 SF American nr Kyburz    
#>  7 AKYC1         42.2 SF American nr Kyburz    
#>  8 ALQC1          0   Alamo Canal nr Pleasanton
#>  9 ALRC1          0   Alamitos Ck - Almaden Res
#> 10 ANDC1          0   Coyote Ck - Anderson Res 
#> # ... with 633 more rows

Created on 2019-03-27 by the reprex package (v0.2.1)
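The last step of the pipeline relies on `zoo::na.locf()` to copy each code and description down onto the value-only rows that follow it. As a rough illustration of what that "last observation carried forward" step does, here is a minimal base-R sketch. The helper name `fill_forward` is made up for this example and is not part of zoo or the tidyverse. One difference to be aware of: `zoo::na.locf()` drops leading NAs by default (`na.rm = TRUE`), whereas this sketch keeps them as NA.

```r
# Hypothetical helper (not from zoo): carry the last non-NA value forward.
fill_forward <- function(x) {
  # Number of non-NA values seen so far at each position (0 before the first one)
  idx <- cumsum(!is.na(x))
  # Prepend NA so that positions before the first non-NA value stay NA
  c(NA, x[!is.na(x)])[idx + 1]
}

# Codes taken from the output above; the NA pattern is invented for illustration
codes <- c("AKYC1", NA, NA, "ALQC1", NA)
filled <- fill_forward(codes)
print(filled)
```

If the tidyverse is already loaded, `tidyr::fill()` is another way to get the same effect on a data frame column.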


2 Comments

dbo Over a year ago

Wow, thank you! I'm probably asking too much, but what if I wanted the basin_desc to be the elevation description instead of the location description, such as: Entire Basin, Base to 5000', 5000' to Top, etc?

dbo Over a year ago

For the comment above, I worked through each step of @JasonAizkalns and came up successfully with: df <- df %>% mutate(elevation_zone = gsub(".*(inches))","",txt))
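The pattern in the comment above appears with unbalanced parentheses, possibly because backslashes were lost when the comment was rendered (that is a guess). A hedged sketch of the same idea, with the literal parentheses escaped so the intent is explicit, might look like this. The function name `strip_through_marker` and the sample lines are invented for illustration; only the zone labels ("Entire Basin", "Base to 5000'") come from the comments above, and the real line layout may differ.

```r
# Hypothetical helper: drop everything up to and including a literal "(inches)"
# plus any whitespace that follows it. Lines without the marker are returned unchanged.
strip_through_marker <- function(txt) {
  sub(".*\\(inches\\)[[:space:]]*", "", txt)
}

# Made-up sample lines; the real product text may be laid out differently
sample_lines <- c(
  "PLACEHOLDER TEXT (inches) Entire Basin",
  "PLACEHOLDER TEXT (inches) Base to 5000'",
  "a line without the marker"
)
zones <- strip_through_marker(sample_lines)
print(zones)
```

Because `sub()` is vectorised, the same call can be used inside `mutate()` on the `txt` column.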

dave-edison · Accepted Answer · 2019-03-27 18:28:59Z

1

This is pretty straightforward to scrape with rvest:

library(rvest)

text <- read_html("https://www.nohrsc.noaa.gov/shef_archive/index.html?rfc=cnrfc&product=swe&year=2019&month=3&day=27&hour=6") %>% 
  html_node(".notes") %>% 
  html_text()
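
The question asks for only the block that starts with the header SRUS43 KMSR 271039 and ends with .END. Assuming the string returned by `html_text()` contains those two lines, a base-R sketch for trimming it could look like the following. The function name `extract_product`, its arguments, and the placeholder lines in the sample string are all hypothetical; only the header and `.END` markers come from the question.

```r
# Hypothetical helper: keep the lines from the first header line through the
# first footer line that follows it.
extract_product <- function(text, header = "SRUS43 KMSR", footer = ".END") {
  lines <- strsplit(text, "\n", fixed = TRUE)[[1]]
  trimmed <- trimws(lines)
  start <- which(startsWith(trimmed, header))[1]
  if (is.na(start)) stop("header line not found")
  end <- which(startsWith(trimmed, footer) & seq_along(trimmed) >= start)[1]
  if (is.na(end)) stop("footer line not found")
  lines[start:end]
}

# Made-up stand-in for the scraped text; real content will differ
sample_text <- paste(
  "page preamble",
  "SRUS43 KMSR 271039",
  "placeholder data line 1",
  "placeholder data line 2",
  ".END",
  "page footer",
  sep = "\n"
)
product <- extract_product(sample_text)
print(product)
```

The result is a character vector with one element per line, which can be passed to `tibble(txt = ...)` for further processing.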


2 Comments

dbo Over a year ago

glad to hear, though I'm getting

Error in open.connection(x, "rb") :
  Could not resolve host: www.nohrsc.noaa.gov
Calls: %>% -> eval -> eval -> read_html -> read_html.default
Execution halted

dave-edison Over a year ago

Try restarting R and running the code again; it works fine for me in a fresh R session.
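
A "Could not resolve host" message generally points to a DNS or network problem rather than to the scraping code itself, so it can be transient. If it keeps recurring, one option is to wrap the fetch in a small retry loop. The sketch below is only an illustration: `fetch_with_retry`, `tries`, and `wait` are invented names, and the `reader` argument is assumed to be something like `xml2::read_html` (which rvest re-exports as `read_html`).

```r
# Hypothetical helper: call `reader(url)` up to `tries` times, pausing `wait`
# seconds between failed attempts, and stop with an error if all attempts fail.
fetch_with_retry <- function(url, reader = xml2::read_html, tries = 3, wait = 1) {
  for (attempt in seq_len(tries)) {
    # Capture any error as a value instead of aborting
    result <- tryCatch(reader(url), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    message("Attempt ", attempt, " failed: ", conditionMessage(result))
    if (attempt < tries) Sys.sleep(wait)
  }
  stop("All ", tries, " attempts failed for ", url, call. = FALSE)
}

# Possible usage (requires network access and the xml2/rvest packages):
# page <- fetch_with_retry("https://www.nohrsc.noaa.gov/shef_archive/index.html?rfc=cnrfc&product=swe&year=2019&month=3&day=27&hour=6")
```

Passing the reader in as an argument keeps the helper easy to try out offline with a stand-in function.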
