I wrote some R code to scrape air quality data. It used to run without any issues, but when I reran it recently I started getting an error from the html_nodes() function.

Here is my code:

library(rvest)
library(tidyverse)
library(lubridate)


## Download MOE Location Data
# https://stackoverflow.com/questions/25677035/how-to-create-a-range-of-dates-in-r

## Create a tibble of dates
start_date <- "2021/1/1"
end_date <- "2021/12/31"

dates <- seq(as.Date(start_date), as.Date(end_date), "days")

df <- NULL

for (datex in dates) {
  datef = as.Date(datex, origin = "1970-01-01")
  Day = day(datef)
  Month = month(datef)
  Year = year(datef)
  for (hour in 1:24) {
    url.new <-
      paste(
        "http://www.airqualityontario.com/aqhi/locations.php?start_day=",
        Day,
        "&start_month=",
        Month,
        "&start_year=",
        Year,
        "&my_hour=",
        hour,
        "&pol=36&text_only=1&Submit=Update",
        sep = ""
      )
    download.file(url.new, destfile = "scrapedpage.html", quiet=TRUE)
    simple <- read_html("scrapedpage.html")
    test <- simple %>%
      html_nodes("td") %>%
      html_text()
    test <- as_tibble(test)
    df.temp <-
      as.data.frame(matrix(
        unlist(test, use.names = FALSE),
        ncol = 3,
        byrow = TRUE
      )) %>%
      mutate(date = paste(datef)) %>%
      mutate(hour = hour)
    df <- rbind(df, df.temp)
    
  }
}


df <- as_tibble(df)

colnames(df) <- c("Station","Address","SurfaceConc","SurfaceDate","Hour")

MOE_data <- df %>%
  filter(Address != "Bay St. Wellesley St. W.") %>%
  select(-Address) %>%
  mutate(Station = trimws(Station)) %>%
  # filter(str_detect(Station, 'Toronto')) %>%
  mutate(Hour = paste(Hour, ":00:00", sep = "")) %>%
  mutate(Hour = hms::as_hms(Hour)) %>%
  mutate(SurfaceDate = paste(SurfaceDate, Hour)) %>%
  mutate(SurfaceDate = as_datetime(SurfaceDate)) %>%
  select(-Hour) 

MOE_data <- as_tibble(MOE_data)

# Keep only the final result; the object created above is named MOE_data
rm(list = setdiff(ls(), "MOE_data"))
# save.image(file='Jan2019_Dec2021.RData')

This is the error I get:

Error in UseMethod("xml_find_all") : 
  no applicable method for 'xml_find_all' applied to an object of class "xml_document"

What I don't understand is why it happens only for some values, and only some of the time. For example, I get the error when hour = 16, but when I rerun the code it may work. The failure isn't consistent.

  • It looks like you are downloading the HTML page to a file and then reading that file. The download may sometimes take longer than expected, which could be what triggers the error. Two things to try: add a short pause after download.file(), e.g. Sys.sleep(0.5), or read the URL directly with read_html(url.new). Commented Jan 7, 2022 at 18:32
  • I used the second suggestion and it worked. Thank you! Commented Jan 7, 2022 at 23:12
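
For anyone landing here later, the second suggestion from the comments (reading the URL directly instead of going through download.file() and a temporary file) might look roughly like the sketch below. build_url() and read_cells() are hypothetical helper names, not part of rvest. The retry loop, the number of tries, and the pause length are assumptions added for robustness and go beyond what the comment proposed.

```r
# Hypothetical helper: builds the same query URL as the loop in the question.
build_url <- function(day, month, year, hour) {
  paste0(
    "http://www.airqualityontario.com/aqhi/locations.php?start_day=", day,
    "&start_month=", month,
    "&start_year=", year,
    "&my_hour=", hour,
    "&pol=36&text_only=1&Submit=Update"
  )
}

# Hypothetical helper: reads the page straight from the URL (no temp file)
# and returns the text of every <td> cell. If an attempt fails, it pauses
# and tries again; tries and pause are arbitrary illustrative defaults.
read_cells <- function(url, tries = 3, pause = 0.5) {
  for (attempt in seq_len(tries)) {
    cells <- tryCatch(
      {
        page <- rvest::read_html(url)
        rvest::html_text(rvest::html_nodes(page, "td"))
      },
      error = function(e) NULL
    )
    if (!is.null(cells)) {
      return(cells)
    }
    Sys.sleep(pause)
  }
  stop("Could not read ", url, " after ", tries, " attempts")
}

# Example use inside the inner loop of the question (requires network access):
# test <- read_cells(build_url(Day, Month, Year, hour))
```

Inside the loop, the result of read_cells() would stand in for the download.file() / read_html() / html_nodes() / html_text() steps, and the rest of the code could stay as it is.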
