Error when web scraping in R: Error in UseMethod("xml_find_all") :

Asked 3 years, 11 months ago

Modified 3 years, 11 months ago

I wrote some code to scrape air quality data in R. It worked fine at first, but when I reran it recently I started getting an error from the html_nodes() function.

Here is my code:

library(rvest)
library(tidyverse)
library(lubridate)


## Download MOE Location Data
# https://stackoverflow.com/questions/25677035/how-to-create-a-range-of-dates-in-r

## Create a tibble of dates
start_date <- "2021/1/1"
end_date <- "2021/12/31"

dates <- seq(as.Date(start_date), as.Date(end_date), "days")

df <- NULL

for (datex in dates) {
  datef = as.Date(datex, origin = "1970-01-01")
  Day = day(datef)
  Month = month(datef)
  Year = year(datef)
  for (hour in 1:24) {
    url.new <-
      paste(
        "http://www.airqualityontario.com/aqhi/locations.php?start_day=",
        Day,
        "&start_month=",
        Month,
        "&start_year=",
        Year,
        "&my_hour=",
        hour,
        "&pol=36&text_only=1&Submit=Update",
        sep = ""
      )
    download.file(url.new, destfile = "scrapedpage.html", quiet=TRUE)
    simple <- read_html("scrapedpage.html")
    test <- simple %>%
      html_nodes("td") %>%
      html_text()
    test <- as_tibble(test)
    df.temp <-
      as.data.frame(matrix(
        unlist(test, use.names = FALSE),
        ncol = 3,
        byrow = TRUE
      )) %>%
      mutate(date = paste(datef)) %>%
      mutate(hour = hour)
    df <- rbind(df, df.temp)
    
  }
}


df <- as_tibble(df)

colnames(df) <- c("Station","Address","SurfaceConc","SurfaceDate","Hour")

MOE_data <- df %>%
  filter(Address != "Bay St. Wellesley St. W.") %>%
  select(-Address) %>%
  mutate(Station = trimws(Station)) %>%
  # filter(str_detect(Station, 'Toronto')) %>%
  mutate(Hour = paste(Hour, ":00:00", sep = "")) %>%
  mutate(Hour = hms::as_hms(Hour)) %>%
  mutate(SurfaceDate = paste(SurfaceDate, Hour)) %>%
  mutate(SurfaceDate = as_datetime(SurfaceDate)) %>%
  select(-Hour) 

MOE_data <- as_tibble(MOE_data)

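# Keep only MOE_data in the workspace (the original name "MOE_data_2021" matched no object, so everything was removed)
rm(list = setdiff(ls(), "MOE_data"))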
# save.image(file='Jan2019_Dec2021.RData')

This is the error I get:

Error in UseMethod("xml_find_all") : 
  no applicable method for 'xml_find_all' applied to an object of class "xml_document"

What I don't understand is why it happens only for some values, and only some of the time. For example, I get the error when hour = 16, but when I rerun the code it may work. It's just not consistent.

asked Jan 7, 2022 at 16:16

Priya Patel

It looks like you are downloading the HTML page to disk and then reading it back. The download may be taking longer than expected, which could cause the error. Two things to try: add a short pause after download.file(), such as Sys.sleep(0.5), or read the URL directly with read_html(url.new).

– Dave2e
Commented Jan 7, 2022 at 18:32

I used the second suggestion and it worked. Thank you!

– Priya Patel
Commented Jan 7, 2022 at 23:12
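
For anyone landing here later, the second suggestion (reading the URL directly instead of going through a temporary file) could look roughly like the sketch below. The helper names build_url and read_page, the retry count, and the pause length are all made up for illustration and are not from the original code. The retry loop is an extra precaution, assuming the failures are transient; the thread does not confirm the root cause.

```r
# Hypothetical helper: builds the request URL from the same query
# parameters used in the question.
build_url <- function(day, month, year, hour) {
  paste0(
    "http://www.airqualityontario.com/aqhi/locations.php?start_day=", day,
    "&start_month=", month,
    "&start_year=", year,
    "&my_hour=", hour,
    "&pol=36&text_only=1&Submit=Update"
  )
}

# Hypothetical helper: reads the page straight from the URL (no intermediate
# file) and retries a few times, pausing between attempts, before giving up.
# `reader` defaults to rvest::read_html, so the rvest package must be installed.
read_page <- function(url, tries = 3, pause = 0.5, reader = rvest::read_html) {
  for (i in seq_len(tries)) {
    page <- tryCatch(reader(url), error = function(e) NULL)
    if (!is.null(page)) {
      return(page)
    }
    Sys.sleep(pause)  # brief pause before the next attempt
  }
  stop("Could not read ", url, " after ", tries, " attempts")
}

# Inside the inner loop of the question, the download.file()/read_html()
# pair would then be replaced by something like:
#   simple <- read_page(build_url(Day, Month, Year, hour))
#   test <- rvest::html_text(rvest::html_nodes(simple, "td"))
```

The rest of the loop can stay as it is. Passing the reader as an argument is only there so the retry logic can be tried out without hitting the site.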
