I am performing web scraping in R (using rvest) to collect a specific set of data from various webpages. All of the webpages are formatted the same, so I can extract the targeted data from its position on each page using the correct node with no problem. However, there are 100 different webpages, all with the same URL except for the very end. Is there a way to use a loop to perform the process automatically?
I am using the following code:
webpage_urls <- paste0("https://exampleurl=", endings)
where endings is a vector of the 100 endings that give the separate webpages.
and then
htmltemplate <- read_html(webpage_urls)
However, I then receive: Error: `x` must be a string of length 1
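For illustration, here is a minimal sketch of that step with a hypothetical endings vector (the real one holds the 100 page suffixes); read_html() runs fine when given a single element of webpage_urls, but fails when given the whole vector:

library(rvest)

# hypothetical endings; the real vector contains the 100 page suffixes
endings <- c("page1", "page2", "page3")
webpage_urls <- paste0("https://exampleurl=", endings)

# works: read_html() receives exactly one string (assuming the URL is reachable)
htmltemplate <- read_html(webpage_urls[1])

# fails with "Error: `x` must be a string of length 1",
# because webpage_urls is a character vector, not a single URL
# htmltemplate <- read_html(webpage_urls)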
After this step, I would like to perform the following extraction:
webscraping <- htmltemplate %>%
html_nodes("td") %>%
html_text()
nth_element <- function(vector, starting_position, n) {
  vector[seq(starting_position, length(vector), n)]
}
result <- nth_element(webscraping, 10, 5)
The extraction code all works when I run it manually for each individual webpage; however, I cannot repeat the process automatically for every webpage.
I am rather unfamiliar with loops/iteration and how to code them. Is there a way to run this extraction for each webpage and store the result of each run in a separate vector, so that I can compile them into a table? If not a loop, is there another way to automate the process so that I can get past the error demanding a single string?
read_html wants a single URL, not 100 of them; I think the error is clear here. Have you verified that your code works with a single URL? (i.e., read_html(webpage_urls[1]))

allresults <- lapply(webpage_urls, function(oneurl) { htmltemplate <- read_html(oneurl); ...; }) will create a list of all results, one URL per list element.
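A minimal sketch of that lapply suggestion, reusing the nth_element() helper from the question; the final naming and rbind steps are an assumption about how to compile the per-page results into a table:

library(rvest)

nth_element <- function(vector, starting_position, n) {
  vector[seq(starting_position, length(vector), n)]
}

# one list element per URL, each holding that page's extracted vector
allresults <- lapply(webpage_urls, function(oneurl) {
  webscraping <- read_html(oneurl) %>%
    html_nodes("td") %>%
    html_text()
  nth_element(webscraping, 10, 5)
})

# assumption: every page yields the same number of values,
# so the vectors can be stacked into a table with one row per webpage
names(allresults) <- endings
result_table <- do.call(rbind, allresults)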