
I am performing web scraping in R (using rvest) for a specific set of data on various webpages. All of the webpages are formatted the same, so I can extract the targeted data from the same node on each page with no problem. However, there are 100 different web pages, all with the same URL except for the very end. Is there a way to use a loop to perform the process automatically?

I am using the following code:

webpage_urls <- paste0("https://exampleurl=", endings)

where endings is a vector of the 100 endings that give the separate webpages.
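For a reproducible sketch, hypothetical endings (the real values are specific to the pages being scraped) would look like:

# Hypothetical endings for illustration only; the real vector has the
# actual 100 URL suffixes
endings <- paste0("page", 1:100)
webpage_urls <- paste0("https://exampleurl=", endings)
head(webpage_urls, 2)
#> [1] "https://exampleurl=page1" "https://exampleurl=page2"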

and then

htmltemplate <- read_html(webpage_urls)

However, I then receive: Error: `x` must be a string of length 1

After this step, I would like to perform the following extraction:

webscraping <- htmltemplate %>%
html_nodes("td") %>%
html_text()

nth_element <- function(vector, starting_position, n) {
  vector[seq(starting_position, length(vector), n)]
}

result <- nth_element(webscraping, 10, 5) 

The extraction code all works when I run it manually for each webpage; however, I cannot repeat the process automatically for every page.

I am rather unfamiliar with loops/iteration and how to write them. Is there a way to run this extraction process for each webpage and then store the result of each extraction in a separate vector, so that I can compile them in a table? If not a loop, is there another way to automate the process so that I can get past the error demanding a single string?

  • read_html wants a single URL, not 100 of them; I think the error is clear here. Have you verified that your code works with a single URL? (i.e., read_html(webpage_urls[1])) Commented Jul 5, 2022 at 16:54
  • Yes, the code works for a single URL. My question is how to automate it so that it can perform the html read (and the following webscraping extraction) for each webpage. Do you know how to repeat/automate that function? Commented Jul 5, 2022 at 17:00
  • allresults <- lapply(webpage_urls, function(oneurl) { htmltemplate <- read_html(oneurl); ...; }) will create a list of all results, one url per list element. Commented Jul 5, 2022 at 17:10
  • Thank you, that is what I would like to do, but I am a bit confused. How does your solution fit with the rest of my code and the function? How would it look altogether? Commented Jul 5, 2022 at 17:18

1 Answer

library(rvest)  # provides read_html(), html_nodes(), html_text(), and the %>% pipe

# Keep every n-th element of a vector, starting at starting_position
nth_element <- function(vector, starting_position, n) {
  vector[seq(starting_position, length(vector), n)]
}

# Apply the same scrape to each URL; lapply collects one result per page in a list
allresults <- lapply(webpage_urls, function(oneurl) {
  read_html(oneurl) %>%
    html_nodes("td") %>%
    html_text() %>%
    nth_element(10, 5)
})
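
To compile the per-page results into a table, as the question asks, one option (a sketch, assuming endings is the vector from the question; lapply preserves the order of webpage_urls, so the labels line up) is to name the list and stack it into a long data frame in base R:

names(allresults) <- endings  # label each result by the URL ending that produced it

# One row per extracted value, tagged with the page it came from
results_table <- data.frame(
  ending = rep(endings, lengths(allresults)),
  value  = unlist(allresults, use.names = FALSE)
)

If an explicit loop is easier to follow, the lapply call above is equivalent to:

allresults <- vector("list", length(webpage_urls))  # pre-allocate the result list
for (i in seq_along(webpage_urls)) {
  allresults[[i]] <- read_html(webpage_urls[i]) %>%
    html_nodes("td") %>%
    html_text() %>%
    nth_element(10, 5)
}

lapply just does this bookkeeping for you: it calls the function once per URL and collects the returned vectors in a list.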