
I am performing web scraping in R (using rvest) for a specific set of data on various webpages. All of the webpages are formatted the same, so I can extract the targeted data from the same node on each page with no problem. However, there are 100 different web pages, all with the same URL except for the very end. Is there a way to use a loop to perform the process automatically?

I am using the following code:

webpage_urls <- paste0("https://exampleurl=", endings)

where endings is a vector of the 100 endings that give the separate webpages.
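For a reproducible sketch, hypothetical endings (the real values are specific to the pages being scraped) would look like:

# Hypothetical endings for illustration only; the real vector has the
# actual 100 URL suffixes
endings <- paste0("page", 1:100)
webpage_urls <- paste0("https://exampleurl=", endings)
head(webpage_urls, 2)
#> [1] "https://exampleurl=page1" "https://exampleurl=page2"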

and then

htmltemplate <- read_html(webpage_urls)

However, I then receive: Error: `x` must be a string of length 1

After this step, I would like to perform the following extraction:

webscraping <- htmltemplate %>%
html_nodes("td") %>%
html_text()

nth_element <- function(vector, starting_position, n) {
  vector[seq(starting_position, length(vector), n)]
}

result <- nth_element(webscraping, 10, 5) 

The extraction code all works when I run it manually for each webpage; however, I cannot repeat the process automatically for every page.

I am rather unfamiliar with loops/iteration and how to write them. Is there a way to run this extraction process for each webpage and then store the result of each extraction in a separate vector, so that I can compile them in a table? If not a loop, is there another way to automate the process so that I can get past the error demanding a single string?

  • read_html wants a single URL, not 100 of them; I think the error is clear here. Have you verified that your code works with a single URL? (i.e., read_html(webpage_urls[1])) Commented Jul 5, 2022 at 16:54
  • Yes, the code works for a single URL. My question is how to automate it so that it can perform the html read (and the following webscraping extraction) for each webpage. Do you know how to repeat/automate that function? Commented Jul 5, 2022 at 17:00
  • allresults <- lapply(webpage_urls, function(oneurl) { htmltemplate <- read_html(oneurl); ...; }) will create a list of all results, one url per list element. Commented Jul 5, 2022 at 17:10
  • Thank you, that is what I would like to do, but I am a bit confused. How does your solution fit with the rest of my code and the function? How would it look altogether? Commented Jul 5, 2022 at 17:18

1 Answer

library(rvest)  # provides read_html(), html_nodes(), html_text(), and the %>% pipe

# Keep every n-th element of a vector, starting at starting_position
nth_element <- function(vector, starting_position, n) {
  vector[seq(starting_position, length(vector), n)]
}

# Apply the same scrape to each URL; lapply collects one result per page in a list
allresults <- lapply(webpage_urls, function(oneurl) {
  read_html(oneurl) %>%
    html_nodes("td") %>%
    html_text() %>%
    nth_element(10, 5)
})
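
To compile the per-page results into a table, as the question asks, one option (a sketch, assuming endings is the vector from the question; lapply preserves the order of webpage_urls, so the labels line up) is to name the list and stack it into a long data frame in base R:

names(allresults) <- endings  # label each result by the URL ending that produced it

# One row per extracted value, tagged with the page it came from
results_table <- data.frame(
  ending = rep(endings, lengths(allresults)),
  value  = unlist(allresults, use.names = FALSE)
)

If an explicit loop is easier to follow, the lapply call above is equivalent to:

allresults <- vector("list", length(webpage_urls))  # pre-allocate the result list
for (i in seq_along(webpage_urls)) {
  allresults[[i]] <- read_html(webpage_urls[i]) %>%
    html_nodes("td") %>%
    html_text() %>%
    nth_element(10, 5)
}

lapply just does this bookkeeping for you: it calls the function once per URL and collects the returned vectors in a list.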