
I am scraping tables from a website. So far I have been scraping each web page one at a time, but since the URLs follow a pattern I am thinking of running them through a for loop.

I am trying to use the following script:

library(rvest)

for (i in 1:38) {
  webpage <- read_html(paste0("www.website.com/", i))
  data <- webpage %>%
    html_nodes("table") %>%
    .[[1]] %>%
    html_table()
}

My main issue is that the URLs I am scraping do not follow a pattern I can generate with the loop above. Instead they look like the following (if the /W weren't included it would be a lot easier): www.website.com/sample/test-01/W, www.website.com/sample/test-02/W, www.website.com/sample/test-03/W, etc.

I feel as though there is an extremely simple way to place these into the above for loop, but I am not sure of the syntax.

EDIT: one more issue is the 0 in the URL www.website.com/sample/test-01/W. I can't simply paste i after a hard-coded 0, because the numbering runs 06-07-08-09-10-11: the leading 0 disappears after 09, and www.website.com/sample/test-012/W does not exist.
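
For illustration, pasting i after a hard-coded 0 produces:

paste0("www.website.com/sample/test-0", 9:11, "/W")
# "www.website.com/sample/test-09/W"   valid
# "www.website.com/sample/test-010/W"  does not exist
# "www.website.com/sample/test-011/W"  does not exist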

  • Did you try concatenating the W part to the webpage variable?
  • Edited post to include an additional issue.

2 Answers


In order to append the /W at the end, you just need to use the paste0 function once more on the webpage variable.

for (i in 1:38) {
  webpage <- paste0("www.website.com/", i)
  temp <- paste0(webpage, "/W")
}

This makes the URLs look like this:

www.website.com/1/W
www.website.com/2/W
...

To get the zero-padded digits, you can use sprintf from base R: sprintf("%02d", i) pads each number to two digits inside the loop.

The code will look like this:

for(i in 1:38) {
  webpage <- paste0("www.website.com/", sprintf("%02d", i))
  temp <- paste0(webpage, "/W")
  print(temp)
}

Note: I've modified the code above to print each URL so you can see the result.

The output will look like this:

[1] "www.website.com/01/W"
[1] "www.website.com/02/W"
[1] "www.website.com/03/W"
[1] "www.website.com/04/W"
[1] "www.website.com/05/W"
[1] "www.website.com/06/W"
[1] "www.website.com/07/W"
[1] "www.website.com/08/W"
[1] "www.website.com/09/W"
[1] "www.website.com/10/W"
[1] "www.website.com/11/W"
[1] "www.website.com/12/W"
[1] "www.website.com/13/W"
[1] "www.website.com/14/W"
[1] "www.website.com/15/W"
[1] "www.website.com/16/W"
[1] "www.website.com/17/W"
[1] "www.website.com/18/W"
[1] "www.website.com/19/W"
[1] "www.website.com/20/W"
[1] "www.website.com/21/W"
[1] "www.website.com/22/W"
[1] "www.website.com/23/W"
[1] "www.website.com/24/W"
[1] "www.website.com/25/W"
[1] "www.website.com/26/W"
[1] "www.website.com/27/W"
[1] "www.website.com/28/W"
[1] "www.website.com/29/W"
[1] "www.website.com/30/W"
[1] "www.website.com/31/W"
[1] "www.website.com/32/W"
[1] "www.website.com/33/W"
[1] "www.website.com/34/W"
[1] "www.website.com/35/W"
[1] "www.website.com/36/W"
[1] "www.website.com/37/W"
[1] "www.website.com/38/W"



You may create a vector of URLs using sprintf -

web_urls <- sprintf('www.website.com/sample/test-%02d/W', 1:38)
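
The first few entries look like this:

head(web_urls, 3)
# "www.website.com/sample/test-01/W"
# "www.website.com/sample/test-02/W"
# "www.website.com/sample/test-03/W"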

Then use lapply to extract the table from each URL.

library(rvest)

# Read a page and return the first table on it as a data frame
extract_table <- function(url) {
  webpage <- read_html(url)
  webpage %>%
    html_nodes("table") %>%
    .[[1]] %>%
    html_table()
}

result <- lapply(web_urls, extract_table)
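
If every page's table has the same columns (an assumption the question doesn't confirm), the resulting list can then be stacked into a single data frame:

# Assumes all 38 tables share the same column structure
combined <- do.call(rbind, result)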

