
This code scrapes, from http://www.bls.gov/schedule/news_release/2015_sched.htm, every date whose entry under the Release column contains "Employment Situation".

library(rvest)

pg <- read_html("http://www.bls.gov/schedule/news_release/2015_sched.htm")

# target only the <td> elements under the bodytext div
body <- html_nodes(pg, "div#bodytext")

# use this node set and a relative XPath: find the <td> mentioning the release,
# step up to its parent row and take that row's first <td> (the date)
es_nodes <- html_nodes(body, xpath=".//td[contains(., 'Employment Situation for')]/../td[1]")

# clean up the cruft and make our dates!
nfpdates2015 <- as.Date(trimws(html_text(es_nodes)), format="%A, %B %d, %Y")

(Thanks to @hrbrmstr for this code.)

I would like to repeat this for other URLs covering other years, which are named the same way with only the year changing. Specifically, for the following URLs:

#From 2008 to 2015
http://www.bls.gov/schedule/news_release/2015_sched.htm
http://www.bls.gov/schedule/news_release/2014_sched.htm
...
http://www.bls.gov/schedule/news_release/2008_sched.htm

My knowledge of rvest, HTML, and XML is almost non-existent. I tried applying the same code with a for loop, but my efforts were futile. Of course I could just repeat the 2015 code eight times to cover all the years; it would take neither too long nor too much space. Still, I am very curious how this could be done more efficiently. Thank you.

2 Answers

7

In a loop you would change the URL string using a paste0 statement:

for(i in 2008:2015){

  url <- paste0("http://www.bls.gov/schedule/news_release/", i, "_sched.htm")
  pg <- read_html(url)

  ## all your other code goes here.

}
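
If you also want the loop itself to keep every year's dates rather than overwriting them on each pass, one option is to fill a year-named list as you go. This is a sketch building on the code above, not part of the original answer:

library(rvest)

years <- 2008:2015
results <- setNames(vector("list", length(years)), years)

for (i in years) {
  url <- paste0("http://www.bls.gov/schedule/news_release/", i, "_sched.htm")
  pg <- read_html(url)
  body <- html_nodes(pg, "div#bodytext")
  es_nodes <- html_nodes(body, xpath = ".//td[contains(., 'Employment Situation for')]/../td[1]")
  # store this year's dates under its own name instead of overwriting a single variable
  results[[as.character(i)]] <- as.Date(trimws(html_text(es_nodes)), format = "%A, %B %d, %Y")
}

results[["2010"]]   # dates for a single year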

Or use lapply to return a list of the results:

lst <- lapply(2008:2015, function(x){
  url <- paste0("http://www.bls.gov/schedule/news_release/", x, "_sched.htm")

  ## all your other code goes here.
  pg <- read_html(url)

  # target only the <td> elements under the bodytext div
  body <- html_nodes(pg, "div#bodytext")

  # we use this new set of nodes and a relative XPath to get the initial <td> elements, then get their siblings
  es_nodes <- html_nodes(body, xpath=".//td[contains(., 'Employment Situation for')]/../td[1]")

  # clean up the cruft and make our dates!
  nfpdates <- as.Date(trimws(html_text(es_nodes)), format="%A, %B %d, %Y")
  return(nfpdates)
})

Which returns

 lst
[[1]]
 [1] "2008-01-04" "2008-02-01" "2008-03-07" "2008-04-04" "2008-05-02" "2008-06-06" "2008-07-03" "2008-08-01" "2008-09-05"
[10] "2008-10-03" "2008-11-07" "2008-12-05"

[[2]]
 [1] "2009-01-09" "2009-02-06" "2009-03-06" "2009-04-03" "2009-05-08" "2009-06-05" "2009-07-02" "2009-08-07" "2009-09-04"
[10] "2009-10-02" "2009-11-06" "2009-12-04"

## etc...
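
If you would rather have a single object than a list of eight vectors (the follow-up raised in the comments below), here is a small sketch, not part of the original answer: name the list by year and combine it with c(), which keeps the Date class.

names(lst) <- 2008:2015

# do.call(c, ...) preserves the Date class (plain unlist() would drop it)
all_dates <- do.call(c, lst)
length(all_dates)   # one element per Employment Situation release, 2008-2015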

2 Comments

Thank you Symbolix! Both methods work great, and lapply is considerably faster. Yet neither stores the dates for all 8 years in a single variable (or in 8 separate variables); in both cases nfpdates only keeps the last year (i.e. 2015). How could this be achieved?
@Gracos the lapply returns a list (of length 8). If you assign the lapply to a variable you can then access all the returned results. See my update.
6

This can be done with sprintf (no loop is needed to build the URLs):

url <- sprintf("http://www.bls.gov/schedule/news_release/%d_sched.htm", 2008:2015)
url
#[1] "http://www.bls.gov/schedule/news_release/2008_sched.htm" "http://www.bls.gov/schedule/news_release/2009_sched.htm"
#[3] "http://www.bls.gov/schedule/news_release/2010_sched.htm" "http://www.bls.gov/schedule/news_release/2011_sched.htm"
#[5] "http://www.bls.gov/schedule/news_release/2012_sched.htm" "http://www.bls.gov/schedule/news_release/2013_sched.htm"
#[7] "http://www.bls.gov/schedule/news_release/2014_sched.htm" "http://www.bls.gov/schedule/news_release/2015_sched.htm"

and if we need to read the links

library(rvest)
lst <-  lapply(url, function(x) {

   pg <- read_html(x)
   body <- html_nodes(pg, "div#bodytext")
   es_nodes <- html_nodes(body, xpath=".//td[contains(., 'Employment Situation for')]/../td[1]")

   nfpdates <- as.Date(trimws(html_text(es_nodes)), format="%A, %B %d, %Y")
   nfpdates
  })

head(lst, 3)
#[[1]]
# [1] "2008-01-04" "2008-02-01" "2008-03-07" "2008-04-04" "2008-05-02" "2008-06-06" "2008-07-03" "2008-08-01"
# [9] "2008-09-05" "2008-10-03" "2008-11-07" "2008-12-05"

#[[2]]
# [1] "2009-01-09" "2009-02-06" "2009-03-06" "2009-04-03" "2009-05-08" "2009-06-05" "2009-07-02" "2009-08-07"
# [9] "2009-09-04" "2009-10-02" "2009-11-06" "2009-12-04"

#[[3]]
# [1] "2010-01-08" "2010-02-05" "2010-03-05" "2010-04-02" "2010-05-07" "2010-06-04" "2010-07-02" "2010-08-06"
# [9] "2010-09-03" "2010-10-08" "2010-11-05" "2010-12-03"

4 Comments

Thanks very much akrun, much appreciated. Your answer is almost identical to Symbolix's; I'll accept his as it came in first. I'm out of upvotes for the next 20 hours.
@Gracos Yes, it is just paste vs sprintf. Otherwise, you have almost done all the groundwork.
@akrun off the top of your head, do you know of any benefit to using either of sprintf or paste0?
@Symbolix I don't think there is any difference in speed, but sprintf fills a single template string, while with paste0 we paste multiple substrings together.
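
To make that last point concrete, here is a small illustration (not from the thread itself): both calls produce exactly the same vector of URLs; the difference is only whether you fill one template string or paste the pieces together.

u1 <- sprintf("http://www.bls.gov/schedule/news_release/%d_sched.htm", 2008:2015)
u2 <- paste0("http://www.bls.gov/schedule/news_release/", 2008:2015, "_sched.htm")
identical(u1, u2)
# [1] TRUE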
