
This code scrapes, from http://www.bls.gov/schedule/news_release/2015_sched.htm, every date whose entry under the Release column contains "Employment Situation".

library(rvest)

pg <- read_html("http://www.bls.gov/schedule/news_release/2015_sched.htm")

# target only the <td> elements under the bodytext div
body <- html_nodes(pg, "div#bodytext")

# use this node set and a relative XPath: find the <td> mentioning the release,
# step up to its parent row and take that row's first <td> (the date)
es_nodes <- html_nodes(body, xpath=".//td[contains(., 'Employment Situation for')]/../td[1]")

# clean up the cruft and make our dates!
nfpdates2015 <- as.Date(trimws(html_text(es_nodes)), format="%A, %B %d, %Y")

(Thanks to @hrbrmstr for this code.)

I would like to repeat this for other URLs covering other years, which are named the same way with only the year changing. Specifically, for the following URLs:

#From 2008 to 2015
http://www.bls.gov/schedule/news_release/2015_sched.htm
http://www.bls.gov/schedule/news_release/2014_sched.htm
...
http://www.bls.gov/schedule/news_release/2008_sched.htm

My knowledge of rvest, HTML, and XML is almost non-existent. I tried applying the same code with a for loop, but my efforts were futile. Of course I could just repeat the 2015 code eight times to cover all the years; it would take neither too long nor too much space. Still, I am very curious how this could be done more efficiently. Thank you.

2 Answers

7

In a loop you would change the URL string using a paste0 statement:

for(i in 2008:2015){

  url <- paste0("http://www.bls.gov/schedule/news_release/", i, "_sched.htm")
  pg <- read_html(url)

  ## all your other code goes here.

}
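
If you also want the loop itself to keep every year's dates rather than overwriting them on each pass, one option is to fill a year-named list as you go. This is a sketch building on the code above, not part of the original answer:

library(rvest)

years <- 2008:2015
results <- setNames(vector("list", length(years)), years)

for (i in years) {
  url <- paste0("http://www.bls.gov/schedule/news_release/", i, "_sched.htm")
  pg <- read_html(url)
  body <- html_nodes(pg, "div#bodytext")
  es_nodes <- html_nodes(body, xpath = ".//td[contains(., 'Employment Situation for')]/../td[1]")
  # store this year's dates under its own name instead of overwriting a single variable
  results[[as.character(i)]] <- as.Date(trimws(html_text(es_nodes)), format = "%A, %B %d, %Y")
}

results[["2010"]]   # dates for a single year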

Or use lapply to return a list of the results:

lst <- lapply(2008:2015, function(x){
  url <- paste0("http://www.bls.gov/schedule/news_release/", x, "_sched.htm")

  ## all your other code goes here.
  pg <- read_html(url)

  # target only the <td> elements under the bodytext div
  body <- html_nodes(pg, "div#bodytext")

  # we use this new set of nodes and a relative XPath to get the initial <td> elements, then get their siblings
  es_nodes <- html_nodes(body, xpath=".//td[contains(., 'Employment Situation for')]/../td[1]")

  # clean up the cruft and make our dates!
  nfpdates <- as.Date(trimws(html_text(es_nodes)), format="%A, %B %d, %Y")
  return(nfpdates)
})

Which returns

 lst
[[1]]
 [1] "2008-01-04" "2008-02-01" "2008-03-07" "2008-04-04" "2008-05-02" "2008-06-06" "2008-07-03" "2008-08-01" "2008-09-05"
[10] "2008-10-03" "2008-11-07" "2008-12-05"

[[2]]
 [1] "2009-01-09" "2009-02-06" "2009-03-06" "2009-04-03" "2009-05-08" "2009-06-05" "2009-07-02" "2009-08-07" "2009-09-04"
[10] "2009-10-02" "2009-11-06" "2009-12-04"

## etc...
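
If you would rather have a single object than a list of eight vectors (the follow-up raised in the comments below), here is a small sketch, not part of the original answer: name the list by year and combine it with c(), which keeps the Date class.

names(lst) <- 2008:2015

# do.call(c, ...) preserves the Date class (plain unlist() would drop it)
all_dates <- do.call(c, lst)
length(all_dates)   # one element per Employment Situation release, 2008-2015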

2 Comments

Thank you Symbolix! Both methods work great, and lapply is considerably faster. Yet neither stores the dates for all 8 years in a single variable (or in 8 separate variables); in both cases nfpdates only keeps the last year (i.e. 2015). How could this be achieved?
@Gracos the lapply returns a list (of length 8). If you assign the lapply to a variable you can then access all the returned results. See my update.
6

This can be done with sprintf (no loop is needed to build the URLs):

url <- sprintf("http://www.bls.gov/schedule/news_release/%d_sched.htm", 2008:2015)
url
#[1] "http://www.bls.gov/schedule/news_release/2008_sched.htm" "http://www.bls.gov/schedule/news_release/2009_sched.htm"
#[3] "http://www.bls.gov/schedule/news_release/2010_sched.htm" "http://www.bls.gov/schedule/news_release/2011_sched.htm"
#[5] "http://www.bls.gov/schedule/news_release/2012_sched.htm" "http://www.bls.gov/schedule/news_release/2013_sched.htm"
#[7] "http://www.bls.gov/schedule/news_release/2014_sched.htm" "http://www.bls.gov/schedule/news_release/2015_sched.htm"

and if we need to read the links

library(rvest)
lst <-  lapply(url, function(x) {

   pg <- read_html(x)
   body <- html_nodes(pg, "div#bodytext")
   es_nodes <- html_nodes(body, xpath=".//td[contains(., 'Employment Situation for')]/../td[1]")

   nfpdates <- as.Date(trimws(html_text(es_nodes)), format="%A, %B %d, %Y")
   nfpdates
  })

head(lst, 3)
#[[1]]
# [1] "2008-01-04" "2008-02-01" "2008-03-07" "2008-04-04" "2008-05-02" "2008-06-06" "2008-07-03" "2008-08-01"
# [9] "2008-09-05" "2008-10-03" "2008-11-07" "2008-12-05"

#[[2]]
# [1] "2009-01-09" "2009-02-06" "2009-03-06" "2009-04-03" "2009-05-08" "2009-06-05" "2009-07-02" "2009-08-07"
# [9] "2009-09-04" "2009-10-02" "2009-11-06" "2009-12-04"

#[[3]]
# [1] "2010-01-08" "2010-02-05" "2010-03-05" "2010-04-02" "2010-05-07" "2010-06-04" "2010-07-02" "2010-08-06"
# [9] "2010-09-03" "2010-10-08" "2010-11-05" "2010-12-03"

4 Comments

Thanks very much akrun, much appreciated. Your answer is almost identical to Symbolix's; I'll accept his as it came in first. I'm out of upvotes for the next 20 hours.
@Gracos Yes, it is just paste vs sprintf. Otherwise, you have almost done all the groundwork.
@akrun off the top of your head, do you know of any benefit to using either of sprintf or paste0?
@Symbolix I don't think there is any difference in speed, but sprintf fills a single template string, while with paste0 we paste multiple substrings together.
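
To make that last point concrete, here is a small illustration (not from the thread itself): both calls produce exactly the same vector of URLs; the difference is only whether you fill one template string or paste the pieces together.

u1 <- sprintf("http://www.bls.gov/schedule/news_release/%d_sched.htm", 2008:2015)
u2 <- paste0("http://www.bls.gov/schedule/news_release/", 2008:2015, "_sched.htm")
identical(u1, u2)
# [1] TRUE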
