I am trying to extract URLs from the website below. The tricky part is that the site automatically loads additional results as you scroll. I have not managed to write an XPath that scrapes all URLs, including those on the dynamically loaded pages - I only get the first 15 URLs (of more than 70). I assume the XPath in the last line (new_results ...) is missing some crucial element to also pick up the later pages. Any ideas? Thank you!
# load packages
library(rvest)
library(httr)
library(RCurl)
library(XML)
library(stringr)
library(xml2)
# aim: download all speeches stored at:
# https://sheikhmohammed.ae/en-us/Speeches
# first, create vector which stores all urls to each single speech
all_links <- character()
new_results <- "/en-us/Speeches"
signatures <- system.file("CurlSSL", "cacert.pem", package = "RCurl")
options(RCurlOptions = list(verbose = FALSE, capath = system.file("CurlSSL", "cacert.pem", package = "RCurl"), ssl.verifypeer = FALSE))
while (length(new_results) > 0) {
  # turn the relative path(s) into absolute URLs
  new_results <- str_c("https://sheikhmohammed.ae", new_results)
  results <- getURL(new_results, cainfo = signatures)
  results_tree <- htmlParse(results)
  # collect the speech URLs stored in the data-url attribute of each share board
  all_links <- c(all_links, xpathSApply(results_tree, "//div[@class='speech-share-board']", xmlGetAttr, "data-url"))
  # supposed to return the link to the next batch of speeches, but this XPath
  # never matches, so the loop stops after the first 15 results
  new_results <- xpathSApply(results_tree, "//div[@class='speech-share-board']//after", xmlGetAttr, "data-url")
}
# or, alternatively, with phantomjs (this also loads only the first 15 urls):
url <- "https://sheikhmohammed.ae/en-us/Speeches#"
# write out a script phantomjs can process
writeLines(sprintf("var page = require('webpage').create();
page.open('%s', function () {
console.log(page.content); //page source
phantom.exit();
});", url), con="scrape.js")
# process it with phantomjs
write(readLines(pipe("phantomjs scrape.js", "r")), "scrape.html")
From the comments: try grep('^\\/?en-us\\/', x, perl = TRUE, value = TRUE) on the extracted hrefs and then xml2::url_absolute(). Or, if the site is dynamic, you'll need to use Selenium or the phantomjs webdriver. Alternatively, go to the site, right-click to inspect element, and look at the actual HTML markup to find what you need and build a function out of that. It may also be worth checking the //comment() nodes to see whether the SpeechID href links are hidden inside those.