
I am trying to extract URLs from the website below. The tricky part is that the site lazy-loads new pages automatically as you scroll. I have not managed to find an XPath that scrapes all URLs, including those on the newly loaded pages - I only get the first 15 URLs (of more than 70). I assume the XPath in the last line of the loop (new_results ...) is missing some crucial element to also account for the pages loaded afterwards. Any ideas? Thank you!

# load packages
library(rvest)
library(httr)
library(RCurl)
library(XML)
library(stringr)
library(xml2)


# aim: download all speeches stored at:
# https://sheikhmohammed.ae/en-us/Speeches

# first, create vector which stores all urls to each single speech
all_links <- character() 
new_results <- "/en-us/Speeches"
signatures <- system.file("CurlSSL", "cacert.pem", package = "RCurl")
options(RCurlOptions = list(verbose = FALSE,
                            capath = system.file("CurlSSL", "cacert.pem", package = "RCurl"),
                            ssl.verifypeer = FALSE))

while (length(new_results) > 0) {
  new_results <- str_c("https://sheikhmohammed.ae", new_results)
  results <- getURL(new_results, cainfo = signatures)
  results_tree <- htmlParse(results)
  all_links <- c(all_links, xpathSApply(results_tree, "//div[@class='speech-share-board']", xmlGetAttr, "data-url"))
  new_results <- xpathSApply(results_tree, "//div[@class='speech-share-board']//after", xmlGetAttr, "data-url")
}

# or, alternatively with phantomjs (also here, it loads only first 15 urls):
url <- "https://sheikhmohammed.ae/en-us/Speeches#"

# write out a script phantomjs can process
writeLines(sprintf("var page = require('webpage').create();
page.open('%s', function () {
  console.log(page.content); // page source
  phantom.exit();
});", url), con = "scrape.js")

# process it with phantomjs
write(readLines(pipe("phantomjs scrape.js", "r")), "scrape.html")
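
# The saved page can then be parsed in the usual way - a minimal sketch, assuming
# the speech containers keep the speech-share-board class and data-url attribute
# used in the XPath above (rvest/xml2 are already loaded at the top):
page  <- read_html("scrape.html")
links <- page %>%
  html_nodes("div.speech-share-board") %>%
  html_attr("data-url")

# make relative links absolute against the site root
links <- url_absolute(links, "https://sheikhmohammed.ae")
length(links)  # still only 15 here - the lazy-loaded speeches are missing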
Comments
  • I'm not going to hit that website from my servers, but a few options: simply get ALL of the site's links with the XPath '//*[@href]', then grep('^\\/?en-us\\/', x, perl = TRUE, value = TRUE), then xml2::url_absolute (a sketch of this is included after these comments). Or, if the site is dynamic, you'll need to use a Selenium or phantomjs webdriver. Alternatively, go to the site, right-click to inspect an element, look at the actual HTML markup to find what you need, and build a function from that. Commented Apr 17, 2017 at 15:32
  • @Carl Boneri: Thank you for these hints; I tried them all but have not succeeded so far. I added my attempts with phantomjs, yet it also only loads the first 15 URLs... can you see what is missing in the code? (The website is the official website of the government of the United Arab Emirates.) Commented Apr 17, 2017 at 16:36
  • Find a phantomjs example online where the output is sent to an html file, and go from there; otherwise everything looks correct. Alternatively, perhaps get the //comment() nodes and see if links are inside those. Commented Apr 17, 2017 at 16:43
  • There appear to only be 15 speeches on that website with a unique SpeechID href? Commented Apr 17, 2017 at 17:32
  • @Carl Boneri: Per the HTML markup, each speech has its own SpeechID. Yet the code gets stuck in a loop extracting only the first 15 SpeechIDs; if not stopped manually, it keeps repeating those 15 IDs. I guess I'm missing the correct XPath for the lazy-loading mechanism. Commented Apr 17, 2017 at 22:26
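
# For reference, a small sketch of the first comment's suggestion (collect every
# href, keep the /en-us/ ones, make them absolute), using xml2 which is already
# loaded above:
doc   <- read_html("https://sheikhmohammed.ae/en-us/Speeches")
hrefs <- xml_attr(xml_find_all(doc, "//*[@href]"), "href")
hrefs <- grep("^\\/?en-us\\/", hrefs, perl = TRUE, value = TRUE)
hrefs <- unique(url_absolute(hrefs, "https://sheikhmohammed.ae"))
# like the loop above, this only sees links present in the initially served HTML,
# so the lazy-loaded speeches beyond the first 15 will still be missing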

1 Answer


Running the JavaScript for lazy loading in RSelenium (or Selenium in Python) would be the most elegant way to solve the problem. Yet, as a less elegant but faster alternative, one can manually change the settings of the JSON query in the Firefox developer tools / Network panel so that it loads not just 15 but all speeches at once. This worked fine for me, and I was able to extract all the links from the JSON response.
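
For the RSelenium route, a rough sketch of the idea (the selector, timings, and rsDriver() setup below are assumptions that may need adjusting): scroll to the bottom repeatedly until no new speech containers appear, then harvest the rendered page source.

library(RSelenium)
library(rvest)

rd <- rsDriver(browser = "firefox", verbose = FALSE)   # starts a Selenium server + browser
remDr <- rd$client
remDr$navigate("https://sheikhmohammed.ae/en-us/Speeches")

n_old <- -1
repeat {
  # trigger the lazy loader by scrolling to the bottom of the page
  remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")
  Sys.sleep(2)   # give the next batch of speeches time to load
  n_new <- length(remDr$findElements("css selector", "div.speech-share-board"))
  if (n_new == n_old) break   # nothing new appeared, assume all speeches are loaded
  n_old <- n_new
}

all_links <- read_html(remDr$getPageSource()[[1]]) %>%
  html_nodes("div.speech-share-board") %>%
  html_attr("data-url")

remDr$close(); rd$server$stop()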


2 Comments

Sounds like a useful approach. Could you explain how exactly you do this in Firefox (or another browser)?
@Steve G.Jones, for Firefox and for this particular website (it might be different in other cases): go to the website, open the inspection tool, go to the Network analysis feature, scroll the lazy-loading page so the JSON query starts, wait until it has loaded, then click to edit and re-send the query, edit the request body, scroll to <row limit>, change it from 15 to 100 or more, re-send the query, and get the links to all speeches from the JSON response (for this, go to the 4th field after headers, cookies, parameters). Extract all links with the usual tools.
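
The same request can also be replayed from R once its details are known. The sketch below is purely illustrative: the endpoint URL and the body fields are placeholders, not the site's real API, and have to be copied from the request captured in the Network panel.

library(httr)
library(jsonlite)

# placeholder endpoint and field names -- copy the real URL, headers and
# request body from the captured request in the Firefox Network panel
speeches_endpoint <- "https://sheikhmohammed.ae/<endpoint-from-network-panel>"
resp <- POST(speeches_endpoint,
             body = list(rowLimit = 100),   # the "<row limit>" field, raised from 15
             encode = "json")
speeches <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
# the speech urls can then be pulled out of whichever field of the JSON
# response holds them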
