Scrape data links from a JavaScript page with R

Question

Please help me.

I am new to web scraping in R. I want to collect the download links for the data tables on this website: http://burkinafaso.opendataforafrica.org/. My project is to make these data more accessible.

On the Donnée page there is a list of sectors, for example Agriculture (43 tables), Public Aid (7 tables), ...

When I click on Agriculture I get the dataset list. https://drive.google.com/open?id=1cInWz62HjbcpgJ00rK-8Q-0p71mC59hq

I want to get the list of these titles and, for each title, the download link of the dataset.

I tried the code below to inspect the structure of the site, but I cannot see a structure that would let me extract these links.

library(RCurl)
library(XML)
library(rvest)
URL <- "http://burkinafaso.opendataforafrica.org/"
pg <- read_html(URL)
p <- html_children(pg)[1]
pp <- html_children(pg)[2]
html_structure(p)
html_structure(pp)

library(RCurl)
library(XML)
library(rvest)
URL <- "http://burkinafaso.opendataforafrica.org/data/#topic=Agriculture"
pg <- read_html(URL)
p <- html_children(pg)[1]
pp <- html_children(pg)[2]
html_structure(p)
html_structure(pp)

For example, I tried this code to collect the links in the a tags, but it does not return the different download links.

URL <- "http://burkinafaso.opendataforafrica.org/data/#topic=Agriculture"
pg <- read_html(URL)
all.url <- html_attr(html_nodes(pg, "a"), "href")
all.url <- as.data.frame(all.url)

As a result, I expect the list of tables and download links for each item. For example:

For Public Aid (7):

label | links
Aide extérieure par secteur de 1995 à 2006 (en millions de FCFA) | download links
Aide extérieure par type (en millions de FCFA) | download links

Please help me.

QHarr · Accepted Answer · 2019-09-20 19:41:06Z

Web traffic and API call:

So, if you start, for example, with

http://burkinafaso.opendataforafrica.org/data/#menu=topic

You can see the list of all the top-level links along with the counts of their datasets. If you click on Aide Publique (7), you then see its 7 sections; clicking on any of them presents you with Select dataset.

If you monitor the web traffic when doing that first click, you will see the API POST request made to retrieve the data for Aide Publique (7). It goes to http://burkinafaso.opendataforafrica.org/api/1.0/meta/dataset.

If we inspect the request further, we can observe query string parameters in the URL and a request payload.

The query string parameters are basically some information about the client, which we can probably remove. A little experimentation shows that if we exclude the payload, we actually get all the topics and not just Aide Publique (7).
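As a minimal sketch of that payload-free request, the following builds the POST with the Python standard library only. The endpoint is the one used in the code further down; the helper names `build_request` and `fetch_all_datasets` are hypothetical, and the assumption that the body is a JSON array encoded as UTF-8 is based only on what was observed here.

```python
import json
import urllib.request

# Endpoint observed in the browser's network tab for this site.
API_URL = "http://burkinafaso.opendataforafrica.org/api/1.0/meta/dataset"


def build_request(url=API_URL):
    """Build a POST request with an empty body (no topic filter)."""
    # An empty bytes body plus method="POST" sends a payload-free POST.
    return urllib.request.Request(url, data=b"", method="POST")


def fetch_all_datasets(url=API_URL, timeout=30):
    """Hypothetical helper: send the POST and parse the JSON array."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Nothing is sent until `fetch_all_datasets()` is called; it should return one list entry per dataset if the server still behaves as described.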

The API response:

The response is JSON, and more precisely an array, which R parses into a list. Each item in the list carries, among other fields, an id, a name and a title.

Compare that information with the actual dataset links. For example, clicking on Select dataset yields the final URL:

http://burkinafaso.opendataforafrica.org/nthpfqd/aide-ext%C3%A9rieure-par-secteur-de-1995-%C3%A0-2006-en-millions-de-fcfa

A quick comparison with the first item in the list shows that this URL, once decoded:

http://burkinafaso.opendataforafrica.org/nthpfqd/aide-extérieure-par-secteur-de-1995-à-2006-en-millions-de-fcfa

is of the format:

'http://burkinafaso.opendataforafrica.org/{item["id"]}/{item["title"]}'

This means that, in a loop over the JSON response, we can generate the final links by concatenating a base string with the current item's id and title. We can also take the display title from the current item's name. We can use purrr's map_df to handle the loop and build the final data frame, and httr to make the POST.
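The comparison above can be checked with a short sketch. The `item` dictionary is a hand-written sample that only mirrors the three fields used in this answer (`id`, `name`, `title`); its `name` value is assumed from the expected output in the question, and `build_dataset_url` is a hypothetical helper.

```python
from urllib.parse import unquote

BASE_URL = "http://burkinafaso.opendataforafrica.org"

# URL reached after clicking "Select dataset" (percent-encoded).
encoded_url = (
    "http://burkinafaso.opendataforafrica.org/nthpfqd/"
    "aide-ext%C3%A9rieure-par-secteur-de-1995-%C3%A0-2006-en-millions-de-fcfa"
)

# Hand-written sample item (assumption), not real API output.
item = {
    "id": "nthpfqd",
    "name": "Aide extérieure par secteur de 1995 à 2006 (en millions de FCFA)",
    "title": "aide-extérieure-par-secteur-de-1995-à-2006-en-millions-de-fcfa",
}


def build_dataset_url(item):
    """Join the base URL, the item id and the item title (slug)."""
    return f"{BASE_URL}/{item['id']}/{item['title']}"


# Compare the decoded browser URL with the constructed one.
print(unquote(encoded_url) == build_dataset_url(item))
```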

R:

library(httr)
library(purrr)

r <- content(POST("http://burkinafaso.opendataforafrica.org/api/1.0/meta/dataset"))

df <- map_df(r, function(item) {

  data.frame(title = item$name,
             url = paste0("http://burkinafaso.opendataforafrica.org/", item$id,'/',item$title),
             stringsAsFactors=FALSE)
})

View(df)

Py:

import requests
import pandas as pd

r = requests.post('http://burkinafaso.opendataforafrica.org/api/1.0/meta/dataset').json()
df = pd.DataFrame([(item['name'], f'http://burkinafaso.opendataforafrica.org/{item["id"]}/{item["title"]}') for item in r]
                  ,columns = ['Title','Url'])
print(df)
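The titles contain accented characters, so some HTTP clients may need the generated URLs percent-encoded before requesting them. A minimal sketch, assuming UTF-8 percent-encoding as seen in the browser URL above; `encode_dataset_url` is a hypothetical helper, not part of the API.

```python
from urllib.parse import quote

BASE_URL = "http://burkinafaso.opendataforafrica.org"


def encode_dataset_url(dataset_id, title):
    """Percent-encode the id and title segments and append them to the base URL."""
    # safe="" also escapes "/" so each value stays a single path segment.
    return f"{BASE_URL}/{quote(dataset_id, safe='')}/{quote(title, safe='')}"


url = encode_dataset_url(
    "nthpfqd",
    "aide-extérieure-par-secteur-de-1995-à-2006-en-millions-de-fcfa",
)
print(url)
```

Encoding each segment separately keeps the slashes between segments intact while escaping anything unusual inside a title.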

edited Sep 20, 2019 at 19:41

answered Sep 20, 2019 at 17:13

QHarr

9 Comments

Armel Soubeiga Over a year ago

Thank you very much for your quick response @QHarr. It already helps me a lot and answers my question. Now, how can I download the data? I wrote a function to download the first link, but it returns the HTML (XML) code of the page. Here is the first link that your script retrieves:

link <- 'http://burkinafaso.opendataforafrica.org/ajcalpd/accidents-de-la-circulation-constatés-par-la-gendarmerie-nationale'

Armel Soubeiga Over a year ago

I wrote this code to download it:

downloadImages <- function(files, name, outPath) {
  for (i in seq_along(files)) {
    # Download the i-th link; the file is numbered by its position
    download.file(files[i], destfile = paste0(outPath, "/", name, "_", i, ".csv"), mode = "wb")
  }
}

outPath <- "E:/opendata"
name <- "data"
downloadImages(link, name, outPath)

QHarr Over a year ago

I don't think the links I am extracting initiate a download; I don't see anything that would. For 'burkinafaso.opendataforafrica.org/ajcalpd/…', can you tell me what download link you were expecting?

Armel Soubeiga Over a year ago

I want the download link of this dataset: drive.google.com/open?id=10dD8RCHQjR6gBPZeCsktPb3vr-DL1V2-. When you follow the link, there is a button on the right that lets you download the table: drive.google.com/open?id=1gvg2O49Dv0qYLvHF6a0OOz4THdVfK2HD

QHarr Over a year ago

For this page burkinafaso.opendataforafrica.org/ajcalpd/… this is what I see: tmpfiles.org/download/26709/Untitled.png
