Please help me.

I am new to web scraping in R. I want to collect the download links for the data tables on this site (http://burkinafaso.opendataforafrica.org/). My project is to make these data more accessible.

Here is the website: http://burkinafaso.opendataforafrica.org/

On the Donnée page there is a list of sectors, for example: Agriculture: 43 tables, Public Aid: 7 tables, ...

When I click on Agriculture I get the dataset list. https://drive.google.com/open?id=1cInWz62HjbcpgJ00rK-8Q-0p71mC59hq

  1. I want to get the list of these titles.
  2. For each title get the download link of the dataset.

I tried the code below to inspect the structure of the site, but I do not see any structure that would let me extract these links.

library(RCurl)
library(XML)
library(rvest)

# Home page
URL <- "http://burkinafaso.opendataforafrica.org/"
pg <- read_html(URL)
p <- html_children(pg)[1]
pp <- html_children(pg)[2]
html_structure(p)
html_structure(pp)

# Agriculture topic page
URL <- "http://burkinafaso.opendataforafrica.org/data/#topic=Agriculture"
pg <- read_html(URL)
p <- html_children(pg)[1]
pp <- html_children(pg)[2]
html_structure(p)
html_structure(pp)

For example, I tried this code to extract the links in a tags, but I do not get the different download links.

URL <- "http://burkinafaso.opendataforafrica.org/data/#topic=Agriculture"
pg <- read_html(URL)
all.url <- html_attr(html_nodes(pg, "a"), "href")
all.url <- as.data.frame(all.url)

As a result, I expect, for each item, the list of tables and their download links. For example:

For Public Aid (7):

label | links
Aide extérieure par secteur de 1995 à 2006 (en millions de FCFA) | download links
Aide extérieure par type (en millions de FCFA) | download links

Please help me.

1 Answer

Web traffic and API call:

So, if you start, for example, with

http://burkinafaso.opendataforafrica.org/data/#menu=topic

You can see the list of all the top-level topics along with their dataset counts. Clicking Aide Publique (7) shows its 7 entries, and clicking any of them presents a Select dataset option.

If you monitor the web traffic during that first click, you will see the API POST request made to retrieve the data for Aide Publique (7).

Inspecting the request further shows the query string parameters in the URL and the request payload.

The query string parameters are basically information about the client, which can probably be dropped. A little experimentation shows that if we also omit the payload, we get the datasets for all topics and not just Aide Publique (7).
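The same traffic inspection explains why reading the page URL directly returns no dataset links: everything after the # is a URL fragment, which the browser handles client-side and does not send to the server. A minimal sketch of that split, using only Python's standard library (the variable names are illustrative):

```python
from urllib.parse import urldefrag

page_url = "http://burkinafaso.opendataforafrica.org/data/#topic=Agriculture"

# urldefrag separates the part that is requested from the server
# from the fragment, which stays in the browser
request_url, fragment = urldefrag(page_url)

print(request_url)
print(fragment)
```

So a plain HTTP fetch of that URL only ever requests the /data/ page, and the topic listing has to come from the API call seen in the network traffic.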


The API response:

The response is JSON and is an array, which R parses into a list. Each item in the list includes, among other fields, an id, a name, and a title.

Compare that information with the actual dataset links. For example, clicking Select dataset leads to the URL

http://burkinafaso.opendataforafrica.org/nthpfqd/aide-ext%C3%A9rieure-par-secteur-de-1995-%C3%A0-2006-en-millions-de-fcfa

A quick comparison with the first item in the JSON list shows that this URL, once decoded,

http://burkinafaso.opendataforafrica.org/nthpfqd/aide-extérieure-par-secteur-de-1995-à-2006-en-millions-de-fcfa

is of the format:

'http://burkinafaso.opendataforafrica.org/{item["id"]}/{item["title"]}'

This means that, looping over the JSON response, we can generate the final links by concatenating a base string with the current item's id and title. The display title comes from the current item's name. We can use purrr's map_df to handle the loop and build the final data frame, and httr to make the POST request.
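The relationship between the decoded and the percent-encoded form of the URL can be sketched as follows. This is only an illustration: build_dataset_url and sample_item are hypothetical names, and the sample item contains just the two fields taken from the example URL above, not a full API item.

```python
from urllib.parse import quote

BASE_URL = "http://burkinafaso.opendataforafrica.org"


def build_dataset_url(item):
    """Return the percent-encoded page URL for one item of the API response."""
    # quote() UTF-8 encodes non-ASCII characters and leaves hyphens untouched
    return f"{BASE_URL}/{item['id']}/{quote(item['title'])}"


# Hypothetical sample item assembled from the example URL above
sample_item = {
    "id": "nthpfqd",
    "title": "aide-extérieure-par-secteur-de-1995-à-2006-en-millions-de-fcfa",
}

print(build_dataset_url(sample_item))
```

Encoding the title this way reproduces the form the browser shows after clicking through, which may matter for tools that do not accept accented characters in a URL.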


R:

library(httr)
library(purrr)

# POST with no payload returns the metadata for every dataset
r <- content(POST("http://burkinafaso.opendataforafrica.org/api/1.0/meta/dataset"))

# Build one row per dataset: display title and page URL
df <- map_df(r, function(item) {
  data.frame(title = item$name,
             url = paste0("http://burkinafaso.opendataforafrica.org/", item$id, "/", item$title),
             stringsAsFactors = FALSE)
})

View(df)

Py:

import requests
import pandas as pd

BASE = 'http://burkinafaso.opendataforafrica.org'

# POST with no payload returns the metadata for every dataset
r = requests.post(f'{BASE}/api/1.0/meta/dataset').json()

# One (title, url) row per dataset
df = pd.DataFrame(
    [(item['name'], f'{BASE}/{item["id"]}/{item["title"]}') for item in r],
    columns=['Title', 'Url'],
)
print(df)

9 Comments

Thank you very much for your quick response @QHarr. It already helps me a lot and answers my question. Now, how can I download the data? I wrote a function to download the first link, but it returns the HTML (XML) source of the page. Here is the first link that your script retrieves: link <- 'http://burkinafaso.opendataforafrica.org/ajcalpd/accidents-de-la-circulation-constatés-par-la-gendarmerie-nationale'
I used this code to download it: downloadImages <- function(files, name, outPath){ for(i in seq_along(files)){ download.file(files[i], destfile = paste0(outPath, "/", name, "_", i, ".csv"), mode = 'wb') } }; outPath="E:/opendata"; name="data"; downloadImages(link, name, outPath)
I don't think the links I am extracting initiate a download; I don't see anything that would. Can you tell me what download link you were expecting for 'burkinafaso.opendataforafrica.org/ajcalpd/…'?
I want the download link of this dataset: drive.google.com/open?id=10dD8RCHQjR6gBPZeCsktPb3vr-DL1V2- When you follow the link, there is a button on the right that lets you download the table: drive.google.com/open?id=1gvg2O49Dv0qYLvHF6a0OOz4THdVfK2HD