
I have a problem scraping information from an XML document (http://www.bundestag.de/xml/mdb/index.xml). The beginning of the document looks like this:

<mdbUebersicht>
<dokumentInfo>
<dokumentURL/>
<dokumentStand/>
</dokumentInfo>
<deleteRestore>
<deleteFlag>0</deleteFlag>
<deleteDate>20131202170000</deleteDate>
</deleteRestore>
<mdbs>
<mdb fraktion="Die Linke">
<mdbID status="Aktiv">1627</mdbID>
<mdbName status="Aktiv">Aken, Jan van</mdbName>
<mdbBioURL>
http://www.bundestag.de/abgeordnete18/biografien/A/aken_jan/258124
</mdbBioURL>
<mdbInfoXMLURL>
http://www.bundestag.de/xml/mdb/biografien/A/aken_jan.xml
</mdbInfoXMLURL>
<mdbInfoXMLURLMitmischen>/biografien/A/aken_jan.xml</mdbInfoXMLURLMitmischen>
<mdbLand>Hamburg</mdbLand>
<mdbFotoURL>
http://www.bundestag.de/blueprint/servlet/image/240714/Hochformat__2x3/177/265/83abda4f387877a2b5eeedbfd81e8eba/Yc/aken_jan_gross.jpg
</mdbFotoURL>
<mdbFotoGrossURL>
http://www.bundestag.de/blueprint/servlet/image/240714/Hochformat__2x3/316/475/83abda4f387877a2b5eeedbfd81e8eba/Uq/aken_jan_gross.jpg
</mdbFotoGrossURL>
<mdbFotoLastChanged>24.10.2016</mdbFotoLastChanged>
<mdbFotoChangedDateTime>24.10.2016 12:17</mdbFotoChangedDateTime>
<lastChanged>30.09.2016</lastChanged>
<changedDateTime>30.09.2016 12:38</changedDateTime>
</mdb>

The document contains short biographical entries for a number of people. Among other things, it contains URLs to other XML documents that hold a more detailed biography for each person.

This is what I have tried so far.

First, I get the URLs of all the sub-documents from the main document:

mdb_url <- xml_text(xml_find_all(xmlDocu, "//mdbInfoXMLURL"))

Then I wrote a for loop that downloads every XML file into my working directory:

for (url in mdb_url) {
  download.file(url, destfile = basename(url))
}

Afterwards I get a list of the files...

files <- list.files(pattern = "\\.xml$") # escape the dot and anchor at the end of the name

... so that I can extract a specific node from each XML document. For the first file this works:

Bio1 <- files[1]

xmlfile <- read_xml(Bio1)

mdb_ausschuss1 <- xml_text(xml_find_all(xmlfile, "//gremiumName"))

My problem: how can I do this for all the XML files in the list? I haven't been able to write a working loop or script for that task.

1 Answer

library(xml2)
library(httr)
library(rvest)
library(tools)
library(tidyverse)

Get the URL list from the main index XML:

URL <- "http://www.bundestag.de/xml/mdb/index.xml"
doc <- read_xml(URL)
xml_find_all(doc, "//mdbInfoXMLURL") %>% xml_text() -> mdb_urls

Create a place to store them:

dir.create("docs")

Write them to disk (I’m only grabbing 10 of them since I don’t need the data, you do :-)

Note that write_disk() will not overwrite an existing file unless told to (it throws an error instead), so this is a great way to do poor-man’s caching. If you put this in a script that you rerun, you'll have to wrap the call in try/catch so that files already on disk don't stop the run.

walk(mdb_urls[1:10], ~GET(., write_disk(file.path("docs", basename(.)))))
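One way the try/catch wrapping could look is sketched below. `fetch_once()` is a hypothetical helper name, not part of httr, and the choice to skip existing files and to delete the target after a failed request is an assumption about what you'd want, not the only sensible behaviour.

```r
library(httr)

# Hypothetical helper: download `url` into `dir` unless the file is already there.
# Returns the local path, or NA_character_ if the download failed.
fetch_once <- function(url, dir = "docs") {
  path <- file.path(dir, basename(url))

  # Already on disk: treat it as cached and do not touch the network
  if (file.exists(path)) return(invisible(path))

  tryCatch({
    resp <- GET(url, write_disk(path))
    stop_for_status(resp)  # turn HTTP errors (404, 500, ...) into R errors
    invisible(path)
  }, error = function(e) {
    unlink(path)  # drop any partial file or saved error page so it is not "cached"
    message("Failed: ", url, " (", conditionMessage(e), ")")
    invisible(NA_character_)
  })
}
```

With that in place, the download line becomes `walk(mdb_urls[1:10], fetch_once)`, and rerunning the script only fetches what is missing.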

Get the file list:

fils <- list.files("docs", pattern="\\.xml$", full.names=TRUE)

Turn it into a data frame:

pb <- progress_estimated(length(fils)) # use a progress bar
map_df(fils, function(x) {

  pb$tick()$print() # increment the progress bar

  gremium_doc <- read_xml(x) # read in the file

  # find all the `gremiumName`s. If there are none, make the value `NA`
  xml_find_all(gremium_doc, "//gremiumName") %>% xml_text() -> g_names
  if (length(g_names) == 0) g_names <- NA_character_

  # make a tidy data frame
  data_frame(gremium=file_path_sans_ext(basename(x)), name=g_names)

}) -> df

Prove it works:

glimpse(df)
## Observations: 33
## Variables: 2
## $ gremium <chr> "aken_jan", "aken_jan", "aken_jan", "aken_jan", "alban...
## $ name    <chr> "Auswärtiger Ausschuss", "Gremium nach § 23c Absatz 8 ...
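If you want to check the `NA` fallback without downloading anything, here is a small offline sketch. The two XML strings are made-up stand-ins (I'm assuming only that the real files contain `gremiumName` elements somewhere), and `committee_names()` is a hypothetical name for the extraction step inside the loop above.

```r
library(xml2)

# Made-up stand-ins for two downloaded biography files (not real Bundestag data)
with_committees <- read_xml(
  "<mdb>
     <gremium><gremiumName>Ausschuss A</gremiumName></gremium>
     <gremium><gremiumName>Ausschuss B</gremiumName></gremium>
   </mdb>"
)
without_committees <- read_xml("<mdb><mdbName>Nobody</mdbName></mdb>")

# Hypothetical helper mirroring the extraction step of the loop
committee_names <- function(doc) {
  g_names <- xml_text(xml_find_all(doc, "//gremiumName"))
  # No matches gives character(0); swap in NA so the person still gets a row
  if (length(g_names) == 0) g_names <- NA_character_
  g_names
}

names_found   <- committee_names(with_committees)     # two committee names
names_missing <- committee_names(without_committees)  # a single NA
```

The fallback matters because a zero-length `name` column would otherwise make that person drop out of (or break) the combined data frame.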

2 Comments

Maybe you could help me one last time: how could I easily turn the data structure from long to wide in this case? That is, every gremium ("Auswärtiger Ausschuss", "Wirtschafts Ausschuss", etc.) should become its own column holding "true" or "not". I tried the reshape command but I can't get there.
That's a totally separate question (SO rules/guidelines suggest it should be a new question).
