Scraping an XML document (nested URL structure)

Question

I have a problem scraping information from a particular XML document (http://www.bundestag.de/xml/mdb/index.xml).

<mdbUebersicht>
<dokumentInfo>
<dokumentURL/>
<dokumentStand/>
</dokumentInfo>
<deleteRestore>
<deleteFlag>0</deleteFlag>
<deleteDate>20131202170000</deleteDate>
</deleteRestore>
<mdbs>
<mdb fraktion="Die Linke">
<mdbID status="Aktiv">1627</mdbID>
<mdbName status="Aktiv">Aken, Jan van</mdbName>
<mdbBioURL>
http://www.bundestag.de/abgeordnete18/biografien/A/aken_jan/258124
</mdbBioURL>
<mdbInfoXMLURL>
http://www.bundestag.de/xml/mdb/biografien/A/aken_jan.xml
</mdbInfoXMLURL>
<mdbInfoXMLURLMitmischen>/biografien/A/aken_jan.xml</mdbInfoXMLURLMitmischen>
<mdbLand>Hamburg</mdbLand>
<mdbFotoURL>
http://www.bundestag.de/blueprint/servlet/image/240714/Hochformat__2x3/177/265/83abda4f387877a2b5eeedbfd81e8eba/Yc/aken_jan_gross.jpg
</mdbFotoURL>
<mdbFotoGrossURL>
http://www.bundestag.de/blueprint/servlet/image/240714/Hochformat__2x3/316/475/83abda4f387877a2b5eeedbfd81e8eba/Uq/aken_jan_gross.jpg
</mdbFotoGrossURL>
<mdbFotoLastChanged>24.10.2016</mdbFotoLastChanged>
<mdbFotoChangedDateTime>24.10.2016 12:17</mdbFotoChangedDateTime>
<lastChanged>30.09.2016</lastChanged>
<changedDateTime>30.09.2016 12:38</changedDateTime>
</mdb>

The document contains short biographical entries for many different people. Among other things, each entry contains a URL to another XML document that holds a more detailed biography.

I tried the following to get the information:

First, I get the URLs of all the sub-documents from the main document:

mdb_url <- xml_text(xml_find_all(xmlDocu, "//mdbInfoXMLURL"))

Then I wrote a for loop that downloads every XML file into my working directory:

for (url in mdb_url) {
  download.file(url, destfile = basename(url))
}

Afterwards I want to get a list of the files...

files <- list.files(pattern = "\\.xml$")  # pattern is a regex: escape the dot, anchor the end

... and extract a specific node from each XML document:

Bio1 <- files[1]

xmlfile <- read_xml(Bio1)

mdb_ausschuss1 <- xml_text(xml_find_all(xmlfile, "//gremiumName"))

My problem now is: how can I do this for every XML file in the list? I haven't been able to write a working loop or script for that task...

hrbrmstr · Accepted Answer · 2017-01-16 15:34:46Z


library(xml2)
library(httr)
library(rvest)
library(tools)
library(tidyverse)

Get the URL list from the main site XML

URL <- "http://www.bundestag.de/xml/mdb/index.xml"
doc <- read_xml(URL)
xml_find_all(doc, "//mdbInfoXMLURL") %>% xml_text() -> mdb_urls
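In the excerpt in the question, each URL sits on its own line inside its element, so the extracted text may carry line breaks and indentation. A minimal sketch of trimming it with the `trim` argument of `xml_text()`; the inline document and the `example.org` URL are made up for illustration, and whether the live file really contains that whitespace is an assumption:

```r
library(xml2)

# Made-up two-entry excerpt that mimics the layout shown in the question
demo_doc <- read_xml('<mdbs>
  <mdb><mdbInfoXMLURL>
    http://www.bundestag.de/xml/mdb/biografien/A/aken_jan.xml
  </mdbInfoXMLURL></mdb>
  <mdb><mdbInfoXMLURL>http://example.org/other.xml</mdbInfoXMLURL></mdb>
</mdbs>')

nodes <- xml_find_all(demo_doc, "//mdbInfoXMLURL")

raw_urls   <- xml_text(nodes)               # keeps line breaks and indentation
clean_urls <- xml_text(nodes, trim = TRUE)  # strips leading/trailing whitespace

basename(clean_urls)                        # file names usable as destination paths
```

Trimming before calling `basename()` keeps stray newlines out of the destination file names.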

Create a place to store them:

dir.create("docs")

Write them to disk (I’m only grabbing 10 of them since I don’t need the data, you do :-)

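Note that write_disk() will not overwrite an existing file unless told to (overwrite = TRUE), so this is a handy way to do poor-man's caching. The flip side is that the call raises an error when the file already exists, so if you place this in a script that gets re-run, you'll have to wrap it in try()/tryCatch().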

walk(mdb_urls[1:10], ~GET(., write_disk(file.path("docs", basename(.)))))
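One possible way to do the try/catch wrapping mentioned above is sketched here. `cached_download()` and its `fetch` argument are hypothetical names introduced for illustration, not part of httr; the default `fetch` simply calls `GET()` with `write_disk()` as in the line above.

```r
# Hypothetical helper: download `url` into `dir` unless the file is already there.
# `fetch` is an assumption added so the download step can be swapped out or tested.
cached_download <- function(url, dir = "docs",
                            fetch = function(u, path) httr::GET(u, httr::write_disk(path))) {
  url  <- trimws(url)                      # element text may carry surrounding whitespace
  path <- file.path(dir, basename(url))

  if (file.exists(path)) return(invisible(path))  # cache hit: skip the request

  ok <- tryCatch({
    fetch(url, path)
    TRUE
  }, error = function(e) {
    message("Failed: ", url, " (", conditionMessage(e), ")")
    FALSE
  })

  invisible(if (ok) path else NA_character_)      # NA marks a failed download
}
```

With the directory already created, something like `paths <- vapply(mdb_urls[1:10], cached_download, character(1))` would then download only the files that are still missing and keep going after a failed request.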

Get the file list:

fils <- list.files("docs", pattern="\\.xml$", full.names=TRUE)
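`list.files()` treats `pattern` as a regular expression, not a shell glob, which matters when other files share the directory. A small sketch of the difference, using made-up file names:

```r
# Made-up file names for illustration
candidates <- c("aken_jan.xml", "notxml.txt", "aken_jan.xml.bak")

# Unanchored, and "." matches any character, so all three names match
loose <- grepl(".*.xml", candidates)
loose    # TRUE TRUE TRUE

# Literal dot, anchored at the end of the name
strict <- grepl("\\.xml$", candidates)
strict   # TRUE FALSE FALSE

# glob2rx() (from utils) translates a glob into an equivalent regex
globbed <- grepl(glob2rx("*.xml"), candidates)
globbed  # TRUE FALSE FALSE
```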

Turn it into a data frame:

pb <- progress_estimated(length(fils)) # use a progress bar
map_df(fils, function(x) {

  pb$tick()$print() # increment the progress bar

  gremium_doc <- read_xml(x) # read in the file

  # find all the `gremiumName`s. If there are none, make the value `NA`
  xml_find_all(gremium_doc, "//gremiumName") %>% xml_text() -> g_names
  if (length(g_names) == 0) g_names <- NA_character_

  # make a tidy data frame
  data_frame(gremium=file_path_sans_ext(basename(x)), name=g_names)

}) -> df

Prove it works

glimpse(df)
## Observations: 33
## Variables: 2
## $ gremium <chr> "aken_jan", "aken_jan", "aken_jan", "aken_jan", "alban...
## $ name    <chr> "Auswärtiger Ausschuss", "Gremium nach § 23c Absatz 8 ...
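The extraction logic can also be checked offline. The sketch below writes two made-up biography files to a temporary directory and applies the same "NA when there is no gremiumName" rule; it swaps `map_df()` for base R's `lapply()` plus `rbind()`, so only xml2 is assumed to be installed. The file names and contents are invented for illustration.

```r
library(xml2)

# Two made-up biography files: one with committees, one without
tmp <- file.path(tempdir(), "gremium_demo")
dir.create(tmp, showWarnings = FALSE)
writeLines("<mdb><gremiumName>A</gremiumName><gremiumName>B</gremiumName></mdb>",
           file.path(tmp, "member_one.xml"))
writeLines("<mdb><mdbName>Nobody</mdbName></mdb>",
           file.path(tmp, "member_two.xml"))

demo_files <- list.files(tmp, pattern = "\\.xml$", full.names = TRUE)

rows <- lapply(demo_files, function(x) {
  # all gremiumName values in this file; NA if the file has none
  g_names <- xml_text(xml_find_all(read_xml(x), "//gremiumName"))
  if (length(g_names) == 0) g_names <- NA_character_

  # one row per name, keyed by the file name without its extension
  data.frame(gremium = tools::file_path_sans_ext(basename(x)),
             name = g_names, stringsAsFactors = FALSE)
})

demo_df <- do.call(rbind, rows)  # stack the per-file data frames
demo_df
```

The member without any committee still gets one row, with `NA` in the `name` column, so no file silently disappears from the result.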


2 Comments

Peter Noah Over a year ago

Maybe you could help me one last time: how could I easily turn the data structure from long to wide in this case? That is, every gremium ("Auswärtiger Ausschuss", "Wirtschafts Ausschuss", etc.) would get its own column with the value "true" or "not". I tried the reshape command but I can't get there.

hrbrmstr Over a year ago

Totally separate question (SO rules/guidelines suggest that should be a new question).
