XML in R: extracting XML values from node sets

Question

I am trying to extract certain xml values out of a (pretty large) document. Because I am only interested in some nodes, I created subsets.

library(XML)
data.raw <- xmlParse(file="in/data.xml", encoding="UTF-8")
data.top <- xmlRoot(data.raw)
subset.wkr67 <-  getNodeSet(doc=data.top, "//wahl[@jahr='13']/gebiet[@schluessel='67']/wvt")

The last object looks like this (fyi, these are election results with absolute vote counts for certain districts):

[[1]]
<wvt kurz="CDU" lang="Christlich Demokratische Union Deutschlands in Niedersachsen" button="CDU">
    <ergebnis kurz="STWVT" lang="Zweitstimmen">
        <stimmen>21478</stimmen>
        <farbe>#0033CC</farbe>
        <prozent>57.6</prozent>
    </ergebnis>
    <ergebnis kurz="STKAND" lang="Erststimmen">
        <stimmen>25835</stimmen>
        <farbe>#0033CC</farbe>
        <prozent>69.4</prozent>
    </ergebnis>
</wvt>

[[2]]
...   

attr(,"class")
[1] "XMLNodeSet"

I want to extract the absolute vote counts for the different tiers and save them in separate objects. As far as I understand, this should be possible with xmlValue and sapply.

To extract the value of the "stimmen" element that is a child of the "ergebnis" element with the attribute kurz="STWVT" (in my example: 21478), I tried this:

sapply(subset.wkr67, xmlValue, '/wvt/ergebnis[@kurz="STWVT"]/stimmen') 
[1] "21478#0033CC57.625835#0033CC69.4" "6640#FFDFDF17.86308#FFDFDF17.0"   "4682#99990012.61410#FFFF993.8"    "2663#CCFFCC7.11888#CCFFCC5.1"    
[5] "708#C979E31.9848#B953EC2.3"       "220.1"                            "3731.0"                           "830.2"                           
[9] "2140.6"                           "1520.4"                           "1220.3"                           "542#F5A5541.5541#F5A5541.5"      
[13] "593#ECF0EC1.6373#ECF0EC1.0"

This extracts far too much information: each element of the result is basically the values of ALL child elements pasted together. (The length of 13 is correct and matches the data.) If I add the option "recursive=FALSE" to the call, I get a vector of the same length that contains only characters.

How can I extract only the first value of the "stimmen" element? (21478 in my case) Thanks for your help!

Dieter Menne · Accepted Answer · 2014-03-12 11:09:27Z

3

Assuming the XML file contains only the data shown (plus the XML header), try this:

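library(XML)
# Parse the file; the paths below assume <wvt> is the root element of wahl.xml
doc <- xmlParseDoc("wahl.xml")
# Attributes (kurz, lang) of every <ergebnis> node
xpathSApply(doc, "/wvt/ergebnis", xmlAttrs)
# Vote counts as character strings, one per <ergebnis> node
xpathSApply(doc, "/wvt/ergebnis/stimmen", xmlValue)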

A conversion to a data frame should follow, so that each vote count is labelled with the descriptors of its vote set.
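One possible way to do that conversion, and to work directly on a node set like the one in the question, is sketched below. It evaluates a relative XPath (starting with "./") on each <wvt> node, so only the "stimmen" value of the requested tier is returned. The inline sample, the party labels "P2" and "P3", the third entry, and the helper name votes_for are assumptions made for illustration; the vote counts for the first two entries are taken from the question.

```r
library(XML)

# Inline sample mirroring the structure shown in the question.
# "P2" and "P3" are placeholder party labels, and the third <wvt>
# (only one tier present) is a hypothetical entry.
xml_text <- '<gebiet schluessel="67">
  <wvt kurz="CDU">
    <ergebnis kurz="STWVT"><stimmen>21478</stimmen><prozent>57.6</prozent></ergebnis>
    <ergebnis kurz="STKAND"><stimmen>25835</stimmen><prozent>69.4</prozent></ergebnis>
  </wvt>
  <wvt kurz="P2">
    <ergebnis kurz="STWVT"><stimmen>6640</stimmen><prozent>17.8</prozent></ergebnis>
    <ergebnis kurz="STKAND"><stimmen>6308</stimmen><prozent>17.0</prozent></ergebnis>
  </wvt>
  <wvt kurz="P3">
    <ergebnis kurz="STWVT"><stimmen>22</stimmen><prozent>0.1</prozent></ergebnis>
  </wvt>
</gebiet>'

doc <- xmlParse(xml_text, asText = TRUE)
wvt_nodes <- getNodeSet(doc, "//wvt")

# Hypothetical helper: vote count of one tier for a single <wvt> node.
# The path is relative to the node, so other <wvt> nodes are not searched.
votes_for <- function(node, tier) {
  path <- sprintf("./ergebnis[@kurz='%s']/stimmen", tier)
  value <- xpathSApply(node, path, xmlValue)
  # Return NA when the tier is missing for this party
  if (length(value) == 0) NA_real_ else as.numeric(value[[1]])
}

# One row per <wvt> node, one column per tier
result <- data.frame(
  partei       = sapply(wvt_nodes, xmlGetAttr, "kurz"),
  zweitstimmen = sapply(wvt_nodes, votes_for, tier = "STWVT"),
  erststimmen  = sapply(wvt_nodes, votes_for, tier = "STKAND"),
  stringsAsFactors = FALSE
)
print(result)
```

With the node set from the question, sapply(subset.wkr67, votes_for, tier = "STWVT") should give the second-vote counts in the same way, provided the nodes come from a document parsed with xmlParse.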
