Getting uneven elements in XML file into dataframe

Question

I have been trying to get an XML file into a dataframe but am struggling, I have tried a few approaches and this is where I am at.

My XML file looks like 20k segments of this:

<?xml version="1.0"?>
<data experimentId="5244" savingTime="2018-01-06T14:25:48-0500" eventType="Workflow" userId="303">
  <root>
    <set id="ASSAY_WORKFLOW">
      <row state="MODIFIED" pk="5905_Standard_Validation_Standard_Validation">
        <field name="ASSAY_ID">5244</field>
        <field name="WORKFLOW_ID">5905_Standard_Validation_Standard_Validation</field>
        <field name="WORKFLOW_STATE">0</field>
        <field name="ASSAY_WORKFLOW_STATE">InDelegation</field>
        <field name="WORKFLOW_LAST_STEP_ID">17896</field>
      </row>
    </set>
    <set id="WORKFLOW_STEPS">
      <row state="NEW" pk="17896">
        <field name="STEP_ID">17896</field>
        <field name="WORKFLOW_ID">5905_Standard_Validation_Standard_Validation</field>
        <field name="STEP_DATE">2018-01-06T14:25:45-0500</field>
        <field name="STEP_DATE_TZ">America/New_York</field>
        <field name="USER_ID">303</field>
        <field name="USER_FULL_NAME">Ron Swanson</field>
        <field name="NEW_WORKFLOW_ASSAY_STATE">InDelegation</field>
        <field name="FORMER_WORKFLOW_ASSAY_STATE">Draft</field>
        <field name="ROLE_ID">1</field>
      </row>
    </set>
    <set id="WORKFLOW_STEP_VARIABLES">
      <row state="NEW" pk="17896&#xA4;nextActorId">
        <field name="STEP_ID">17896</field>
        <field name="VARIABLE_ID">nextActorId</field>
        <field name="VALUE">2</field>
      </row>
      <row state="NEW" pk="17896&#xA4;validateToPendingValidation">
        <field name="STEP_ID">17896</field>
        <field name="VARIABLE_ID">validateToPendingValidation</field>
        <field name="VALUE">false</field>
      </row>
      <row state="NEW" pk="17896&#xA4;signToPendingSignature">
        <field name="STEP_ID">17896</field>
        <field name="VARIABLE_ID">signToPendingSignature</field>
        <field name="VALUE">false</field>
      </row>
      <row state="NEW" pk="17896&#xA4;comment">
        <field name="STEP_ID">17896</field>
        <field name="VARIABLE_ID">comment</field>
        <field name="VALUE">GH-VAP, IgG1 repeats,</field>
      </row>
      <row state="NEW" pk="17896&#xA4;actionDelegateU">
        <field name="STEP_ID">17896</field>
        <field name="VARIABLE_ID">actionDelegateU</field>
        <field name="VALUE">directDelegateU</field>
      </row>
    </set>
    <set id="WORKFLOW_ROLE_NAMES">
      <row state="NEW" pk="1">
        <field name="ROLE_ID">1</field>
        <field name="LANGUAGE_ID">2</field>
        <field name="DESCRIPTION">Author</field>
      </row>
    </set>
  </root>
</data>

For each root node there are child elements that have the same tag "field" with attribute "name". The values of which identify the value and name of the column I want it under in my data frame.

I can get everything out with this:

library(XML)

xmlfilealt <- xmlParse("data/eln_audit_workflow.xml")

username <- xpathSApply(xmlfilealt, "//field[@name='USER_FULL_NAME']", xmlValue)
title <- xpathSApply(xmlfilealt, "//field[@name='VALUE']", xmlValue)
state <- xpathSApply(xmlfilealt, "//field[@name='ASSAY_WORKFLOW_STATE']", xmlValue)
actionDate <- xpathSApply(xmlfilealt, "//field[@name='STEP_DATE']", xmlValue)
actor <- xpathSApply(xmlfilealt, "//field[@name='DESCRIPTION']", xmlValue)

I planned to create a data.frame with them but the vectors are all slighty different lengths which I assume is because there are probably some missing elements in some of the root nodes. Can someone clue me in on how to handle this?

Thanks

Parfait · Accepted Answer · 2018-02-09 16:19:03Z

2

For children elements that may or may not be present, consider iterating across the parent node, here being <row>, by node position. Then, build lists of dataframes casting each column to the needed value or zero length string using XPath's concat to always return a result for equal length columns. Finally rbind all dataframes of list for final.

row_length <- length(xpathSApply(xmlfilealt, "//row"))

df_List <- lapply(seq(row_length), function(i){

  data.frame(
    username = xpathSApply(xmlfilealt, sprintf("concat(//row[%s]/field[@name='USER_FULL_NAME'],'')", i), xmlValue),
    title = xpathSApply(xmlfilealt, sprintf("concat(//row[%s]/field[@name='VALUE'],'')", i), xmlValue),
    state = xpathSApply(xmlfilealt, sprintf("concat(//row[%s]/field[@name='ASSAY_WORKFLOW_STATE'],'')", i), xmlValue),
    actionDate = xpathSApply(xmlfilealt, sprintf("concat(//row[%s]/field[@name='STEP_DATE'],'')", i), xmlValue),
    actor = xpathSApply(xmlfilealt, sprintf("concat(//row[%s]/field[@name='DESCRIPTION'],'')", i), xmlValue),
    stringsAsFactors = FALSE
  )

})

# CONCATENATE ALL DFs 
finaldf <- do.call(rbind, df_List)

# CONVERT ZERO-LENGTH STRINGS TO NA
finaldf[] <- sapply(finaldf, function(col) ifelse(col=='', NA, col))

finaldf
#      username                 title        state               actionDate  actor
# 1 Ron Swanson                     2 InDelegation 2018-01-06T14:25:45-0500 Author
# 2        <NA>                 false         <NA>                     <NA>   <NA>
# 3        <NA>                 false         <NA>                     <NA>   <NA>
# 4        <NA> GH-VAP, IgG1 repeats,         <NA>                     <NA>   <NA>
# 5        <NA>       directDelegateU         <NA>                     <NA>   <NA>
# 6        <NA>                  <NA>         <NA>                     <NA>   <NA>
# 7        <NA>                  <NA>         <NA>                     <NA>   <NA>
# 8        <NA>                  <NA>         <NA>                     <NA>   <NA>

answered Feb 9, 2018 at 16:19

Parfait

108k19 gold badges103 silver badges138 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

r3vdev Over a year ago

Thanks Parfait! Your answer is easier to understand and tought me some new things, but I choose Dave2e answer because I started your code at 9PM, went to bed at 11PM, and it finished some time between then and 8AM :P! So it worked great but it took a while. What do you use to reformat the XML by the way? I tried using notepad++.

Parfait Over a year ago

Very interesting this took hours and xml2 took minutes! Had you gone with xml2 I would have advised transforming your XML into simpler format using XSLT and then import into dataframe with xml2's extension, xslt .

Dave2e · Accepted Answer · 2018-02-09 15:49:18Z

1

This xml is very inconsistent and difficult to parse out in a consistent manner. I prefer to use the xml2 package, since I find the syntax easier to use.

library(xml2)

# parse all of the root nodes into separate nodes
rootnodes<-xml_find_all(page, "root")

# read the desired fields from each individual root nodes
a<-sapply(rootnodes, function(xnode) { xml_text(xml_find_first(xnode, "set/row/field[@name='ASSAY_WORKFLOW_STATE']"))})
b<-sapply(rootnodes, function(xnode) { xml_text(xml_find_first(xnode, "set/row/field[@name='STEP_DATE']"))})
c<-sapply(rootnodes, function(xnode) { xml_text(xml_find_first(xnode, "set/row/field[@name='USER_FULL_NAME']"))})
d<-sapply(rootnodes, function(xnode) { xml_text(xml_find_first(xnode, "set/row/field[@name='DESCRIPTION']"))})

#Create the desired output
df=data.frame(assaystate = a, stepdate=b, name = c, description = d)

The advantage of this approach is the expectation of each root node should contain each desired field. xml2's xml_find_first will return NA if that field/node is missing thus all the results will be the same size.

I left out the value field since there were at least fields with the value attribute and it was not clear if one or all of those values were required.

answered Feb 9, 2018 at 15:49

Dave2e

24.3k18 gold badges46 silver badges57 bronze badges

1 Comment

r3vdev Over a year ago

"This xml is very inconsistent and difficult to parse out in a consistent manner." Well that makes me feel better :P. Thanks Dave2e, this worked and only took a couple minutes to run.

Collectives™ on Stack Overflow

Getting uneven elements in XML file into dataframe

2 Answers 2

2 Comments

1 Comment

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

2 Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Linked

Related