I am trying to parse the nodes and attributes of an XML file. The file contains a set of nested nodes with attributes, and I want to parse this nested structure into a data frame.

Here is an example file:

<?xml version="1.0" encoding="UTF-8"?>
<TrackMate version="3.8.0">
  <Model spatialunits="µm" timeunits="sec">
    <AllTracks>
      <Track name="Track_2" TRACK_ID="2" NUMBER_SPOTS="140" NUMBER_GAPS="0" >
        <Edge SPOT_SOURCE_ID="960769" SPOT_TARGET_ID="960778" LINK_COST="0.08756957830926632" />
        <Edge SPOT_SOURCE_ID="958304" SPOT_TARGET_ID="958308" LINK_COST="1.4003359672950089" />
        <Edge SPOT_SOURCE_ID="958316" SPOT_TARGET_ID="958322" LINK_COST="1.6985623204008202" />
      </Track>
      <Track name="Track_145" TRACK_ID="145" NUMBER_SPOTS="141" NUMBER_GAPS="0" >
        <Edge SPOT_SOURCE_ID="961623" SPOT_TARGET_ID="961628" LINK_COST="2.2678642015413755" />
        <Edge SPOT_SOURCE_ID="962122" SPOT_TARGET_ID="962127" LINK_COST="38.20777704254654" />
        <Edge SPOT_SOURCE_ID="961869" SPOT_TARGET_ID="961873" LINK_COST="0.2895609647324684" />
      </Track>
    </AllTracks>
  </Model>
</TrackMate>

I would like to create a data frame with all attributes of the edges plus the parent's TRACK_ID attribute. I can readily create a data frame with all the edges' attributes like this:

edges = data.frame(t(data.frame(xml_attrs(xml_find_all(xmlDoc, xpath = paste0('/TrackMate/Model/AllTracks//Edge'))))))
row.names(edges) = NULL

But then the corresponding track ID is lost. I can solve this with a for loop, but that is often not the "R way". Is there a simpler solution (e.g. with an XPath query)?

So the final desired output would be a data frame with one row per edge, holding the edge's attributes together with the TRACK_ID of its parent track.

Edit: this comes closer, but then the Track nodes and Edge nodes are mixed within one list.

xml_find_all(xmlDoc, xpath = paste0('/TrackMate/Model/AllTracks//Edge | /TrackMate/Model/AllTracks/Track'))
  • This is not very hard using XPath, but please add the desired output to your question. Commented Feb 11, 2019 at 14:15
  • Sorry, I added the desired output. Commented Feb 11, 2019 at 14:22

1 Answer

The 'trick' is to get a list of all the Edge nodes and work with XPath from there. You can select the Track node from each Edge node using XPath's ancestor axis.
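To make the ancestor axis concrete, here is a minimal sketch on a tiny made-up document (the `mini` object and its attribute values are hypothetical, not taken from the question). It relies on `xml_find_first()` returning one result per input node, so the track IDs line up with the edges.

```r
library(xml2)

# Tiny hypothetical document, used only to illustrate the ancestor axis
mini <- read_xml('<AllTracks>
  <Track TRACK_ID="7"><Edge LINK_COST="0.5"/></Track>
  <Track TRACK_ID="9"><Edge LINK_COST="1.5"/><Edge LINK_COST="2.5"/></Track>
</AllTracks>')

edges <- xml_find_all(mini, ".//Edge")

# For each Edge, step up to its enclosing Track and read TRACK_ID.
# xml_find_first() on a nodeset returns one node per input node.
track_ids <- xml_attr(xml_find_first(edges, "ancestor::Track"), "TRACK_ID")

# Edge is a direct child of Track here, so the parent axis gives the same IDs
parent_ids <- xml_attr(xml_find_first(edges, "parent::Track"), "TRACK_ID")

print(track_ids)
```

The ancestor axis keeps working if Edge nodes are nested more deeply below Track, while the parent axis only matches a direct parent.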

libraries used

#load libraries
library( xml2 )
library( magrittr )

sample data

doc <- read_xml('<?xml version="1.0" encoding="UTF-8"?>
<TrackMate version="3.8.0">
  <Model spatialunits="µm" timeunits="sec">
    <AllTracks>
      <Track name="Track_2" TRACK_ID="2" NUMBER_SPOTS="140" NUMBER_GAPS="0" >
        <Edge SPOT_SOURCE_ID="960769" SPOT_TARGET_ID="960778" LINK_COST="0.08756957830926632" />
        <Edge SPOT_SOURCE_ID="958304" SPOT_TARGET_ID="958308" LINK_COST="1.4003359672950089" />
        <Edge SPOT_SOURCE_ID="958316" SPOT_TARGET_ID="958322" LINK_COST="1.6985623204008202" />
      </Track>
      <Track name="Track_145" TRACK_ID="145" NUMBER_SPOTS="141" NUMBER_GAPS="0" >
        <Edge SPOT_SOURCE_ID="961623" SPOT_TARGET_ID="961628" LINK_COST="2.2678642015413755" />
        <Edge SPOT_SOURCE_ID="962122" SPOT_TARGET_ID="962127" LINK_COST="38.20777704254654" />
        <Edge SPOT_SOURCE_ID="961869" SPOT_TARGET_ID="961873" LINK_COST="0.2895609647324684" />
      </Track>
    </AllTracks>
  </Model>
</TrackMate>')

code

#find all Edge nodes
edge.nodes <- xml_find_all( doc, ".//Edge" )
#build the data.frame; for each Edge, the ancestor axis selects its enclosing Track
data.frame( TRACK_ID = xml_find_first( edge.nodes, "ancestor::Track" ) %>% xml_attr("TRACK_ID"),
            SPOT_SOURCE_ID = edge.nodes %>% xml_attr("SPOT_SOURCE_ID"),
            SPOT_TARGET_ID = edge.nodes %>% xml_attr("SPOT_TARGET_ID"),
            LINK_COST = edge.nodes %>% xml_attr("LINK_COST") )

output

#   TRACK_ID SPOT_SOURCE_ID SPOT_TARGET_ID           LINK_COST
# 1        2         960769         960778 0.08756957830926632
# 2        2         958304         958308  1.4003359672950089
# 3        2         958316         958322  1.6985623204008202
# 4      145         961623         961628  2.2678642015413755
# 5      145         962122         962127   38.20777704254654
# 6      145         961869         961873  0.2895609647324684
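If the Edge nodes carry many attributes, naming each one by hand gets tedious. The following is a hedged sketch of one possible variation, not part of the original answer: it collects every Edge attribute with `xml_attrs()` and then converts the text columns to numbers. It assumes every Edge has the same attributes in the same order; the sample is a shortened copy of the data above, and the names `edge_nodes`, `attrs`, and `edges` are illustrative.

```r
library(xml2)

# Shortened copy of the sample data, to keep the sketch self-contained
doc <- read_xml('<AllTracks>
  <Track name="Track_2" TRACK_ID="2">
    <Edge SPOT_SOURCE_ID="960769" SPOT_TARGET_ID="960778" LINK_COST="0.08756957830926632" />
  </Track>
  <Track name="Track_145" TRACK_ID="145">
    <Edge SPOT_SOURCE_ID="961623" SPOT_TARGET_ID="961628" LINK_COST="2.2678642015413755" />
    <Edge SPOT_SOURCE_ID="962122" SPOT_TARGET_ID="962127" LINK_COST="38.20777704254654" />
  </Track>
</AllTracks>')

edge_nodes <- xml_find_all(doc, ".//Edge")

# xml_attrs() gives one named character vector per Edge;
# rbind stacks them into a character matrix (assumes identical attribute sets)
attrs <- do.call(rbind, xml_attrs(edge_nodes))

edges <- data.frame(
  TRACK_ID = xml_attr(xml_find_first(edge_nodes, "ancestor::Track"), "TRACK_ID"),
  attrs,
  stringsAsFactors = FALSE
)

# All attributes are read as text; convert each column to integer/numeric where possible
edges[] <- lapply(edges, type.convert, as.is = TRUE)

str(edges)
```

If some Edge nodes lack an attribute, the `rbind` step would misalign columns, and the explicit `xml_attr()` calls from the answer are the safer choice because they return `NA` for missing attributes.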

2 Comments

@Parfait Good to remember! But since the pipes in my answer make the code more readable (and thus easier to maintain and review, at least for me ;-) ), I am leaving them in.
This is very readable AND allows me now to find even more complicated solutions on my own.
