In R, how can I loop over repeated XML nodes, and save text values in a list?

Question

I'm working with XML files from clinicaltrials.gov, which have a structure like this:

<clinical_study>
  ...
  <brief_title>
  ...
  <location>
    <facility>
      <name>
      <address>
        <city>
        <state>
        <zip>
        <country>
    </facility>
  </location>
  <location>
    ...
  </location>
  ...
</clinical_study>

I'm gathering information from multiple XML files, so the number of locations in each file is unknown and could even be zero. I need to extract all the information about each location and save into an SQL table. I've had some success using functions from the XML package to extract information from single nodes, e.g.

library(XML)
nct_url <- "http://clinicaltrials.gov/ct2/show/NCT00112281?resultsxml=true"
xml_doc <- xmlParse(nct_url, useInternalNodes = TRUE)
title_path <- "/clinical_study/brief_title"
title_text <- xpathSApply(xml_doc, title_path, xmlValue)

I'm experimenting with getNodeSet, and this gives me a set of the right length:

doc <- xmlParse("NCT00007501.xml")
locations <- getNodeSet(doc, "/clinical_study/location")
length(locations)
# [1] 22
class(locations)
# [1] "XMLNodeSet"

but my attempts to extract information from this set have been mostly fruitless. Any suggestions?

userJT · Accepted Answer · 2013-12-17 17:27:21Z

4

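Here is an example that collects the text of every matching node into one "|"-separated string:

 # xml is a document parsed with xmlParse()
 ns <- getNodeSet(xml, '//clinical_results/outcome_list/outcome/analysis_list/analysis/method')
 element_cnt <- length(ns)
 # xmlValue() returns the text content of each node
 strings <- paste(sapply(ns, xmlValue), collapse = "|")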
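The same pattern can be pointed at the <location> nodes from the question, pulling one sub-node at a time instead of the whole concatenated text. This is a minimal sketch, not the answerer's code: the sample document is made up to mirror the structure in the question, and `sample_xml` and `get_field` are hypothetical names introduced for illustration.

```r
library(XML)

# Made-up sample mirroring the structure in the question (not real trial data)
sample_xml <- '<clinical_study>
  <brief_title>Example study</brief_title>
  <location>
    <facility>
      <name>Clinic A</name>
      <address>
        <city>Boston</city>
        <state>Massachusetts</state>
        <country>United States</country>
      </address>
    </facility>
  </location>
  <location>
    <facility>
      <name>Clinic B</name>
      <address>
        <city>Lyon</city>
        <country>France</country>
      </address>
    </facility>
  </location>
</clinical_study>'

doc <- xmlParse(sample_xml, asText = TRUE)
locations <- getNodeSet(doc, "/clinical_study/location")

# Hypothetical helper: text of the first match of a relative XPath under
# `node`, or NA when the element is absent
get_field <- function(node, path) {
  value <- xpathSApply(node, path, xmlValue)
  if (length(value) == 0) NA_character_ else value[[1]]
}

# vapply() returns character(0) when a file has no <location> nodes at all
cities <- vapply(locations, get_field, character(1),
                 path = "./facility/address/city")
states <- vapply(locations, get_field, character(1),
                 path = "./facility/address/state")
```

The leading "./" makes each path relative to the current <location> node, so a missing <state> shows up as NA for that location rather than shifting values between locations.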

answered Dec 17, 2013 at 17:27


Jose R · Accepted Answer · 2014-03-25 04:26:20Z

1

This code will put a subset of nodes that correspond to <location> from a clinical trial into a data frame:

library(XML)
clinicalTrialUrl <- "http://clinicaltrials.gov/ct2/show/NCT01480479?resultsxml=true"
xmlDoc <- xmlParse(clinicalTrialUrl, useInternalNodes = TRUE)
locations <- xmlToDataFrame(getNodeSet(xmlDoc,"//location"))

In this case there are 221 locations. However, xmlToDataFrame assumes a flat structure and lumps sub-nodes together: for example, everything under <facility> is concatenated into a single string. To avoid that, I can go into the sub-nodes and put them into a data frame one by one.
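Going into the sub-nodes one by one could look roughly like the sketch below, which builds one column per sub-node from relative XPath expressions. It assumes a small made-up document rather than the real trial record; `sample_xml`, `fields`, and `first_or_na` are hypothetical names, not part of the XML package.

```r
library(XML)

# Made-up sample with the nested <facility>/<address> structure
sample_xml <- '<clinical_study>
  <location>
    <facility>
      <name>Clinic A</name>
      <address><city>Boston</city><state>Massachusetts</state><country>United States</country></address>
    </facility>
  </location>
  <location>
    <facility>
      <name>Clinic B</name>
      <address><city>Lyon</city><country>France</country></address>
    </facility>
  </location>
</clinical_study>'

xmlDoc <- xmlParse(sample_xml, asText = TRUE)
locations <- getNodeSet(xmlDoc, "//location")

# One relative XPath per output column (column names are illustrative)
fields <- c(
  name    = "./facility/name",
  city    = "./facility/address/city",
  state   = "./facility/address/state",
  country = "./facility/address/country"
)

# Hypothetical helper: first matching text under `node`, or NA if missing
first_or_na <- function(node, path) {
  value <- xpathSApply(node, path, xmlValue)
  if (length(value) == 0) NA_character_ else value[[1]]
}

# Build each column across all locations, then assemble the data frame
columns <- lapply(fields, function(p) {
  vapply(locations, first_or_na, character(1), path = p)
})
location_df <- as.data.frame(columns, stringsAsFactors = FALSE)
```

Building the data frame column by column keeps the column set fixed even when a file has zero locations, which matters if the rows are later appended to an SQL table.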

edited Mar 25, 2014 at 4:26

answered Mar 25, 2014 at 3:05


1 Comment

Jose R Over a year ago

There is a problem with the above solution, in which sub-nodes get concatenated into single long strings. I asked the question in another thread, and it got answered. One effective solution is to flatten the structure of the XML to properly fit tables. Here is the thread: stackoverflow.com/questions/22625960/…

agstudy · Accepted Answer · 2018-04-02 20:56:43Z

1

I don't understand why you don't use xpathSApply again to retrieve the locations, as you already did for the titles.

xpathSApply(xml_doc, "//clinical_study/location" , xmlValue)

edited Apr 2, 2018 at 20:56

answered Dec 17, 2013 at 0:02


4 Comments

Brian Doherty Over a year ago

The <location> node contains sub-nodes, and xmlValue just removes the tags and concatenates all the values together w/o a separator. If all I wanted was the <location><title></title></location> node, your suggestion is good and would return a list of all the location titles. Leaving off the "xmlValue" from the end of your suggestion gives me a list of XML fragments, but XML functions like xmlParse expect a file. Not sure how to process an XML fragment.

Brian Doherty Over a year ago

I might have answered my own question ... with your help. If I save the set of XML fragments in a list, then it seems I can again use xpathSApply with "list[1]" where I have "xml_doc" above. Thanks.
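A minimal sketch of that idea, assuming a made-up two-location document: each node in the set can serve as the context for a further relative XPath query. Double brackets extract the node itself, whereas single brackets on an R list return a one-element list. The names `fragments`, `first_name`, and `all_names` are illustrative only.

```r
library(XML)

# Made-up sample document (not real trial data)
xml_doc <- xmlParse(
  "<clinical_study>
     <location><facility><name>Clinic A</name></facility></location>
     <location><facility><name>Clinic B</name></facility></location>
   </clinical_study>",
  asText = TRUE
)

# The set of <location> fragments; no reparsing with xmlParse() is needed
fragments <- getNodeSet(xml_doc, "//clinical_study/location")

# Query inside one fragment: [[1]] is the node, and "./" is relative to it
first_name <- xpathSApply(fragments[[1]], "./facility/name", xmlValue)

# The same query applied to every fragment in turn
all_names <- sapply(fragments, function(node) {
  xpathSApply(node, "./facility/name", xmlValue)
})
```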

userJT Over a year ago

Use dput(object) to see the structure of the object. Also, note that you can parse the XML file with two functions, xmlParse and xmlTreeParse, which produce different output, and then there is the internal-nodes yes/no option on top of that. It is hard to decipher all that. I wish the XML package had more for-dummies vignettes and 3+ examples for each problem.
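To make the difference between the two parsers concrete, here is a small sketch on a made-up one-element document. It only illustrates the two object types and how each is queried; `sample_xml`, `internal_doc`, and `tree_doc` are hypothetical names.

```r
library(XML)

# Made-up sample document
sample_xml <- "<clinical_study><brief_title>Example study</brief_title></clinical_study>"

# xmlParse() keeps the document as C-level (internal) nodes, which the
# XPath functions (getNodeSet, xpathSApply) operate on
internal_doc <- xmlParse(sample_xml, asText = TRUE)
title_internal <- xpathSApply(internal_doc, "/clinical_study/brief_title", xmlValue)

# xmlTreeParse() with useInternalNodes = FALSE builds an R-level tree
# instead; it is navigated with list-style indexing rather than XPath
tree_doc <- xmlTreeParse(sample_xml, asText = TRUE, useInternalNodes = FALSE)
title_tree <- xmlValue(xmlRoot(tree_doc)[["brief_title"]])

# Inspect which kind of object each parser returned
class(internal_doc)
class(tree_doc)
```

For the XPath-based approaches discussed in this thread, the internal-node form returned by xmlParse is the one to use.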

Lauren Fitch Over a year ago

The argument above should be "//clinical_study/location". (Double forward slash)
