I am trying to extract information from an XML file from ClinicalTrials.gov. The file is organized in the following way:

<clinical_study>
  ...
  <brief_title>
  ...
  <location>
    <facility>
      <name>
      <address>
        <city>
        <state>
        <zip>
        <country>
    </facility>
    <status>
    <contact>
      <last_name>
      <phone>
      <email>
    </contact>
  </location>
  <location>
    ...
  </location>
  ...
</clinical_study>

I can use the R XML package from CRAN in the following code to extract all location nodes from the XML file:

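library(XML)
clinicalTrialUrl <- "http://clinicaltrials.gov/ct2/show/NCT01480479?resultsxml=true"
# Parse into an internal document so that XPath queries can be run against it
xmlDoc <- xmlParse(clinicalTrialUrl, useInternalNodes = TRUE)
# One data frame row per <location> node
locations <- xmlToDataFrame(getNodeSet(xmlDoc, "//location"))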

This works reasonably well. However, if you look at the data frame, you will notice that the xmlToDataFrame function lumped everything under <facility> into a single concatenated string. One solution would be to write code that generates the data frame column by column.
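
A minimal sketch of the column-by-column idea, for illustration only. It assumes a small inline document with made-up values in place of the real ClinicalTrials.gov file, and `firstValue` is a hypothetical helper, not part of the XML package. Each column is built by evaluating a relative XPath inside every location node, so a location that lacks an element gets NA instead of shifting the rows.

```r
library(XML)

# Inline stand-in for the real file (made-up values; second location has no <status>)
xmlText <- '<clinical_study>
  <location>
    <facility><name>Site A</name><address><city>Birmingham</city><country>United States</country></address></facility>
    <status>Recruiting</status>
  </location>
  <location>
    <facility><name>Site B</name><address><city>Kochi</city><country>India</country></address></facility>
  </location>
</clinical_study>'
doc  <- xmlParse(xmlText, asText = TRUE)
locs <- getNodeSet(doc, "//location")

# Hypothetical helper: first match of a relative XPath under `node`, or NA if absent
firstValue <- function(node, path) {
  hits <- xpathSApply(node, path, xmlValue)
  if (length(hits) == 0) NA_character_ else hits[[1]]
}

# Build the data frame one column at a time
locations <- data.frame(
  name    = sapply(locs, firstValue, "./facility/name"),
  city    = sapply(locs, firstValue, "./facility/address/city"),
  country = sapply(locs, firstValue, "./facility/address/country"),
  status  = sapply(locs, firstValue, "./status"),
  stringsAsFactors = FALSE
)
```

The drawback of this approach is that every column has to be listed by hand, which is why a generic flattening step may be preferable for deeper documents.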

  • You can do something like: xpathSApply(xmlDoc, "//clinical_study/location/facility/name", xmlValue) to suck each component of <facility> out separately. I'm not sure how to do it in one fell swoop though. Commented Mar 25, 2014 at 5:34
  • What you did worked perfectly for me. My XML file was simple. Commented Jul 23, 2014 at 20:16

2 Answers

You could flatten the XML first.

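# Collect every text leaf under x into a flat list, naming each value after its parent element
flatten_xml <- function(x) {
  if (length(xmlChildren(x)) == 0) {
    structure(list(xmlValue(x)), .Names = xmlName(xmlParent(x)))
  } else {
    Reduce(append, lapply(xmlChildren(x), flatten_xml))
  }
}

# One single-row data frame per <location> node
dfs <- lapply(getNodeSet(xmlDoc,"//location"), function(x) data.frame(flatten_xml(x)))
# Union of all column names seen in any location
allnames <- unique(c(lapply(dfs, colnames), recursive = TRUE))
# Add the columns each data frame lacks as NA, then stack the rows
df <- do.call(rbind, lapply(dfs, function(df) { df[, setdiff(allnames,colnames(df))] <- NA; df }))
head(df)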

 #          city      state   zip       country     status          last_name        phone                    email               last_name.1
 # 1  Birmingham    Alabama 35294 United States Recruiting Louis B Nabors, MD 205-934-1813          [email protected]        Louis B Nabors, MD
 # 2      Mobile    Alabama 36604 United States Recruiting Melanie Alford, RN 251-445-9649     [email protected]    Pamela Francisco, CCRP
 # 3     Phoenix    Arizona 85013 United States Recruiting     Lynn Ashby, MD 602-406-6262           [email protected]            Lynn Ashby, MD
 # 4      Tucson    Arizona 85724 United States Recruiting         Jamie Holt 520-626-6800 [email protected] Baldassarre Stea, MD, PhD
 # 5 Little Rock   Arkansas 72205 United States Recruiting   Wilma Brooks, RN 501-686-8530       [email protected]       Amanda Eubanks, APN
 # 6    Berkeley California 94704 United States  Withdrawn               <NA>         <NA>                     <NA>                      <NA>
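
The padding step can be tried on its own, without any XML. The sketch below uses two toy data frames with made-up values and base R only; it adds each missing column as NA before calling rbind, which is the same idea as the `setdiff` line in the answer, written as an explicit loop.

```r
# Two toy single-row data frames with partly different columns (made-up values)
dfs <- list(
  data.frame(city = "Birmingham", status = "Recruiting", stringsAsFactors = FALSE),
  data.frame(city = "Berkeley", stringsAsFactors = FALSE)
)

# Union of all column names across the list
allnames <- unique(unlist(lapply(dfs, colnames)))

# Add every column a data frame lacks, filled with NA
padded <- lapply(dfs, function(d) {
  for (nm in setdiff(allnames, colnames(d))) d[[nm]] <- NA
  d
})

# All pieces now share the same columns, so rbind can stack them
combined <- do.call(rbind, padded)
```

Because rbind on data frames matches columns by name, the pieces do not need to have their columns in the same order, only the same set of names.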

3 Comments

Thanks, it worked. For some reason my compiler didn't like the syntax of the function, so I had to change it to this: flatten_xml <- function(x) { if (length(xmlChildren(x)) == 0) {structure(list(xmlValue(x)), .Names = xmlName(xmlParent(x)))} else {Reduce(append, lapply(xmlChildren(x), flatten_xml))} }
Yes, I think we are using different versions. Fixed.
Don't forget to accept my answer when you get a chance. :)

This answer converts each location node to a nested list, unlists it into a named vector, transposes that vector into a one-row matrix, converts the row to a data.table, and then uses rbindlist to merge all of the individual locations into one table. The fill=TRUE argument matches the elements by name and fills in missing element values with NA.

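library(XML); library(data.table)

clinicalTrialUrl <- "http://clinicaltrials.gov/ct2/show/NCT01480479?resultsxml=true"
xmlDoc <- xmlParse(clinicalTrialUrl, useInternalNodes = TRUE)

# Build one single-row data.table per node matched by `path`, then stack them
xmlToDT <- function(doc, path) {
  rbindlist(
    lapply(getNodeSet(doc, path),
           # nested list -> named vector -> one-row matrix -> one-row data.table
           function(x) data.table(t(unlist(xmlToList(x))))
    ), fill = TRUE)  # pad columns that a node lacks with NA
}

locationDT <- xmlToDT(xmlDoc, "//location")
locationDT[1:6]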
##                                                                       facility.name facility.address.city facility.address.state facility.address.zip
## 1:                                                                "HYGEIA" Hospital               Marousi     District of Attica               151 23
## 2: Allina Health, Abbott Northwestern Hospital, John Nasseff Neuroscience Institute           Minneapolis              Minnesota                55407
## 3:                  Amrita Institute of Medical Sciences and Research Centre, Kochi                 Kochi                 Kerala              682 026
## 4:                                                      Anne Arundel Medical Center             Annapolis               Maryland                21401
## 5:                                                              Atlanta Cancer Care               Atlanta                Georgia                30005
## 6:                                                                    Austin Health            Heidelberg               Victoria                 3084
##    facility.address.country
## 1:                   Greece
## 2:            United States
## 3:                    India
## 4:            United States
## 5:            United States
## 6:                Australia
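
To see where the dotted column names come from, the intermediate steps can be run on a tiny inline document. This is only a sketch: the XML below is made up, and the comments describe what each step is expected to produce under the default xmlToList settings.

```r
library(XML)
library(data.table)

# Made-up stand-in for the real file; the second location has no <status>
xmlText <- '<clinical_study><location><facility><name>Site A</name><address><city>Birmingham</city></address></facility><status>Recruiting</status></location><location><facility><name>Site B</name><address><city>Kochi</city></address></facility></location></clinical_study>'
doc  <- xmlParse(xmlText, asText = TRUE)
locs <- getNodeSet(doc, "//location")

# Step 1: nested list mirroring the element hierarchy
asList <- xmlToList(locs[[1]])

# Step 2: unlist() joins the nesting levels with ".", e.g. "facility.address.city"
asVector <- unlist(asList)

# Step 3: t() turns the named vector into a one-row matrix whose column
# names data.table() then keeps
oneRow <- data.table(t(asVector))

# Same steps for every location; fill = TRUE pads the missing status with NA
dt <- rbindlist(lapply(locs, function(x) data.table(t(unlist(xmlToList(x))))),
                fill = TRUE)
```

All columns come out as character, so numeric fields would need an explicit conversion afterwards.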
