
I need to extract data from a large XML file in R. The file size is 60 MB. I use the following R code to download the data from the Internet:

library(XML)
library(httr)

# SOAP endpoint and action for the CUAHSI HIS GetSites web method
url = "http://hydro1.sci.gsfc.nasa.gov/daac-bin/his/1.0/NLDAS_NOAH_002.cgi"
SOAPAction = "http://www.cuahsi.org/his/1.0/ws/GetSites"
envelope = "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<soap:Envelope xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\" xmlns:soap=\"http://schemas.xmlsoap.org/soap/envelope/\">\n<soap:Body>\n<GetSites xmlns=\"http://www.cuahsi.org/his/1.0/ws/\">\n<site></site><authToken></authToken>\n</GetSites>\n</soap:Body>\n</soap:Envelope>"

# Send the SOAP request
response = POST(url, body = envelope,
                add_headers("Content-Type" = "text/xml", "SOAPAction" = SOAPAction))
status.code = http_status(response)$category  # should be "Success"

Once I have received the response from the server, I use the following code to parse the data into a data.frame:

# Parse the XML into a tree
WaterML = content(response, as="text")
SOAPdoc = xmlRoot(xmlTreeParse(WaterML, getDTD=FALSE, useInternalNodes = TRUE))
# Drill down through the SOAP envelope to the element that contains the sites
doc = SOAPdoc[[1]][[1]][[1]]

# Allocate a new empty data frame with the same number of rows as the number of sites
N = xmlSize(doc) - 1
df = data.frame(SiteName=rep("",N),
             SiteID=rep(NA, N),
             SiteCode=rep("",N),
             Latitude=rep(NA,N),
             Longitude=rep(NA,N),
             stringsAsFactors=FALSE)

# Populate the data frame with the values
# This loop is VERY SLOW: it takes around 10 MINUTES!
start.time = Sys.time()

for (i in 1:N) {
  siteInfo = doc[[i+1]][[1]]
  siteList = xmlToList(siteInfo)
  sCode = siteList$siteCode
  siteCode = sCode$text
  siteID = ifelse(is.null(sCode$.attrs["siteID"]), siteCode, sCode$.attrs["siteID"])
  df$SiteName[i] = siteList$siteName
  df$SiteID[i] = siteID
  df$SiteCode[i] = siteCode
  df$Latitude[i] = as.numeric(siteList$geoLocation$geogLocation$latitude)
  df$Longitude[i] = as.numeric(siteList$geoLocation$geogLocation$longitude)
}

end.time = Sys.time()
time.taken = end.time - start.time
time.taken

The for loop that I use to parse the XML into a data.frame is very slow. It takes around 10 minutes to complete. Is there any way to make the loop faster?

  • This is a very large XML dataset, so it's not surprising that it takes quite some time to parse with the XML-specific libraries. If the data are extremely structured, you can write your own looping structure with some regular expressions to parse them (see the regex sketch after these comments). But this seems like a one-time problem, so 10 minutes seems a decent trade-off for a solution that would likely take longer than 10 minutes to write? Commented Jun 11, 2015 at 0:51
  • For me this is not a one-time problem, because the online XML dataset is updated every day, so I need to make the parsing as fast as possible. Commented Jun 11, 2015 at 1:30
  • Is the bottleneck in calling xmlToList so many times (what's a typical value of N)? Could you convert the whole XML document to a list once and work with that? 60 MB isn't really "large" (it should fit in RAM), so I'd expect that to be possible, and it might be faster. Commented Jun 11, 2015 at 6:37
  • Write the loop as a function with N as a parameter so you can test it on fewer rows, instead of waiting 10 minutes to see if your code works. Commented Jun 11, 2015 at 7:45
  • You don't need to do the xmlToList conversion in order to extract the elements. Try accessing the nodes by name, e.g. doc[[123]][[1]][["geoLocation"]][["geogLocation"]][["latitude"]][["text"]] gets you the latitude. Or by number if you are confident the format is constant (e.g. doc[[123]][[1]][[3]][[1]][[1]][["text"]]). Also, do the conversion to numeric at the end on the whole data frame column (df$latitude = as.numeric(df$latitude)). A combined sketch follows these comments. Commented Jun 11, 2015 at 7:50
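
For reference, here is a minimal sketch of the regex idea from the first comment. It assumes the raw response text is still in WaterML (from the question) and that latitude values appear in plain <latitude>...</latitude> elements with no namespace prefix; both are assumptions to check against the actual document.

# Regex sketch: assumes `WaterML` holds the raw response text and that the
# <latitude> tags carry no namespace prefix -- verify before relying on this
matches <- regmatches(WaterML,
                      gregexpr("<latitude>[^<]+</latitude>", WaterML))[[1]]
latitudes <- as.numeric(gsub("</?latitude>", "", matches))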
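
And a sketch combining the later suggestions: skip xmlToList, access the child nodes by name, wrap the loop in a function that takes the row count as a parameter so it can be tested on a small subset, and convert to numeric once at the end. The child node names are taken from the comment above and may not match the actual document exactly.

# Node-access sketch based on the comments above; the child node names
# ("siteName", "geoLocation", ...) are assumptions to confirm against the
# actual document structure.
parseSites <- function(doc, n) {
  df <- data.frame(SiteName = character(n),
                   Latitude = character(n),
                   Longitude = character(n),
                   stringsAsFactors = FALSE)
  for (i in seq_len(n)) {
    siteInfo <- doc[[i + 1]][[1]]
    df$SiteName[i] <- xmlValue(siteInfo[["siteName"]])
    geog <- siteInfo[["geoLocation"]][["geogLocation"]]
    df$Latitude[i] <- xmlValue(geog[["latitude"]])
    df$Longitude[i] <- xmlValue(geog[["longitude"]])
  }
  # one vectorised conversion at the end instead of per-row as.numeric()
  df$Latitude <- as.numeric(df$Latitude)
  df$Longitude <- as.numeric(df$Longitude)
  df
}

# test on 100 rows first instead of waiting for all N sites
head(parseSites(doc, 100))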

1 Answer


I was able to get better performance by using XPath expressions to extract the information you want. Each call to xpathSApply takes ~20 seconds on my laptop, so all the commands complete in less than 2 minutes.

# you need to specify the namespace information
ns <- c(soap="http://schemas.xmlsoap.org/soap/envelope/",
        xsd="http://www.w3.org/2001/XMLSchema",
        xsi="http://www.w3.org/2001/XMLSchema-instance",
        sr="http://www.cuahsi.org/waterML/1.0/",
        gsr="http://www.cuahsi.org/his/1.0/ws/")

Data <- list(
  siteName = xpathSApply(SOAPdoc, "//sr:siteName", xmlValue, namespaces=ns),
  siteCode = xpathSApply(SOAPdoc, "//sr:siteCode", xmlValue, namespaces=ns),
  siteID = xpathSApply(SOAPdoc, "//sr:siteCode", xmlGetAttr, name="siteID", namespaces=ns),
  latitude = xpathSApply(SOAPdoc, "//sr:latitude", xmlValue, namespaces=ns),
  longitude = xpathSApply(SOAPdoc, "//sr:longitude", xmlValue, namespaces=ns))
DataFrame <- as.data.frame(Data, stringsAsFactors=FALSE)
DataFrame$latitude <- as.numeric(DataFrame$latitude)
DataFrame$longitude <- as.numeric(DataFrame$longitude)

1 Comment

Very good solution, it took ~2 minutes on my laptop to complete. And thank you for the example of how to use xpathSApply. I suspected that the XPath functions would be faster, but I couldn't figure out how to correctly specify the namespace information. Your answer is a great example of how to use XPath in R to parse an XML document with lots of namespaces.
