I need to extract data from a large XML file in R. The file size is 60 MB. I use the following R code to download the data from the Internet:
library(XML)
library(httr)
url = "http://hydro1.sci.gsfc.nasa.gov/daac-bin/his/1.0/NLDAS_NOAH_002.cgi"
SOAPAction = "http://www.cuahsi.org/his/1.0/ws/GetSites"
envelope = "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<soap:Envelope xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\" xmlns:soap=\"http://schemas.xmlsoap.org/soap/envelope/\">\n<soap:Body>\n<GetSites xmlns=\"http://www.cuahsi.org/his/1.0/ws/\">\n<site></site><authToken></authToken>\n</GetSites>\n</soap:Body>\n</soap:Envelope>"
response = POST(url, body = envelope,
                add_headers("Content-Type" = "text/xml", "SOAPAction" = SOAPAction))
status.code = http_status(response)$category
Once I have received the response from the server, I use the following code to parse the data into a data.frame:
# Parse the XML into a tree
WaterML = content(response, as="text")
SOAPdoc = xmlRoot(xmlTreeParse(WaterML, getDTD=FALSE, useInternalNodes = TRUE))
doc = SOAPdoc[[1]][[1]][[1]]
# Allocate a new empty data frame with the same number of rows as the number of sites
N = xmlSize(doc) - 1
df = data.frame(SiteName=rep("",N),
                SiteID=rep(NA, N),
                SiteCode=rep("",N),
                Latitude=rep(NA,N),
                Longitude=rep(NA,N),
                stringsAsFactors=FALSE)
# Populate the data frame with the values
# This loop is VERY SLOW: it takes around 10 minutes!
start.time = Sys.time()
for(i in 1:N){
  siteInfo = doc[[i+1]][[1]]
  siteList = xmlToList(siteInfo)
  siteName = siteList$siteName
  sCode = siteList$siteCode
  siteCode = sCode$text
  siteID = ifelse(is.null(sCode$.attrs["siteID"]), siteCode, sCode$.attrs["siteID"])
  latitude = as.numeric(siteList$geoLocation$geogLocation$latitude)
  longitude = as.numeric(siteList$geoLocation$geogLocation$longitude)
}
end.time = Sys.time()
time.taken = end.time - start.time
time.taken
The for loop that I use to parse the XML into a data.frame is very slow: it takes around 10 minutes to complete. Is there any way to make it faster?
You are not using df in the loop, so it's not populating the data frame! Write it as a function with N as a parameter, so you can test it on fewer rows and not have to wait 20 minutes to see if your code works. Avoid the xmlToList conversion when extracting the elements. Try accessing the nodes by name, e.g. doc[[123]][[1]][["geoLocation"]][["geogLocation"]][["latitude"]][["text"]] gets you the latitude. Or access them by number if you are confident the format is constant (e.g. doc[[123]][[1]][[3]][[1]][[1]][["text"]]). Also, do the conversions to numeric at the end, on the whole data frame column (df$latitude = as.numeric(df$latitude)).
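The advice above can be sketched roughly as follows. This is an illustrative assumption, not your exact data: the tiny hand-made WaterML-like snippet stands in for the real 60 MB response, and the parseSites function name is made up. It populates df inside the loop, accesses nodes by name via xmlValue/xmlGetAttr from the XML package instead of going through xmlToList, takes N as a parameter so it can be tested on a few rows, and does the numeric conversion once at the end:

```r
library(XML)

# Small stand-in for the real sitesResponse (illustrative structure)
sample_xml = '<sitesResponse>
  <queryInfo/>
  <site>
    <siteInfo>
      <siteName>Site A</siteName>
      <siteCode siteID="101">X1</siteCode>
      <geoLocation><geogLocation>
        <latitude>40.5</latitude><longitude>-111.9</longitude>
      </geogLocation></geoLocation>
    </siteInfo>
  </site>
  <site>
    <siteInfo>
      <siteName>Site B</siteName>
      <siteCode siteID="102">X2</siteCode>
      <geoLocation><geogLocation>
        <latitude>41.0</latitude><longitude>-112.1</longitude>
      </geogLocation></geoLocation>
    </siteInfo>
  </site>
</sitesResponse>'

doc = xmlRoot(xmlTreeParse(sample_xml, getDTD = FALSE, useInternalNodes = TRUE))

# N defaults to all sites; pass a small N to test on a few rows first
parseSites = function(doc, N = xmlSize(doc) - 1) {
  df = data.frame(SiteName=rep("",N), SiteID=rep(NA,N), SiteCode=rep("",N),
                  Latitude=rep("",N), Longitude=rep("",N),
                  stringsAsFactors=FALSE)
  for (i in 1:N) {
    # child 1 of the root is queryInfo, so sites start at child i+1
    siteInfo = doc[[i+1]][["siteInfo"]]
    df$SiteName[i] = xmlValue(siteInfo[["siteName"]])
    df$SiteCode[i] = xmlValue(siteInfo[["siteCode"]])
    siteID = xmlGetAttr(siteInfo[["siteCode"]], "siteID")
    df$SiteID[i] = if (is.null(siteID)) df$SiteCode[i] else siteID
    geog = siteInfo[["geoLocation"]][["geogLocation"]]
    df$Latitude[i] = xmlValue(geog[["latitude"]])
    df$Longitude[i] = xmlValue(geog[["longitude"]])
  }
  # convert whole columns once at the end, not inside the loop
  df$Latitude = as.numeric(df$Latitude)
  df$Longitude = as.numeric(df$Longitude)
  df
}

df = parseSites(doc)
```

On the real response you would call the same function on SOAPdoc[[1]][[1]][[1]], first with a small N to check correctness, then with the full N for the timing.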