
I have a file containing multiple XML declarations. Following this post, I was able to detect the declarations and read each document individually: Parseing XML by R always return XML declaration error. The data comes from: https://www.google.com/googlebooks/uspto-patents-applications-text.html.

### read xml document with more than one <?xml declaration in R

library(XML)

# Read the whole file and locate the line of each XML declaration
lines   <- readLines("pa020829.xml")
start   <- grep('<?xml version="1.0" encoding="UTF-8"?>', lines, fixed = TRUE)
end     <- c(start[-1] - 1, length(lines))

# Parse the i-th document, which spans lines start[i] to end[i]
get.xml <- function(i) {
  txt <- paste(lines[start[i]:end[i]], collapse = "\n")
  xmlTreeParse(txt, asText = TRUE)
}
docs <- lapply(1:10, get.xml)

> class(docs)
[1] "list"
> class(docs[1])
[1] "list"
> class(docs[[1]])
[1] "XMLDocument"         "XMLAbstractDocument"

The list docs contains 10 similar documents: docs[[1]], docs[[2]], and so on. I managed to extract the root of a single document and insert its data into a matrix:

root <- xmlRoot(docs[[1]])

d <- rbind(unlist(xmlSApply(root[[1]], function(x) xmlSApply(x, xmlValue))))

However, I need to write code that would automatically retrieve the data of all 10 documents and attach them to a single data frame. I tried the code below but it only retrieves the data of the first document's root and attaches it multiple times to the matrix.

d <- lapply(docs, function(x) rbind(unlist(xmlSApply(root, function(x) xmlSApply(x, xmlValue)))))

I guess I need to change the way I call the root in the function.

Any idea on how to create a matrix with the data from all the documents?

1 Answer

The following code will return a matrix containing the data from all the documents:

getXmlInternal <- function(x) {
  rbind(unlist(xmlSApply(xmlRoot(x), function(y) xmlSApply(y, xmlValue))))
}

d <- rbind(lapply(docs, function(x) getXmlInternal(x)))

This fixes the xmlRoot issue you mention by running that command on each of the documents supplied by the lapply command. The lapply command is wrapped in a call to rbind to ensure the output is in a matrix as requested.

The getXmlInternal function is included to make the answer a little more readable.
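
To see why the original attempt repeated the first document, here is a minimal sketch in base R. The names fake_docs and get_root are hypothetical stand-ins for the parsed documents and for xmlRoot; the only point is that a variable computed once outside the anonymous function is reused on every iteration.

```r
# Hypothetical stand-ins: plain lists play the role of parsed XML documents,
# and get_root() plays the role of xmlRoot().
fake_docs <- list(list(root = "doc1"), list(root = "doc2"), list(root = "doc3"))
get_root  <- function(doc) doc$root

# Computed once, outside the loop, from the first document only.
root <- get_root(fake_docs[[1]])

# The anonymous function ignores its argument x and reuses `root`.
repeated <- sapply(fake_docs, function(x) root)

# Calling get_root() on the argument visits each document in turn.
per_doc <- sapply(fake_docs, function(x) get_root(x))

print(repeated)
print(per_doc)
```

The first call yields the same value for every element, while the second yields one value per document, which is the change the answer makes by calling xmlRoot(x) inside the function.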


3 Comments

perhaps explain a bit about what it's doing for the OP? perhaps also add some spacing (there are no extra points on SO for one-liners)
I have added some explanation and reformatted the code a little to improve readability.
Thank you, @makeyourownmaker, for this. Unfortunately it returns a 1 row x 10 columns matrix with lists inside each row. I need to unlist the characters inside the columns into multiple rows. Can you help me with that?
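
The 1 x 10 result described in the last comment is what rbind produces when it is handed a single list: the list is treated as one row. One possible way around it, sketched here under the assumption that every document yields the same fields in the same order, is do.call(rbind, ...), which passes each list element to rbind as a separate row. The two toy documents and their field names (id, title) are invented for illustration and are not taken from the USPTO file.

```r
library(XML)

# Two toy documents (invented for illustration), each with its own declaration.
xml_txt <- c(
  '<?xml version="1.0" encoding="UTF-8"?><patent><id>1</id><title>First</title></patent>',
  '<?xml version="1.0" encoding="UTF-8"?><patent><id>2</id><title>Second</title></patent>'
)
docs <- lapply(xml_txt, xmlTreeParse, asText = TRUE)

# One named character vector per document.
get_fields <- function(doc) xmlSApply(xmlRoot(doc), xmlValue)
rows <- lapply(docs, get_fields)

# rbind() on the list itself gives a 1-row matrix whose cells are list elements.
dim(rbind(rows))

# do.call() passes each element as a separate argument: one row per document.
m <- do.call(rbind, rows)
d <- as.data.frame(m, stringsAsFactors = FALSE)
print(d)
```

For nested documents like those in the question, each row would probably need flattening with unlist first, and documents whose fields differ would need aligning before they can be bound into one data frame.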
