I'm stuck trying to parse a big XML file into an R data.frame object. The XML has the following structure:

<?xml version="1.0" encoding="ISO-8859-1"?>
<?eclipse version="3.0"?>
<ROOT>
  <row>
    <field name="dtcreated"></field>
    <field name="headline"></field>
    <subheadline/>
    <field name="body"></field>
  </row>
  <row>
    <field name="dtcreated"></field>
    <field name="headline"></field>
    <subheadline/>
    <field name="body"></field>
  </row>
</ROOT>

The plyr convenience functions didn't help, since the XML couldn't be validated. So I came up with the following code, using XPath queries:

adHocXml<-xmlTreeParse(adHocXmlPath,getDTD = FALSE)
adHocRoot<-xmlRoot(adHocXml)
creationDateColumn<-sapply(getNodeSet(adHocRoot,"//row//field[@name='dtcreated']"), xmlValue)
headlineColumn<-sapply(getNodeSet(adHocRoot,"//row//field[@name='headline']"), xmlValue)
bodyColumn<-sapply(getNodeSet(adHocRoot,"//row//field[@name='body']"), xmlValue)
adHocData<-data.frame(creationDate=creationDateColumn,headline=headlineColumn,body=bodyColumn)

The code does exactly what I expect it to do for a short file. With a large file and several thousand row-tags however, I get the following error after about 10 minutes:

Error: 1: internal error: Huge input lookup
2: Extra content at the end of the document 

Can anyone help me?

1 Answer

libxml2, the C library underneath the R package XML, has an upper limit on the size of a single node. You can turn this limit off by enabling the parser flag XML_PARSE_HUGE. In the R package XML you would do this as:

library(XML)
xmlParse(myXML, options = HUGE)

You may also want to look at xmlEventParse, which processes the document as a stream of events instead of building the whole tree. Martin Morgan provides a good example of its use here.
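As a rough illustration of the event-based approach, the sketch below collects the three fields from the question with xmlEventParse callbacks. The function name parseRowsSax and its argument wanted are hypothetical, and this is not Martin Morgan's example. It assumes each row holds at most one field per name. Whether event-based parsing avoids the node-size limit for this particular file is not established in this thread.

```r
library(XML)

# Hypothetical helper (not part of the XML package): gather the wanted
# <field> values of every <row> using SAX-style callbacks.
parseRowsSax <- function(path, wanted = c("dtcreated", "headline", "body")) {
  rows <- list()         # one named character vector per <row>
  current <- NULL        # values of the row being read
  fieldName <- NULL      # name attribute of the open <field>, if any
  buffer <- character()  # text chunks of the open <field>

  handlers <- list(
    startElement = function(name, attrs, ...) {
      if (name == "row") {
        current <<- setNames(rep(NA_character_, length(wanted)), wanted)
      } else if (name == "field" && "name" %in% names(attrs)) {
        fieldName <<- attrs[["name"]]
        buffer <<- character()
      }
    },
    # The text of one node may arrive in several chunks, so accumulate it.
    text = function(content, ...) {
      if (!is.null(fieldName)) buffer <<- c(buffer, content)
    },
    endElement = function(name, ...) {
      if (name == "field" && !is.null(fieldName)) {
        if (fieldName %in% wanted) {
          current[[fieldName]] <<- paste(buffer, collapse = "")
        }
        fieldName <<- NULL
      } else if (name == "row") {
        rows[[length(rows) + 1L]] <<- current
      }
    }
  )

  # trim = FALSE keeps whitespace at chunk boundaries intact.
  xmlEventParse(path, handlers = handlers, trim = FALSE)
  as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
}
```

Calling parseRowsSax(adHocXmlPath) would return a data.frame with the columns dtcreated, headline and body. Appending to the rows list on every row is simple but not the fastest option for very large inputs.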

5 Comments

Thanks for your help. I tried adHocXml<-xmlTreeParse(adHocXmlPath,getDTD = FALSE,options = HUGE), but I still run into the same problem. Is this due to the total size of the XML file (20MB) or to the size of individual node texts?
Try xmlParse rather than xmlTreeParse. Or, if you use xmlTreeParse, pass the argument useInternalNodes = TRUE.
xmlParse just returns an empty object. That's why I am using xmlTreeParse, since this was the only method that could cope with my document.
The additional useInternalNodes = TRUE option solved my problem. Thanks a lot!
Happy to help. If the answer solves your problem, consider marking the question as answered by ticking the box on the answer.
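Putting the thread together, the combination that worked was xmlTreeParse with both options = HUGE and useInternalNodes = TRUE. A minimal sketch of the question's code with those two arguments added follows. The wrapper name readAdHocXml is hypothetical, and the sketch assumes every row contains all three fields, since otherwise the columns would differ in length.

```r
library(XML)

# Hypothetical wrapper around the question's code, with the two arguments
# suggested in the answer and comments added.
readAdHocXml <- function(path) {
  # options = HUGE enables XML_PARSE_HUGE; useInternalNodes = TRUE keeps the
  # parsed tree in libxml2's own structures instead of converting it to R lists.
  doc <- xmlTreeParse(path, getDTD = FALSE,
                      useInternalNodes = TRUE, options = HUGE)

  # Extract the text of every <field> with the given name attribute.
  fieldColumn <- function(fieldName) {
    xpath <- sprintf("//row//field[@name='%s']", fieldName)
    sapply(getNodeSet(doc, xpath), xmlValue)
  }

  data.frame(creationDate = fieldColumn("dtcreated"),
             headline = fieldColumn("headline"),
             body = fieldColumn("body"),
             stringsAsFactors = FALSE)
}
```

With the path from the question this would be called as adHocData <- readAdHocXml(adHocXmlPath).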
