5

I am trying to parse an input XML file that is 1,300,000 lines long (56 MB) using xsltproc. I get the following error:

input.xml:245393: parser error : internal error: Huge input lookup
              "description" : "List of values for possible department codes"
                          ^
unable to parse input.xml

My xsltproc was able to process an XML file that was 930,000 lines long with a size of 48 MB.

In fact, I tried cutting the file down to 600,000 lines by removing the unnecessary parts. I still get the same error, which is strange, because xsltproc can parse about 900,000 lines but not 600,000.

How do I resolve this issue?

9
  • There is a lookup limit defined in gitlab.gnome.org/GNOME/libxml2/blob/master/include/libxml/… but maxLength as 30 sounds rather like an XSD schema related problem. Is that document referring to a schema? Is the error occurring with some xsl:key processing? Commented Dec 13, 2019 at 6:19
  • "maxLength:30" can be ignored. It's just a string in my input xml. Is there a way I can increase the XML_MAX_LOOKUP_LIMIT? I tried decreasing the xml lines to 600,000. Still, same error, which is strange, because it is able to parse 900,000 but not 600,000. Commented Dec 13, 2019 at 7:06
  • edited question to avoid confusion Commented Dec 13, 2019 at 7:11
  • 4
    48 MB is not a huge document. "Huge" these days is more like 48 GB. Commented Dec 13, 2019 at 9:12
  • 2
    stackoverflow.com/a/32115337/252228 suggests you can edit the source of libxml2 to set the XML_PARSE_HUGE parser option (which then I think disables any security based restrictions/limits normally set by default). Then you need to recompile. Or try to use one of the languages like Python or PHP which use libxml2, it seems they have options (e.g. lxml in lxml.de/parsing.html#parser-options declares huge_tree) to disable the security based limits at run-time. Commented Dec 15, 2019 at 21:44

3 Answers

3

Write your own xsltproc in Python based on this snippet:

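import argparse

from lxml import etree

# Command-line interface: one stylesheet, any number of XML inputs, one output file.
parser = argparse.ArgumentParser()
parser.add_argument('stylesheet', help='XSLT style sheet', type=argparse.FileType('r', encoding='utf-8'))
parser.add_argument('input', help='XML input file(s)', nargs='*', type=argparse.FileType('r', encoding='utf-8'))
# --output is required: write_output() below needs an open file to write to.
parser.add_argument('--output', help='The output file to create.', required=True, type=argparse.FileType('wb'))

args = parser.parse_args()

# Compile the stylesheet once and reuse it for every input.
transform = etree.XSLT(etree.parse(args.stylesheet))

# huge_tree=True lifts libxml2's default security limits on document size and depth.
xml_parser = etree.XMLParser(huge_tree=True)

# Each result is written to the same output file, one after another.
for xml in args.input:
    transform(etree.parse(xml, xml_parser)).write_output(args.output)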

This uses lxml as suggested in this answer.

The huge_tree=True argument sets the corresponding parser option in libxml2 and thus enables it to process large files. See Parser options for more information.
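If you would rather keep the default limits for ordinary files and lift them only when needed, something along these lines might work. This is a minimal sketch, not part of the script above: the function name parse_with_fallback and the sample payload are made up for illustration, and it assumes the input is trusted, since the limits exist to protect against malicious documents.

```python
import io

from lxml import etree


def parse_with_fallback(source):
    """Parse XML with default limits first; retry with huge_tree=True on failure.

    `source` is a bytes payload (hypothetical input used for illustration).
    """
    try:
        # Default parser: libxml2's safety limits stay in force.
        return etree.parse(io.BytesIO(source))
    except etree.XMLSyntaxError:
        # Retry with the limits lifted. Only do this for trusted input.
        relaxed = etree.XMLParser(huge_tree=True)
        return etree.parse(io.BytesIO(source), relaxed)


tree = parse_with_fallback(b"<departments><code>HR</code></departments>")
print(tree.getroot().tag)
```

A document that is genuinely malformed still fails on the second attempt, so the XMLSyntaxError reaches the caller in that case.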


1

libxslt 1.1.35 added a --huge option to xsltproc which disables some internal limits like XML_MAX_LOOKUP_LIMIT.
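If you drive xsltproc from a script, the option can be passed like any other flag. The sketch below is only an illustration: the helper names and the file names style.xsl, input.xml and result.xml are placeholders, and it assumes an xsltproc built against libxslt 1.1.35 or later is on the PATH.

```python
import subprocess


def build_xsltproc_command(stylesheet, xml_input, output, huge=True):
    """Build an xsltproc argument list; all file names are placeholders."""
    command = ["xsltproc"]
    if huge:
        # --huge needs libxslt 1.1.35 or later.
        command.append("--huge")
    # -o names the output file; the stylesheet comes before the input document.
    command += ["-o", output, stylesheet, xml_input]
    return command


def run_transform(stylesheet, xml_input, output):
    # check=True raises CalledProcessError if xsltproc exits with an error.
    subprocess.run(build_xsltproc_command(stylesheet, xml_input, output), check=True)


print(build_xsltproc_command("style.xsl", "input.xml", "result.xml"))
```

Calling run_transform("style.xsl", "input.xml", "result.xml") would then run the actual transformation.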


0

Using Oxygen XML Editor (Xalan) resolved my issue.


Not affiliated with and no intent to recommend any specific commercial product, just wanted to note that Altova's XmlSpy works too. You might also want to try my solution which is completely free.
