
I have a collection of XML files, and some of them are pretty big (up to ~50 million element nodes). I am using xmllint for validating those files, which works pretty nicely even for the huge ones thanks to the streaming API.

xmllint --loaddtd --stream --valid /path/to/huge.xml

I recently learned that xmllint is also capable of doing command line XPath queries, which is very handy.

xmllint --loaddtd --xpath '/root/a/b/c/text()' /path/to/small.xml

However, these XPath queries do not work for the huge XML files. I just receive a "Killed" message after some time. I tried to enable the streaming API, but this just leads to no output at all.

xmllint --loaddtd --stream --xpath '/root/a/b/c/text()' /path/to/huge.xml

Is there a way to enable streaming mode when doing XPath queries using xmllint? Are there other/better ways to do command line XPath queries for huge XML files?

  • Try the --shell option for interactive mode (pass just the XML file path). Commented May 18, 2015 at 14:42
  • I tried opening the interactive shell for a huge file, but it will crash ("Killed", just as in the case of not using --stream) before I can enter any command. Commented May 18, 2015 at 15:00
  • superuser.com/questions/543881/… Commented Oct 7, 2015 at 12:57
  • Attaching a sample XML file would help. I, for one, have no idea what "large" might mean in your case. Commented Jan 30, 2016 at 9:42
  • Think of something like the dblp XML dump (dblp.dagstuhl.de/xml). I receive the "Killed" error when parsing that file in a non-streaming context. But my question is aimed at essentially any file big enough that you would be ill-advised to build a DOM in main memory and should use a streaming approach instead. Commented Feb 1, 2016 at 10:33

2 Answers

5

If your XPath expressions are very simple, try xmlcutty.

From the homepage:

xmlcutty is a simple tool for carving out elements from large XML files, fast. Since it works in a streaming fashion, it uses almost no memory and can process around 1G of XML per minute.
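
To illustrate what streaming extraction of a simple path involves, here is a minimal Python sketch built on the standard library's xml.etree.ElementTree.iterparse. It is not how xmlcutty is implemented, and the function name stream_text is made up for this example. It assumes a plain absolute path of element names such as /root/a/b/c (no predicates, wildcards, or namespaces) and yields only the text that precedes any child element of a match. Note also that ElementTree does not load an external DTD, so entities defined only in the DTD will likely cause a parse error.

```python
import io
import xml.etree.ElementTree as ET


def stream_text(source, path):
    """Yield the leading text of every element whose absolute path is `path`.

    `source` is a filename or a binary file object. `path` is a plain
    slash-separated list of element names (assumption of this sketch:
    no predicates, wildcards, or namespaces).
    """
    target = [name for name in path.split("/") if name]
    stack = []  # open elements, from the root down to the current one
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem)
            continue
        # "end" event: elem is fully parsed and sits on top of the stack
        if [e.tag for e in stack] == target:
            yield elem.text or ""
        stack.pop()
        if stack:
            # Detach the finished element from its parent so the
            # partial tree does not grow with the size of the file
            stack[-1].remove(elem)


# Tiny in-memory demo; for a real file, pass its path instead
sample = io.BytesIO(b"<root><a><b><c>one</c><c>two</c></b></a></root>")
for text in stream_text(sample, "/root/a/b/c"):
    print(text)
```

Because every finished element is removed from its parent, memory use should depend on the nesting depth and the parser's read buffer rather than on the file size.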


1 Comment

A command like xmllint --loaddtd --xpath '/root/a/b/c/text()' /path/to/small.xml would translate to xmlcutty -path '/root/a/b/c' -rename '\n' /path/to/small.xml. The -rename option is meant to rename the last enclosing element, which here simulates text(). The syntax is a bit arcane.
-1

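Changing the shell's resource limits might work. For example, set the soft virtual memory limit (the value is in kilobytes in bash) before running xmllint: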

$ ulimit -Sv 500000
$ xmllint (...your command)
