I need to process a large KML file (>3 MiB). To inspect it, I need to read through it, but there are so many Style and StyleMap nodes that manual browsing is impossible. I have decided to remove the unnecessary nodes programmatically with Node.js. Parsing an XML file with Node.js is rather easy, for example with sax or xmldom. The tricky part is excluding certain nodes and their children while keeping all the others. With sax this becomes a rather complex task: because the output is also XML, every kept node must be re-emitted along with its attributes and children. I feel there should be a simpler and more robust solution. Any suggestions and code snippets?

  • Search any xml-parser package on npm, include it, read your file, remove certain nodes, save to file and voilà. What exactly are you asking? Commented Oct 7, 2017 at 23:34
  • @xDreamCoding Thanks, I was looking for a general approach, like the one you briefly described, and a code snippet, especially for the part about how the nodes should be removed. I edited the question to be more specific. I found that xpath might be able to do this. If it works well, I guess I will implement an npm module for this. Commented Oct 8, 2017 at 10:37
  • You want to transform the XML file. XSLT is your friend. Commented Oct 8, 2017 at 10:49
  • you can select nodes you want with camaro Commented Oct 16, 2017 at 16:10
  • @AnhThangBui Thanks for the tip. However, the problem is a bit different. I do not know all the nodes and props beforehand because the file I'm trying to process is so large that I cannot inspect it. I just want to remove matching nodes and keep all the rest, regardless of their names, props, or children. Commented Oct 17, 2017 at 18:21

1 Answer

One way is to use xmldom and xpath. First, select the nodes to remove by evaluating an XPath expression with xpath. The call returns an array of xmldom nodes, each of which can be detached from the DOM tree through its parent. For example, to remove all book nodes:

var xmldom = require('xmldom');
var xpath = require('xpath');

var parser = new xmldom.DOMParser();
var serializer = new xmldom.XMLSerializer();

var xmlIn = '<bookstore>' +
    '<book>Animal Farm</book>' +
    '<book>Nineteen Eighty-Four</book>' +
    '<essay>Reflections on Writing</essay>' +
  '</bookstore>';

// Parse the string into a DOM document.
var root = parser.parseFromString(xmlIn, 'text/xml');

// Select every book element anywhere in the document.
var nodes = xpath.select('//book', root);

// Detach each match, together with its children, from its parent.
nodes.forEach(function (n) {
  n.parentNode.removeChild(n);
});

// Serialize what is left back to an XML string.
var xmlOut = serializer.serializeToString(root);

However, handling namespaces, multiple XPath expressions, and indentation preservation by hand is a struggle. Therefore I created an npm module, filterxml, to do the heavy lifting:

var filterxml = require('filterxml');
var patterns = ['//book'];
var namespaces = {};
filterxml(xmlIn, patterns, namespaces, function (err, xmlOut) {
  console.log(xmlOut);
});

This will output:

<bookstore><essay>Reflections on Writing</essay></bookstore>