How to filter out XML nodes with Node.js?

Question

I need to process a large KML file (>3 MiB). To inspect it, I need to look into it, but there are so many Style and StyleMap nodes that manual browsing becomes impossible. I have decided to remove the unnecessary nodes programmatically with Node.js. It is rather easy to parse an XML file with Node.js, for example by using sax or xmldom. But the tricky part seems to be how to exclude certain nodes and their children and keep all the others. It becomes a rather complex task with sax because the output is XML, so all kept nodes, their attributes, and their children must be processed. I feel there should be a simpler and more robust solution. Any suggestions and code snippets?

Search any xml-parser package on npm, include it, read your file, remove certain nodes, save to file and voilà. What exactly are you asking?
– xDreamCoding, Commented Oct 7, 2017 at 23:34
@xDreamCoding Thanks, I was looking for a general approach, which you briefly described, and a code snippet, especially for the part about how the nodes should be removed. I edited the question to be more specific. I found that xpath might be able to do this. If it works well, I guess I will implement an npm module for this.
– Akseli Palén, Commented Oct 8, 2017 at 10:37
@AnhThangBui Thanks for the tip. However, the problem is a bit different. I do not know all the nodes and props beforehand because the file I'm trying to process is so large that I cannot inspect it. I just want to remove matching nodes and keep all the rest, regardless of their names, props, or children.
– Akseli Palén, Commented Oct 17, 2017 at 18:21

Akseli Palén · Accepted Answer · 2017-10-09 18:34:51Z

One way is to use xmldom and xpath. First, fetch the nodes to remove by evaluating an XPath expression with xpath. The call returns an array of xmldom nodes that can then be removed from the DOM tree. For example, to remove all book nodes:

var xmldom = require('xmldom');
var xpath = require('xpath');

var parser = new xmldom.DOMParser();
var serializer = new xmldom.XMLSerializer();

var xmlIn = '<bookstore>' +
    '<book>Animal Farm</book>' +
    '<book>Nineteen Eighty-Four</book>' +
    '<essay>Reflections on Writing</essay>' +
  '</bookstore>';

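// Parse the XML string into a DOM document.
var root = parser.parseFromString(xmlIn, 'text/xml');

// Select every book element anywhere in the document.
var nodes = xpath.select('//book', root);

// Detach each match from its parent; its children are removed with it.
nodes.forEach(function (n) {
  n.parentNode.removeChild(n);
});

// Serialize the remaining tree back to an XML string.
var xmlOut = serializer.serializeToString(root);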

However, dealing with namespaces, multiple XPath expressions, and indentation preservation is a struggle. Therefore, I created an npm module, filterxml, to do the heavy lifting:

var filterxml = require('filterxml');
var patterns = ['//book'];
var namespaces = {};
filterxml(xmlIn, patterns, namespaces, function (err, xmlOut) {
  console.log(xmlOut);
});

This will output:

<bookstore><essay>Reflections on Writing</essay></bookstore>
