
I work on a node.js application for processing and loading large amounts of geospatial data from files into a JSON document database.

The source data consists of large XML documents (up to tens of GB). I use sax.js to parse them, which gives me JavaScript objects that mirror the XML structure:

{ name: 'gml:featureMember',
  attributes: {},
  isSelfClosing: false,
  parent: null,
  children: 
   [ '\r\n        ',
     { name: 'AX_BesondereFlurstuecksgrenze',
       attributes: { 'gml:id': 'DEHHALKAn0007s8z' },
       isSelfClosing: false,
       children: 
        [ '\r\n          ',
          { name: 'gml:identifier',
            attributes: { codeSpace: 'http://...' },
            isSelfClosing: false,
            children: [ 'urn:adv:oid:...' ] },
          '\r\n          ',
          { name: 'lebenszeitintervall',
            attributes: {},
            isSelfClosing: false,
            children: 
             [ '\r\n            ',
               { name: 'AA_Lebenszeitintervall',
                 attributes: {},
                 isSelfClosing: false,
                 children: 
                  [ '\r\n              ',
                    { name: 'beginnt',
                      attributes: {},
                      isSelfClosing: false,
                      children: [ '2010-03-07T08:32:05Z' ] },
                    '\r\n            ' ] },
               '\r\n          ' ] },
          ...

However, sax.js apparently gives no access to the raw XML of the current fragment. So I am looking for a way to get an XML fragment from sax.js or from a different stream parser. As I am on Windows, I would like to use only modules that don't require compilation.

  • You can try using XPath/XQuery. Commented Jan 11, 2016 at 12:50
  • Is there an XPath/XQuery implementation that is based only on sax.js and doesn't require compilation? I briefly looked at saxtract and others, but they all seem to use libxmljs. Commented Jan 11, 2016 at 13:46
  • To get only the XML fragment, you can use XPath directly in JavaScript. A nice introduction is given here: timkadlec.com/2008/02/xpath-in-javascript-introduction Commented Jan 11, 2016 at 13:54
  • @Jagrut I saw that there is a pure JavaScript implementation of XPath for node.js as well (npmjs.com/package/xpath.js), but it requires a DOM parser. I don't think I can use a DOM parser for XML files of several gigabytes. Commented Jan 11, 2016 at 14:27
  • Alright, I followed the XPath route and was able to solve the issue using npmjs.com/package/saxpath. Memory usage stayed below 70 MB in node while processing a 1.7 GB file, though there were some longer delays (garbage collection?) during processing. Commented Jan 11, 2016 at 14:47

1 Answer


Based on a suggestion by @Jagrut, I searched for an XPath implementation for node.js that works with sax.js and requires neither a DOM nor a native library. I found saxpath, which fits the bill.

Usage is as follows:

var fs = require('fs');
// true enables strict mode in sax.js
var saxParser = require('sax').createStream(true);
var saxPath = require('saxpath');

var dataURL = '../data/ALKIS_FHH_0167.xml';
var count = 0;

parseXML(dataURL);

function parseXML(fileName) {

    var fileStream = fs.createReadStream(fileName);
    // emit a 'match' event for every gml:featureMember element
    var streamer = new saxPath.SaXPath(saxParser, '//gml:featureMember');

    streamer.on('match', function(xml) {
        // xml is the matched fragment as a string
        addFeature(xml);
    });

    fileStream.pipe(saxParser);
}

function addFeature (featureFragment) {
    // for now we just count features...
    if (count % 100 === 0) {
        console.log("Parsing fragment " + count);
    }
    count++;
}

It has a much nicer API than using sax.js directly. The only caveat I have noticed is that parsing sometimes stops for several seconds, probably due to garbage collection. I tested this with XML files of up to 1.7 GB.
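
For comparison, if you have to stay with plain sax.js, one option is to serialize the node objects back into XML yourself. The following is only a minimal sketch, assuming nodes shaped like the dump in the question (name, attributes, children, with text stored as plain strings). escapeXml and serializeNode are hypothetical helper names, not part of sax.js, and namespace declarations survive only if they are present in the attributes.

```javascript
// Escape characters that must not appear verbatim in XML text or attribute values.
function escapeXml(text) {
    return String(text)
        .replace(/&/g, '&amp;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;')
        .replace(/"/g, '&quot;');
}

// Recursively turn a node ({ name, attributes, children }) back into an XML string.
// String entries in `children` are treated as text nodes (assumed node shape).
function serializeNode(node) {
    if (typeof node === 'string') {
        return escapeXml(node);
    }
    var attributes = node.attributes || {};
    var attrs = Object.keys(attributes).map(function (key) {
        return ' ' + key + '="' + escapeXml(attributes[key]) + '"';
    }).join('');
    var children = node.children || [];
    if (children.length === 0) {
        // no children: emit a self-closing tag
        return '<' + node.name + attrs + '/>';
    }
    return '<' + node.name + attrs + '>' +
        children.map(function (child) { return serializeNode(child); }).join('') +
        '</' + node.name + '>';
}

// Example node taken from the structure shown in the question
var feature = {
    name: 'beginnt',
    attributes: {},
    isSelfClosing: false,
    children: ['2010-03-07T08:32:05Z']
};

console.log(serializeNode(feature));
// prints <beginnt>2010-03-07T08:32:05Z</beginnt>
```

This rebuilds the fragment from an in-memory tree, so memory use grows with the size of each feature. With saxpath the fragment arrives as a string and no tree is needed.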


1 Comment

I am not very familiar with saxpath, but instead of loading the entire XML first, could you apply XPath directly to the document and then parse the result? I am not sure, just a thought!
