I have big XML files (between 500MB and 1GB) and I'm trying to filter them so that only the nodes with certain attribute values are kept, in this case Prod_id. I have about 10k Prod_id values to filter on, and the XML currently contains about 60k items.

Currently I'm using XSL with node.js (https://github.com/fiduswriter/xslt-processor), but it's really slow (I have never seen a run finish, even after 30-40 minutes).

Is there a way to speed up this process? XSL is not a requirement; I can use anything.

XML Example:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<products>
    <Product Quality="approved" Name="WL6A6" Title="BeBikes comfort WL6A6" Prod_id="BBKBECOMFORTWL6A6">
        <CategoryFeatureGroup ID="10030">
            <FeatureGroup>
                <Name Value="Dettagli tecnici" langid="5"/>
            </FeatureGroup>
        </CategoryFeatureGroup>
        <Gallery />
    </Product>
    ...
    <Product Quality="approved" Name="WL6A6" Title="BeBikes comfort WL6A6" Prod_id="LAL733">
        <CategoryFeatureGroup ID="10030">
            <FeatureGroup>
                <Name Value="Dettagli tecnici" langid="5"/>
            </FeatureGroup>
        </CategoryFeatureGroup>
        <Gallery />
    </Product>
</products>

XSL I'm using

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>  
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="
         products/Product
         [not(@Prod_id='CEESPPRIVAIPHONE4')]
         ...
         [not(@Prod_id='LAL733')]"
   />
</xsl:stylesheet>

Thanks

  • Do you want to do that with node.js? Or any tool/programming language/platform? Commented Jan 24, 2020 at 15:48
  • Any free tool/language/platform is ok Commented Jan 24, 2020 at 15:58
  • Given that you know the structure and simply want to read forwards through the file to identify the Product elements to keep or drop, XmlReader- or SAX-based code might help; for Python, a similar problem is answered in stackoverflow.com/a/42411493/252228. XSLT can do it too, but forwards-only (not tree-based) processing is only available in XSLT 3 with streaming, for which you would need Saxon EE (there is a trial license). With normal XSLT 1 or 2 and "free" processors you could check whether a key speeds things up, though the processor you have chosen doesn't seem to support keys. Commented Jan 24, 2020 at 17:09
  • Saxon for node.js is not yet available, but hopefully it's a matter of a few weeks now. It won't offer streaming, so you will still need a lot of memory for a document this large. If you need a streaming XSLT processor, you will have to call out to Java, e.g. via an HTTP request. Commented Jan 24, 2020 at 17:42
  • The SAX approach suggested by @MartinHonnen would be preferred in your case. Commented Jan 24, 2020 at 21:07

1 Answer

I solved it using an approach similar to this answer: https://stackoverflow.com/a/13851518/1152049

Thanks

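import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Set;
import javax.xml.transform.Source;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;
import org.xml.sax.helpers.XMLReaderFactory;

// Streams the input through a SAX filter and writes everything except the
// excluded Product elements (and their content) to output.xml.
private static void filter(InputStream fileInputStream, final Set<String> prodIdToExclude) throws SAXException, TransformerException, IOException {
        // Note: XMLReaderFactory is deprecated since Java 9; SAXParserFactory is the usual replacement.
        XMLReader xr = new XMLFilterImpl(XMLReaderFactory.createXMLReader()) {
            // True while the parser is inside a Product element that is being dropped.
            private boolean skip;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts)
                    throws SAXException {
                if (qName.equals("Product")) {
                    String prodId = atts.getValue("Prod_id");
                    // A Product without a Prod_id attribute is kept.
                    skip = prodId != null && prodIdToExclude.contains(prodId);
                }
                if (!skip) {
                    super.startElement(uri, localName, qName, atts);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) throws SAXException {
                if (!skip) {
                    super.endElement(uri, localName, qName);
                } else if (qName.equals("Product")) {
                    // The dropped Product is closed: resume copying so that the text
                    // after it and the closing root tag are still written.
                    skip = false;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) throws SAXException {
                if (!skip) {
                    super.characters(ch, start, length);
                }
            }
        };
        Source src = new SAXSource(xr, new InputSource(fileInputStream));
        // try-with-resources closes output.xml even if the transformation fails.
        try (OutputStream out = new FileOutputStream("output.xml")) {
            TransformerFactory.newInstance().newTransformer().transform(src, new StreamResult(out));
        }
    }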
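
The question asks to keep only the listed Prod_id values, while the method above drops the listed ones. The following is a minimal sketch of the inverted variant, not a definitive implementation. It assumes Product elements are direct children of the root as in the example XML, and that a Product without a Prod_id attribute should be dropped. The names KeepListFilter, keepOnly, keepIds and skipDepth are made up for illustration. It uses a depth counter instead of a boolean, so copying resumes as soon as the dropped Product closes.

```java
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

class KeepListFilter {

    /** Copies the XML from in to out, keeping only Product elements whose Prod_id is in keepIds. */
    static void keepOnly(Reader in, Writer out, Set<String> keepIds) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader parent = factory.newSAXParser().getXMLReader();

        XMLReader filter = new XMLFilterImpl(parent) {
            // Nesting depth inside a dropped Product; 0 means "currently copying".
            private int skipDepth = 0;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts)
                    throws SAXException {
                if (skipDepth > 0) {
                    skipDepth++; // descendant of a dropped Product
                    return;
                }
                if ("Product".equals(qName)) {
                    String id = atts.getValue("Prod_id");
                    // Assumption: a Product without Prod_id is dropped as well.
                    if (id == null || !keepIds.contains(id)) {
                        skipDepth = 1;
                        return;
                    }
                }
                super.startElement(uri, localName, qName, atts);
            }

            @Override
            public void endElement(String uri, String localName, String qName) throws SAXException {
                if (skipDepth > 0) {
                    skipDepth--; // reaches 0 when the dropped Product itself closes
                    return;
                }
                super.endElement(uri, localName, qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) throws SAXException {
                if (skipDepth == 0) {
                    super.characters(ch, start, length);
                }
            }
        };

        // The identity transformer serializes the filtered SAX events as they arrive.
        TransformerFactory.newInstance().newTransformer()
                .transform(new SAXSource(filter, new InputSource(in)), new StreamResult(out));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<products><Product Prod_id=\"LAL733\"><Gallery/></Product>"
                + "<Product Prod_id=\"OTHER\"><Gallery/></Product></products>";
        StringWriter out = new StringWriter();
        keepOnly(new StringReader(xml), out, Set.of("LAL733"));
        System.out.println(out);
    }
}
```

For the real files, the Reader and Writer would wrap file streams, and the roughly 10k ids would be loaded into a HashSet once so that each lookup stays constant-time.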