XML XSLT Stream large xml file with SAXON EE10.6

Question

I have to import large XML files (>5Gb) into SOLR. I want to transform each XML file first with SAXON EE10.6 and a streaming XSL stylesheet. I have read that this should be possible with SAXON EE10.6, but I get the following error:

Error on line 20 column 34 of mytest.xsl: XTSE3430 Template rule is not streamable

There is more than one consuming operand: {<field {(attr{name=...}, ...)}/>} on line 21, and {xsl:apply-templates} on line 27
The result of the template rule can contain streamed nodes
Template rule is not streamable
There is more than one consuming operand: {<field {(attr{name=...}, ...)}/>} on line 21, and {xsl:apply-templates} on line 27
The result of the template rule can contain streamed nodes

I am not familiar with streaming XSLT and Saxon. How do I get my XSLT right for streaming so that it outputs the needed Solr add-document XML?

I have a fiddle here with a simplified version of my xml and the xslt I use: https://xsltfiddle.liberty-development.net/asoTKU

It works well for smaller XML files (<1Gb).

Start with saxonica.com/html/documentation10/sourcedocs/streaming and try to learn. Also explain what your stylesheet is trying to achieve and show the relevant parts in the post. In general, the easiest way to make two downward selections is to switch to a non-streamable mode that processes a copy-of() of a streamed node that is "small" enough (e.g. perhaps a Property element) to be materialized with all its children/descendants. But don't expect us to understand or guess why you match on node() where you seem to have a clear intention to process an element node, for instance.
– Martin Honnen, Commented Sep 21, 2021 at 18:27
If you are desperate, another option is to use xsl:fork to have two branches of downward selection; the processor then needs to find a buffering strategy to e.g. collect all child values of a category while also processing them separately. But there is no single approach that magically makes your code streamable; you will need to invest some time in understanding the limitations of streaming: forwards-only parsing, "buffering" only the current node (e.g. an element node with its attributes, or a comment, or a text node), and maintaining some ancestor hierarchy but not the sibling hierarchy.
– Martin Honnen, Commented Sep 21, 2021 at 18:37

Michael Kay · Accepted Answer · 2021-09-21 22:08:17Z

The rules for XSLT 3.0 streaming are incredibly complicated, and it doesn't help that there are few tutorial introductions. One extremely useful resource is Abel Braaksma's talk at XML Prague 2014: there's a transcript and a link to the YouTube recording at https://www.xfront.com/Transcript-of-Abel-Braaksma-talk-on-XSLT-Streaming-at-XML-Prague-2014.pdf

The most important rule to remember is: a template rule can only make one downward selection (it only gets one chance to scan the descendant tree). That's the rule you've broken when you wrote:

<xsl:template match="node()">
   <xsl:element name="field">
      <xsl:attribute name="name">
        <xsl:value-of select="local-name()"/>
      </xsl:attribute>
      <xsl:value-of select="."/>
   </xsl:element>
   <xsl:apply-templates select="*"/>
</xsl:template>

Actually, that code could be simplified to

<xsl:template match="node()">
   <field name="{local-name()}">{.}</field>
   <xsl:apply-templates select="*"/>
</xsl:template>

But this wouldn't affect the streamability: you're processing the descendants of the matched node twice, once to get the string value (.), and once to apply-templates to the children.

Now, it looks to me as if this template rule is only being used to process "leaf elements", that is, elements that have a text node child but no child elements. If that's the case, then the <xsl:apply-templates select="*"/> never selects anything: it's redundant and it can be removed, which makes the rule streamable.

There's another error message you're getting, which is that the template rule can return streamed nodes. The reason it's not permitted to return streamed nodes is a bit more subtle; it basically makes it impossible for the processor to do the data flow analysis to prove whether or not streaming is feasible. But it's again the <xsl:apply-templates select="*"/> that's the cause of the problem and getting rid of it fixes things.
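
With the redundant xsl:apply-templates removed, the rule might look like the following. This is a minimal sketch rather than the original stylesheet. It assumes expand-text="yes" is set on xsl:stylesheet and that the rule only ever matches leaf elements. It also uses match="*" instead of node(), on the assumption that only elements are meant.

```xml
<!-- Sketch: assumes expand-text="yes" and that only leaf elements reach this rule. -->
<xsl:template match="*">
  <!-- local-name() does not read the subtree; {.} is the single downward selection. -->
  <field name="{local-name()}">{.}</field>
</xsl:template>
```

The string value is now the only consuming operand, and the rule returns a newly constructed field element rather than streamed nodes, which should address both messages.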

Your next problem is with the template rule for Property elements. You've written this as

   <xsl:template match="Property">
        <xsl:element name="field">
            <xsl:attribute name="name">
               <xsl:value-of select="key"/>_s</xsl:attribute>
            <xsl:value-of select="value"/>
        </xsl:element>
        <xsl:apply-templates select="Property"/>
    </xsl:template>

and it simplifies to:

<xsl:template match="Property">
    <field name="{key}_s">{value}</field>
    <xsl:apply-templates select="Property"/>
</xsl:template>

This is making three downward selections: child::key, child::value, and child::Property. In your data sample, no Property element has a child called Property, so perhaps the <xsl:apply-templates/> is again redundant. For key and value one useful trick is to read them into a map:

<xsl:template match="Property">
    <xsl:variable name="pair" as="map(*)">
      <xsl:map>
        <xsl:map-entry key="'key'" select="string(key)"/>
        <xsl:map-entry key="'value'" select="string(value)"/>
      </xsl:map>
    </xsl:variable>
    <field name="{$pair?key}_s">{$pair?value}</field>
</xsl:template>

The reason this works is that xsl:map (like xsl:fork) is an exception to the "one downward selection" rule - the map can be built up in a single pass of the input. By calling string(), we're careful not to put any streamed nodes into the map, so the data we need later has been captured in the map and we don't ever need to go back to the streamed input document to read it a second time.

I hope this gives you a feel for the way forward. Streaming in XSLT is not for the faint-hearted, but if you've got >5Gb input documents then you don't have many options open.

Thank you, I will try your suggestions. Property is not nested; that was my mistake in the XSL. Another way would be to split the XML file into smaller files and skip the streaming part.
Yes, that's a common design approach if you're able to transform each "record" in your input (whatever a "record" is) independently of all the others. Write a streaming mode with a template rule matching the "records" in your source; this template rule does <xsl:apply-templates select="copy-of(.)" mode="ns"/> where ns is a non-streaming mode that processes each "record" independently.
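
A minimal sketch of that design follows. It assumes, as in the sample data, an Items wrapper whose Item children are the "records". The mode name ns comes from the comment above, and the ns-mode rules are placeholders for whatever non-streaming logic each record actually needs.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    expand-text="yes">

  <!-- The default (unnamed) mode streams the input document. -->
  <xsl:mode streamable="yes"/>

  <!-- "ns" is an ordinary, non-streaming mode used for in-memory copies. -->
  <xsl:mode name="ns"/>

  <!-- Assumed wrapper element: one downward selection to the records. -->
  <xsl:template match="/Items">
    <add>
      <xsl:apply-templates select="Item"/>
    </add>
  </xsl:template>

  <!-- Each record is materialized with copy-of(.) and handed to mode "ns". -->
  <xsl:template match="Item">
    <xsl:apply-templates select="copy-of(.)" mode="ns"/>
  </xsl:template>

  <!-- Placeholder rules: in mode "ns" the streaming restrictions no longer apply. -->
  <xsl:template match="Item" mode="ns">
    <doc>
      <xsl:apply-templates select="*" mode="ns"/>
    </doc>
  </xsl:template>

  <xsl:template match="*" mode="ns">
    <field name="{local-name()}">{.}</field>
  </xsl:template>

</xsl:stylesheet>
```

Memory use then depends on the size of the largest single record rather than on the size of the whole document.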

Martin Honnen · Accepted Answer · 2021-09-21 19:34:43Z

Assuming your Properties and Category elements are "small" enough to be buffered, I guess you could use:

<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" expand-text="yes">

  <xsl:output method="xml" encoding="utf-8" indent="yes" />
  
  <xsl:strip-space elements="*"/>
  
  <xsl:mode streamable="yes" on-no-match="shallow-skip"/>
  
  <xsl:mode name="grounded"/>
  
  <xsl:template match="Properties | Category">
    <xsl:apply-templates select="copy-of()" mode="grounded"/>
  </xsl:template>
  
  <xsl:template match="Category" mode="grounded">
    <field name="Category">{.}</field>
    <xsl:apply-templates mode="#current"/>
  </xsl:template>
  
  <xsl:template match="Properties" mode="grounded">
    <field name="Properties">{.}</field>
    <xsl:apply-templates mode="#current"/>
  </xsl:template>
  
  <xsl:template match="Category/*" mode="grounded">
    <field name="CAT_{local-name()}_s">{.}</field>
  </xsl:template>

  <xsl:template match="Property" mode="grounded">
    <field name="{key}_s">{value}</field>
  </xsl:template>

  <xsl:template match="Item/*[not(self::Category | self::Properties)]">
    <field name="{local-name()}">{.}</field>
  </xsl:template>

  <xsl:template match='/Items'>
    <add>
      <xsl:apply-templates select="Item"/>
    </add>
  </xsl:template>

  <xsl:template match="Item">
    <xsl:variable name="pos" select="position()"/>
    <doc>
      <xsl:apply-templates>
        <xsl:with-param name="pos"><xsl:value-of select="$pos"/></xsl:with-param>
      </xsl:apply-templates>
    </doc>
  </xsl:template>

</xsl:stylesheet>

But your code (doing <xsl:apply-templates select="Property"/> in <xsl:template match="Property">) suggests that Property elements may be recursively nested. With arbitrary nesting, that could cause memory problems if the code attempts, as above, to buffer the first Property it encounters in memory using copy-of().

Your sample XML, however, doesn't have any nested Property elements.

Part of the xsl:fork strategy I commented on is used in:

<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" expand-text="yes">

  <xsl:output method="xml" encoding="utf-8" indent="yes" />
  
  <xsl:strip-space elements="*"/>
  
  <xsl:mode streamable="yes"/>
  
  <xsl:mode name="text" streamable="yes"/>
  
  <xsl:mode name="grounded"/>
  
  <xsl:template match="Category">
    <xsl:apply-templates select="copy-of()" mode="grounded"/>
  </xsl:template>
  
  <xsl:template match="Properties">
    <xsl:fork>
      <xsl:sequence>
        <field name="Properties">
          <xsl:apply-templates mode="text"/>
        </field>
      </xsl:sequence>
      <xsl:sequence>
        <xsl:apply-templates/>
      </xsl:sequence>
    </xsl:fork>
  </xsl:template>
  
  <xsl:template match="Category" mode="grounded">
    <field name="Category">{.}</field>
    <xsl:apply-templates mode="#current"/>
  </xsl:template>
  
  <xsl:template match="Category/*" mode="grounded">
    <field name="CAT_{local-name()}_s">{.}</field>
  </xsl:template>
  
  <xsl:template match="Property">
    <xsl:apply-templates select="copy-of()" mode="grounded"/>
  </xsl:template>

  <xsl:template match="Property" mode="grounded">
    <field name="{key}_s">{value}</field>
  </xsl:template>

  <xsl:template match="Item/*[not(self::Category | self::Properties)]">
    <field name="{local-name()}">{.}</field>
  </xsl:template>

  <xsl:template match='/Items'>
    <add>
      <xsl:apply-templates select="Item"/>
    </add>
  </xsl:template>

  <xsl:template match="Item">
    <xsl:variable name="pos" select="position()"/>
    <doc>
      <xsl:apply-templates>
        <xsl:with-param name="pos"><xsl:value-of select="$pos"/></xsl:with-param>
      </xsl:apply-templates>
    </doc>
  </xsl:template>

</xsl:stylesheet>

That avoids explicitly constructing "a tree" for each Properties element but I have no idea what strategies Saxon applies to make sure both branches of the xsl:fork have access to the child or descendant contents.

As regards xsl:fork, all branches (prongs?) of the fork are notified of input events as they occur, effectively in parallel (though it all happens in a single thread). The thing you need to be aware of is that the output of the various prongs is buffered so it can be assembled in the right order. So xsl:fork works well when the input is large but the output is small.
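
To illustrate the "large input, small output" case, here is a sketch that is not from the thread: it computes two counts in a single pass over the stream. The input element names follow the sample data. The summary, items and properties output elements are invented for the example. A streamable default mode and expand-text="yes" are assumed.

```xml
<!-- Sketch: assumes <xsl:mode streamable="yes"/> and expand-text="yes". -->
<xsl:template match="/Items">
  <summary>
    <xsl:fork>
      <!-- Prong 1: counts the records as they stream past. -->
      <xsl:sequence>
        <items>{count(Item)}</items>
      </xsl:sequence>
      <!-- Prong 2: counts the properties during the same pass. -->
      <xsl:sequence>
        <properties>{count(Item/Properties/Property)}</properties>
      </xsl:sequence>
    </xsl:fork>
  </summary>
</xsl:template>
```

Each prong buffers only a single small element, so the buffering cost stays negligible however large the input is.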
Property is not nested; it's incorrect in the XSL. I will try your suggestions to get the streaming right. Thanks.
@MarcoDuindam, did one of the suggestions work out against the 5 GB input?

Marco Duindam · Accepted Answer · 2021-09-24 06:51:04Z

The given XSL solutions worked on the simplified version. However, I did not get them to work on the big (>5Gb) file in the full XML format. I solved it by splitting the XML files into files of about 1Gb each and then running the XSL without streaming.

And if somebody wants a challenge, contact me privately ;)

Martin Honnen Over a year ago

Well, the interesting part of your solution would still be how exactly you managed to split the 5GB document into smaller ones? Did you do that using SAX or Stax or how?

Marco Duindam · Accepted Answer · 2021-09-28 07:05:05Z

My XML files have a line feed after every item. So I created a simple console app that splits the file every 500,000 lines, removes null characters, and transforms the result with the XSL:

cleanxml.exe items.xml temp-items-solr.xml import.xsl

        static void Main(string[] args)
        {
            string line;

            XslCompiledTransform xsltTransform = new XslCompiledTransform();
            xsltTransform.Load(@args[2]);

            string fileToWriteTo = args[1];
            StreamWriter writer = new StreamWriter(fileToWriteTo);
            StreamReader file = new System.IO.StreamReader(@args[0]);

            string fileOriginal = @args[1];
            string firstLine = "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?><Items>";

            int i = 0;
            int j = 1;
            while ((line = file.ReadLine()) != null)
            {

                writer.WriteLine(CleanInvalidXmlChars(line)); 

                if(i > 500000)
                {
                    writer.WriteLine("</Items>"); 
                    writer.Flush();
                    writer.Dispose();

                    xsltTransform.Transform(fileToWriteTo, fileToWriteTo.Replace("temp-",""));

                    System.IO.File.Delete(fileToWriteTo);
                    fileToWriteTo = fileOriginal.Replace(".xml", "-" + j.ToString() + ".xml");
                    writer = new StreamWriter(fileToWriteTo);
                    writer.WriteLine(firstLine);

                    i = 0;
                    j += 1;
                }
                i += 1;
            }

            writer.Flush();
            writer.Dispose();

            xsltTransform.Transform(fileToWriteTo, fileToWriteTo.Replace("temp-", ""));
            System.IO.File.Delete(fileToWriteTo);

            file.Close();
        }


        private static MemoryStream ApplyXSLT(string xmlInput, string xsltFilePath)
        {
            XmlDocument xmlDocument = new XmlDocument();
            xmlDocument.LoadXml(xmlInput);

            XslCompiledTransform xsltTransform = new XslCompiledTransform();
            xsltTransform.Load(xsltFilePath);

            MemoryStream memoreStream = new MemoryStream();
            xsltTransform.Transform(xmlDocument, null, memoreStream);
            memoreStream.Position = 0;

            return memoreStream;
        }


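        public static string CleanInvalidXmlChars(string text)
        {
            // Remove characters that are not legal in XML 1.0. In .NET regexes \x takes exactly
            // two hex digits, so code points above U+00FF need \uXXXX. Supplementary characters
            // are stored as surrogate pairs, so only unpaired surrogates are removed.
            string re = @"[\x00-\x08\x0B\x0C\x0E-\x1F\uFFFE\uFFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]";
            return Regex.Replace(text, re, "");
        }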
