How to preserve CDATA when using XSLT for "identity transform" with lxml in python?

Question

Here's my use case: I regularly receive a large (GB+ size) XML file from a customer, and all the XML is contained in a single line, without a single cr/lf in the file. What this means is that if there is a data issue in the XML that requires manual investigation, opening it for reading becomes problematic for all the tools we've tried, which apparently try to read entire lines at a time.

So I wrote the following code in Python 2.7 to apply the "identity transform" using XSLT, and pretty print the result to a file, thus inserting cr/lf's at the appropriate locations. This solves the problem of being able to open the file. The only issue is, it strips CDATA tags from the output, even though I've included the directive to preserve CDATA ("strip_cdata=False"). It also appears to create escaped versions of HTML fragments contained within the CDATA sections, i.e. replacing < with "&lt;".

It's important, from a troubleshooting perspective, that the ONLY changes to the content are the addition of cr/lf's in the logical places in the XML. How can I modify the code to make that happen? Is it even possible using lxml?

Here's the current code:

from lxml import etree
import sys
import re
from datetime import datetime

start_time = datetime.now()

# get input file
infile = sys.argv[1]
outfile = infile[0:infile.rindex(".")]+".trns.xml"

# get XSLT file, if it exists, else use identity transform
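xsl = ''
if len(sys.argv) > 2:
    # read the stylesheet source so that etree.XML() below can parse it
    # (etree.XML() expects a string, not an already parsed tree)
    xsl = open(sys.argv[2], 'rb').read()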
else:
    xsl =\
'<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">\
    <xsl:template match="node()|@*">\
      <xsl:copy>\
        <xsl:apply-templates select="node()|@*"/>\
      </xsl:copy>\
    </xsl:template>\
</xsl:stylesheet>'
xslt = etree.XML(xsl)
transform_function = etree.XSLT(xslt)

# transform
parser = etree.XMLParser(huge_tree=True, strip_cdata=False)
transformed = transform_function(etree.parse(infile, parser))

# write to output
open(outfile, 'w').write(etree.tostring(transformed, pretty_print=True))

# display run time
time = datetime.now() - start_time
reg3 = re.compile("\\d+:\\d(\\d:\\d+\\.\\d{4})")
time = re.search(reg3, unicode(time))
time = "Runtime: %ss" % (time.group(1).encode("utf-8"))
print(time)

Thanks for the comments. Per "XSLT's Identity transforms already pretty prints", perhaps there's an XSLT option I'm not using, but I can confirm, at least in this case, that if I remove pretty_print=True from the tostring() method, I do NOT get pretty-printed XML.
– George, Commented Feb 26, 2018 at 22:33
Also, I'll take your word for it that XSLT cannot preserve CData, though it seems ridiculous for that to be the case. I will probably end up writing a utility in Java to read the file as a stream, and write to stream with appropriate indents.
– George, Commented Feb 26, 2018 at 22:39
In case anyone else looks at this code in the future, prettifying during the XSLT transformation does work, but this element needs to be added to the XSL: <xsl:output method="xml" indent="yes"/>. Also, I noticed only an extremely small increase in performance (<2%) when removing pretty_print=True from the tostring() method.
– George, Commented Feb 26, 2018 at 23:41
XSLT takes the view that CDATA is merely an input convenience, and that <a><![CDATA[<]]></a> and <a>&lt;</a> are just different ways of inputting the same data; the user of the data shouldn't care about the detailed keystrokes used to input it. Unfortunately XML doesn't define a standard data model, but the model used by XPath and XSLT is pretty widely accepted.
– Michael Kay, Commented Feb 26, 2018 at 23:48
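
Michael Kay's point about the data model can be checked with a small sketch. It uses the standard-library xml.etree.ElementTree rather than lxml, purely so that it runs without extra packages; the variable names are illustrative. Under that assumption, both spellings of the element parse to the same text, and the CDATA markup does not survive a parse and re-serialize round trip.

```python
import xml.etree.ElementTree as ET

# Two serializations of the same character data: one uses a CDATA
# section, the other an escaped character.
with_cdata = ET.fromstring("<a><![CDATA[<]]></a>")
with_escape = ET.fromstring("<a>&lt;</a>")

# The parser reports the same text value for both elements.
print(with_cdata.text == with_escape.text)

# Re-serializing produces the same bytes for both, because the tree
# does not record which spelling was used in the input.
print(ET.tostring(with_cdata) == ET.tostring(with_escape))
```

This is why a tool that must keep the file byte-for-byte identical apart from line breaks probably has to work on the raw text rather than on a parsed tree.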

TextGeek · Accepted Answer · 2018-02-26 23:19:33Z

Any time you pass XML through a parser, you can expect changes to the literal output, because there is lots of markup detail that a parser doesn't/shouldn't report.

The most obvious examples are whitespace inside of markup, the order of attributes, and the kind of quotes around the attributes. A SAX parser, for example, typically hands back the element type as a string, and the attributes as a dict or array of strings.

Since you really need the literal physical file unchanged except for breaking the lines (I had the same problem with an airplane repair manual and a very big poetry database, a long time ago in a galaxy fairly far away), how about just inserting a newline before every literal "<", using sed or a 3-line Python program?
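
A minimal sketch of the small Python program suggested above, not a tested production tool. It assumes the file uses an ASCII-compatible encoding such as UTF-8, so that the byte 0x3C always means "<"; the function name insert_newlines and the chunk_size default are made up for illustration. It reads fixed-size chunks, so the whole file is never held in memory.

```python
import io


def insert_newlines(src, dst, chunk_size=1 << 20):
    """Copy src to dst, writing a newline before every '<' byte.

    src and dst are binary file objects. The file is processed one
    chunk at a time; since '<' is a single byte, a match can never be
    split across two chunks.
    """
    first = True
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        chunk = chunk.replace(b"<", b"\n<")
        if first and chunk.startswith(b"\n"):
            # Keep the XML declaration at the very start of the file.
            chunk = chunk[1:]
        first = False
        dst.write(chunk)


# Small in-memory demonstration with a deliberately tiny chunk size.
sample = b'<?xml version="1.0"?><a><b>text</b></a>'
result = io.BytesIO()
insert_newlines(io.BytesIO(sample), result, chunk_size=8)
print(result.getvalue().decode("ascii"))

# For a real file, something like:
#   with open(infile, "rb") as src, open(outfile, "wb") as dst:
#       insert_newlines(src, dst)
```

Note that this also breaks the line before any "<" that appears inside a CDATA section or comment, so HTML fragments inside CDATA gain line breaks too. If the original file contains no newlines at all, replacing every "\n<" with "<" in the output recovers the original bytes, which gives a simple way to confirm that nothing else changed.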

This is not a bad idea, though I wonder if the size of my files will break any of these tools. For example, xmllint does exactly what I need with the --format flag, but it can't handle the file sizes I'm dealing with. In any event, I've already written a formatter using streaming file IO. It works great, but needs some performance tuning in the read block size. When I'm happy with it I'll make a Medium post on it including source code, and will link it here.
– George

Parfait · Accepted Answer · 2018-02-27 03:25:35Z

Consider running two XSLT transformations. The first builds a second XSLT stylesheet that lists every element containing non-whitespace text in cdata-section-elements and includes a scripted identity transform. The second run applies that generated stylesheet to the original source, so that every text node is written out as a CDATA section.

The XSLT script is borrowed from @DimitreNovatchev's answer (linked in the credit comment in the code below). With Python, you can pass the result of the first transformation directly into the second, all in memory, without saving anything to disk. See the demo below, which uses the top Stack Overflow users for XSLT and Python:

Input XML (no indent or line breaks)

<?xml version="1.0"?><stackoverflow>  <group lang="python"><topusers>
<user>Martijn Pieters</user>  <link>https://stackoverflow.com/users
/100297/martijn-pieters</link>  <location>Cambridge, United Kingdom 
</location>  <year_rep>14,102</year_rep>  <total_rep>624,972</total_rep>  
<tag1>python</tag1>  <tag2>python-3.x</tag2>  <tag3>python-2.7</tag3>
</topusers><topusers>  <user>Alex Martelli</user>  
<link>https://stackoverflow.com/users/95810/alex-martelli</link>  
<location>Sunnyvale, CA</location>  <year_rep>10,292</year_rep>  
<total_rep>565,346</total_rep>  <tag1>python</tag1>  <tag2>list</tag2>  
<tag3>c++</tag3></topusers><topusers>  <user>unutbu</user>  
<link>https://stackoverflow.com/users/190597/unutbu</link>  <location/>  
<year_rep>11,788</year_rep>  <total_rep>482,061</total_rep>  
<tag1>python</tag1>  <tag2>pandas</tag2>  <tag3>numpy</tag3></topusers>  
</group>  <group lang="xslt"><topusers>  <user>Dimitre Novatchev</user>  
<link>https://stackoverflow.com/users/36305/dimitre-novatchev</link>  
<location>United States</location>  <year_rep>2,028</year_rep>  
<total_rep>201,945</total_rep>  <tag1>xslt</tag1>  <tag2>xml</tag2>  
<tag3>xpath</tag3></topusers><topusers>  <user>Martin Honnen</user>  
<link>https://stackoverflow.com/users/252228/martin-honnen</link>  
<location>Germany</location>  <year_rep>2,463</year_rep>  
<total_rep>99,292</total_rep>  <tag1>xslt</tag1>  <tag2>xml</tag2>  
<tag3>xpath</tag3></topusers><topusers>  <user>Michael Kay</user>  
<link>https://stackoverflow.com/users/415448/michael-kay</link>  
<location>Reading, United Kingdom </location> <year_rep>2,256</year_rep>  
<total_rep>97,620</total_rep>  <tag1>xml</tag1>  <tag2>xslt</tag2> 
<tag3>xpath</tag3></topusers>  </group></stackoverflow>

Python (no pretty_print or tostring needed)

from lxml import etree
import sys
import re
from datetime import datetime

start_time = datetime.now()

# get input file
infile = sys.argv[1]
outfile = infile[0:infile.rindex(".")]+".trns.xml"

# get XSLT file, if it exists, else use identity transform
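if len(sys.argv) > 2:
    # read a user-supplied stylesheet as bytes so etree.XML() below can parse it;
    # note it must itself generate an XSLT stylesheet for transform 2 to work
    with open(sys.argv[2], 'rb') as xsl_file:
        xslstr = xsl_file.read()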
else:
    # CREDIT: Dimitre Novatchev - https://stackoverflow.com/a/15697496/1422451
    xslstr ='''<xsl:stylesheet version="1.0" xmlns:x="http://www.w3.org/1999/XSL/Transform"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xxx="xxx">
                 <xsl:namespace-alias stylesheet-prefix="xxx" result-prefix="xsl"/>
                 <xsl:output omit-xml-declaration="yes" indent="yes"/>
                 <xsl:strip-space elements="*"/>

                 <xsl:key name="kElemByName" match="*[text()[normalize-space()]]" use="name()"/>

                 <xsl:variable name="vDistinctNamedElems" select=
                 "//*[generate-id()=generate-id(key('kElemByName',name())[1])]"/>

                 <xsl:variable name="vDistinctNames">
                  <xsl:for-each select="$vDistinctNamedElems">
                   <xsl:value-of select="concat(name(), ' ')"/>
                  </xsl:for-each>
                 </xsl:variable>

                 <xsl:template match="node()|@*">
                  <xxx:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
                    <xxx:output omit-xml-declaration="yes" indent="yes"
                       cdata-section-elements="{$vDistinctNames}"/>
                    <xxx:strip-space elements="*"/>

                    <xxx:template match="node()|@*">
                     <xxx:copy>
                       <xxx:apply-templates select="node()|@*"/>
                     </xxx:copy>
                    </xxx:template>

                  </xxx:stylesheet>
                 </xsl:template>
                </xsl:stylesheet>'''

parser = etree.XMLParser(huge_tree=True, strip_cdata=False)

# transform 1: build a new xslt script with cdata elems defined
xslt = etree.XML(xslstr)
transform_function = etree.XSLT(xslt)
transformed_1 = transform_function(etree.parse(infile, parser))

# transform 2: modify source with new xslt
transform_function = etree.XSLT(transformed_1)
transformed_2 = transform_function(etree.parse(infile, parser))

# write to output
with open(outfile, 'wb') as f:
   f.write(transformed_2)

Output XML (where commenter above, Michael Kay, is included)

<stackoverflow>
  <group lang="python">
    <topusers>
      <user><![CDATA[Martijn Pieters]]></user>
      <link><![CDATA[https://stackoverflow.com/users/100297/martijn-pieters]]></link>
      <location><![CDATA[Cambridge, United Kingdom ]]></location>
      <year_rep><![CDATA[14,102]]></year_rep>
      <total_rep><![CDATA[624,972]]></total_rep>
      <tag1><![CDATA[python]]></tag1>
      <tag2><![CDATA[python-3.x]]></tag2>
      <tag3><![CDATA[python-2.7]]></tag3>
    </topusers>
    <topusers>
      <user><![CDATA[Alex Martelli]]></user>
      <link><![CDATA[https://stackoverflow.com/users/95810/alex-martelli]]></link>
      <location><![CDATA[Sunnyvale, CA]]></location>
      <year_rep><![CDATA[10,292]]></year_rep>
      <total_rep><![CDATA[565,346]]></total_rep>
      <tag1><![CDATA[python]]></tag1>
      <tag2><![CDATA[list]]></tag2>
      <tag3><![CDATA[c++]]></tag3>
    </topusers>
    <topusers>
      <user><![CDATA[unutbu]]></user>
      <link><![CDATA[https://stackoverflow.com/users/190597/unutbu]]></link>
      <location/>
      <year_rep><![CDATA[11,788]]></year_rep>
      <total_rep><![CDATA[482,061]]></total_rep>
      <tag1><![CDATA[python]]></tag1>
      <tag2><![CDATA[pandas]]></tag2>
      <tag3><![CDATA[numpy]]></tag3>
    </topusers>
  </group>
  <group lang="xslt">
    <topusers>
      <user><![CDATA[Dimitre Novatchev]]></user>
      <link><![CDATA[https://stackoverflow.com/users/36305/dimitre-novatchev]]></link>
      <location><![CDATA[United States]]></location>
      <year_rep><![CDATA[2,028]]></year_rep>
      <total_rep><![CDATA[201,945]]></total_rep>
      <tag1><![CDATA[xslt]]></tag1>
      <tag2><![CDATA[xml]]></tag2>
      <tag3><![CDATA[xpath]]></tag3>
    </topusers>
    <topusers>
      <user><![CDATA[Martin Honnen]]></user>
      <link><![CDATA[https://stackoverflow.com/users/252228/martin-honnen]]></link>
      <location><![CDATA[Germany]]></location>
      <year_rep><![CDATA[2,463]]></year_rep>
      <total_rep><![CDATA[99,292]]></total_rep>
      <tag1><![CDATA[xslt]]></tag1>
      <tag2><![CDATA[xml]]></tag2>
      <tag3><![CDATA[xpath]]></tag3>
    </topusers>
    <topusers>
      <user><![CDATA[Michael Kay]]></user>
      <link><![CDATA[https://stackoverflow.com/users/415448/michael-kay]]></link>
      <location><![CDATA[Reading, United Kingdom ]]></location>
      <year_rep><![CDATA[2,256]]></year_rep>
      <total_rep><![CDATA[97,620]]></total_rep>
      <tag1><![CDATA[xml]]></tag1>
      <tag2><![CDATA[xslt]]></tag2>
      <tag3><![CDATA[xpath]]></tag3>
    </topusers>
  </group>
</stackoverflow>

Thank you for putting this together, however this does not work for me, for the same reason it did not work for the original requestor on issue stackoverflow.com/a/15697496/1422451. The problem is, it wraps elements in CDATA sections that were not originally wrapped in CDATA. In the question you referenced, Dimitre even states: "This cannot be achieved with XSLT or XPath alone. The data model these languages use doesn't allow to distinguish any CDATA sections ..."
As I mentioned here, this wraps every text node with CData in case you do not know the elements in advance. Do you know in advance the elements you want to wrap? Also, CData does not fundamentally change values but avoids escaping illegal entities. You can still parse nodes to return the same exact values. Does the transformed result fail with some API you are using? Or is it just an aesthetic preference?
