2

I need some XSLT (or something - see below) to replace newlines in all attributes with an alternative character.

I am having to process legacy XML which stores all data as attributes, and uses new-lines to express cardinality. For example:

<sample>
    <p att="John
    Paul
    Ringo"></p>
</sample>

These newlines are replaced with spaces when I parse the file in Java (as per the XML spec). However, I want to treat them as list separators, so this behaviour isn't particularly useful.

My 'solution' was to use XSLT to replace all newlines in all attributes with some other delimiter - but I have zero knowledge of XSLT. All examples I've seen thus far have either been very specific or have replaced node content instead of attribute values.

I have dabbled with XSLT 2.0's replace() but am having a hard time putting everything together.

Is XSLT even the correct solution? With the XSLT below:

<xsl:template match="sample/*">
    <xsl:for-each select="@*">
        <xsl:value-of select="replace(current(), '\n', '|')"/>
    </xsl:for-each>
</xsl:template>

Applied to the sample XML, it outputs the following using Saxon:

John Paul Ringo

Obviously this format isn't what I'm after (this is just to experiment with replace()), but have the newlines already been normalised by the time we get to XSLT processing? If so, are there any other ways to parse these values as written using a Java parser? I've only used JAXB thus far.

5
  • I have a very nasty feeling that I may need to don my rubber gloves and implement a filthy regex on the XML string prior to parsing. Unfortunately I have no control over the XML being produced. Commented Jul 2, 2013 at 7:29
  • Actually no, that would be too horrid to consider. Commented Jul 2, 2013 at 7:35
  • If the whitespace within attribute values is semantically significant then you're not dealing with XML, and you'll need to use a non-XML tool to handle it. Per spec all newlines within an attribute value must be converted to spaces by the parser, and if you want a newline character in the value that you see after parsing then it must be escaped as a character reference (&#10;). Commented Jul 2, 2013 at 8:29
  • I don't disagree with you. The XML is exported from an application which will remain nameless. It's not entirely the application's fault, although stuffing all data in attributes is arguably a somewhat dubious approach. I suspect the users have worked around a lack of 1:M cardinality for this particular field by using newlines which the application blindly exported unadulterated to XML. Commented Jul 2, 2013 at 9:35
  • I might do some research into any Java libraries which are designed for dubious XML - this can't be an isolated instance so I'm sure somebody out there has written a deliberately loose / forgiving parser. Commented Jul 2, 2013 at 9:37

3 Answers

2

This seems hard to do. As I found in "Are line breaks in XML attribute values allowed?", a newline character in an attribute value is valid, but the XML parser normalizes it (https://stackoverflow.com/a/8188290/1324394), so it is probably lost before XSLT processing (and thus before any replacement).


6 Comments

I saw that too, but I was hoping that they'd still be there for some XSLT fix-ups. I have since found jdom.org which skirts around the problem by not claiming to be an XML parser, which presumably relieves it of having to comply with the XML spec. Going to give it a shot now...
Just thinking aloud: you could do something like replace(/data/@value, '\s{2,10}', '|'). It is not strictly correct, because it relies on each newline leaving more than one space behind, but it could do the job.
@JirkaŠ. no, that wouldn't work, because the XML parser collapses all consecutive whitespace in attribute values to a single space before the data gets as far as the XPath data model.
I was afraid of that, but I tried it in Altova and it worked. Maybe it is just specific to Altova.
Ah, I see I missed the crucial sentence in the spec: "All attributes for which no declaration has been read SHOULD be treated by a non-validating processor as if declared CDATA." - so if you don't have a DTD the parser will replace newlines with spaces but won't collapse consecutive spaces to a single space.
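
To make the last comment concrete, here is a minimal sketch using the JDK's built-in DOM parser with no DTD. The class and method names are illustrative and not from the thread. It shows that each newline becomes a single space while the indentation spaces survive, so splitting on runs of two or more spaces happens to recover the list. This only works when every continuation line is indented, so treat it as a fragile workaround rather than a fix.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class AttributeNormalizationDemo {

    /** Parses the XML with the JDK DOM parser and returns the named attribute of the first <p>. */
    static String readAttribute(String xml, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Element p = (Element) doc.getElementsByTagName("p").item(0);
            return p.getAttribute(name);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    /** Fragile workaround: treat every run of two or more spaces as a former line break. */
    static List<String> splitOnSpaceRuns(String normalized) {
        return Arrays.asList(normalized.trim().split(" {2,}"));
    }

    public static void main(String[] args) {
        String xml = "<sample>\n    <p att=\"John\n    Paul\n    Ringo\"></p>\n</sample>";
        String value = readAttribute(xml, "att");

        // The newlines are gone by the time application code sees the value.
        System.out.println(value.contains("\n"));    // prints "false"

        // Newline + four indentation spaces became five spaces, which we split on.
        System.out.println(splitOnSpaceRuns(value)); // prints "[John, Paul, Ringo]"
    }
}
```

If a value legitimately contains two consecutive spaces, or a continuation line is not indented, this approach will split in the wrong places.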
1

XSLT only sees the XML after it has been processed by the XML parser, which will have done the attribute value normalization.

I think that some XML parsers have an option to suppress attribute value normalization. If you don't have access to such a parser, I think that doing a textual replace of (\r?\n) by &#x0A; prior to parsing might be your best escape route. Newlines that are escaped in this way don't get splatted by attribute value normalization.
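
As a rough sketch of the textual-replace route, the snippet below escapes every line break as &#10; before handing the string to the JDK DOM parser. The class and method names are made up for illustration. Note the assumption baked into the sample: every newline sits either inside an attribute value or in element content. A blanket replacement like this would produce malformed XML if a newline appeared in the prolog or between attributes inside a tag, because character references are not allowed there.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class EscapeNewlinesBeforeParsing {

    /** Blanket textual replacement of every line break with a character reference. */
    static String escapeNewlines(String rawXml) {
        return rawXml.replaceAll("\r?\n", "&#10;");
    }

    /** Parses the XML and returns the "att" attribute of the first <p> element. */
    static String readAtt(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Element p = (Element) doc.getElementsByTagName("p").item(0);
            return p.getAttribute("att");
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String raw = "<sample>\n    <p att=\"John\n    Paul\n    Ringo\"></p>\n</sample>";

        // Escaped newlines survive attribute value normalization.
        String value = readAtt(escapeNewlines(raw));

        // Split on the preserved newline, discarding the surrounding indentation.
        for (String item : value.split("\\s*\n\\s*")) {
            System.out.println(item);
        }
    }
}
```

For the sample input this prints John, Paul and Ringo on separate lines.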

1 Comment

Thanks Michael. After doing a reasonable amount of digging, I'm coming up with blanks trying to find a Java-based parser which allows for suppression of attribute value normalisation. Textual replacement is difficult as I have no control over the XML being produced. This means that I can't limit the replacement to attribute values.
1

I have solved(ish) the issue by preprocessing the XML with JSoup (which is a nod to @Ian Roberts's comment about parsing the XML with a non-XML tool). JSoup is (or was) designed for HTML documents, but it works well in this context.

My code is as follows:

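import java.io.IOException;
import java.util.List;

import org.apache.commons.io.FileUtils;
import org.apache.commons.lang3.StringUtils; // assumed; commons-lang 2.x offers the same method
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Node;
import org.jsoup.parser.Parser;
import org.junit.Test;

// sourcePath is assumed to be a java.nio.file.Path field pointing at the legacy XML file.
@Test
public void verifyNewlineEscaping() throws IOException {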
    final List<Node> nodes = Parser.parseXmlFragment(FileUtils.readFileToString(sourcePath.toFile(), "UTF-8"), "");

    fixAttributeNewlines(nodes);

    // Reconstruct XML
    StringBuilder output = new StringBuilder();
    for (Node node : nodes) {
        output.append(node.toString());
    }

    // Print cleansed output to stdout
    System.out.println(output);
}

/**
 * Replace newlines and surrounding whitespace in XML attributes with an alternative delimiter in
 * order to avoid whitespace normalisation converting newlines to a single space.
 * 
 * <p>
 * This is useful if newlines which have semantic value have been incorrectly inserted into
 * attribute values.
 * </p>
 * 
 * @param nodes nodes to update
 */
private static void fixAttributeNewlines(final List<Node> nodes) {

    /*
     * Recursively iterate over all attributes in all nodes in the XML document, performing
     * attribute string replacement
     */
    for (final Node node : nodes) {
        final List<Attribute> attributes = node.attributes().asList();

        for (final Attribute attribute : attributes) {

            // JSoup reports whitespace as attributes
            if (!StringUtils.isWhitespace(attribute.getValue())) {
                attribute.setValue(attribute.getValue().replaceAll("\\s*\r?\n\\s*", "|"));
            }
        }

        // Recursively process child nodes
        if (!node.childNodes().isEmpty()) {
            fixAttributeNewlines(node.childNodes());
        }
    }
}

For the sample XML in my question, the output of this method is:

<sample> 
    <p att="John|Paul|Ringo"></p> 
</sample>

Note that I am not using &#10; because JSoup is rather vigilant in its character escaping and escapes everything in attribute values. It also replaces existing numeric entity references with their UTF-8 equivalent, so time will tell whether or not this is a passable solution.
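
For completeness, here is a small sketch of how the cleansed output could then be consumed with a standard parser. The class and method names are hypothetical, and it assumes '|' never occurs in legitimate attribute data. Because the delimiter is not whitespace, attribute value normalization leaves it alone and a plain split recovers the list.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class DelimitedAttributeReader {

    /** Reads an attribute from the first matching element and splits it on the '|' delimiter. */
    static List<String> readList(String xml, String elementName, String attributeName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Element element = (Element) doc.getElementsByTagName(elementName).item(0);
            // '|' is a regex metacharacter, so it has to be escaped for split().
            return Arrays.asList(element.getAttribute(attributeName).split("\\|"));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String cleansed = "<sample> \n    <p att=\"John|Paul|Ringo\"></p> \n</sample>";
        System.out.println(readList(cleansed, "p", "att")); // prints "[John, Paul, Ringo]"
    }
}
```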

1 Comment

Note that the downside of using JSoup is that it currently converts attribute names to lowercase. There is an open bug detailing this.
