2

I am getting this error when parsing an incorrectly-generated XML document:

org.xml.sax.SAXParseException: The value of attribute "bar" associated with an element type "foo" must not contain the '<' character.

I know what is causing the problem. It is this line:

<foo bar="x<y">42</foo>

It should have been

<foo bar="x&lt;y">42</foo>

I am aware that this is not valid XML, but my code has to download and parse similar files unattended and for political reasons it might not be possible to persuade the supplier to fix the faulty program, especially when other programs are reading the file and tolerating this error.

Is there any way to configure Xerces to tolerate it? At present it treats it as a fatal error. Implementing an ErrorHandler to ignore it is not satisfactory because then the remainder of the document is not parsed.

Alternatively can you suggest another stream-based parser that can be configured to tolerate this error? Using a DOM parser is not feasible as these documents run into hundreds of megabytes.

2
  • 2
    This is a political problem. It requires a political solution, not a technical one. Commented Jul 23, 2010 at 7:25
  • 1
    Xerces might not tolerate it, but an alternative library like jsoup (jsoup.org) may be a better fit in this case. It looks like it was originally designed for HTML, but I've used it to successfully read data from buggy XML. stackoverflow.com/questions/9886531/how-to-parse-xml-with-jsoup Commented Oct 17, 2016 at 21:43

2 Answers 2

5

... and for political reasons it might not be possible to persuade the supplier to fix the faulty program ...

For political reasons you ought to try your damnedest to get them to fix it. Wave the requirements specification in front of them that says that the input must be well-formed XML. Threaten to bill them for the cost of developing a bespoke parser. (OK, that probably won't work ...)

By giving up without a fight, you are just leaving the problem to trouble other people who have to deal with this supplier in the future.

Sign up to request clarification or add additional context in comments.

Comments

4

I don't think you will find any XML parsers that will tolerate this sort of error. The only thing I can suggest is that you pre-process the XML to remove errors that might occur.

4 Comments

The funny thing about pre-processing, if you think about it being used in the example presented, is that you need to understand the various XML contexts like the start of a new node etc. In other words, some basic XML parsing logic itself needs to be done, that can recognize the need for XML encoding in a context and apply it. It looks like it would involve writing a "tolerant" XML parser just like the OP wants.
@Vineet. The problem is that it cannot be done in a general way. The preprocessing needs to be done based on @finnw's knowledge of what the vendor XML ought to look like, a systematization of the observed mistakes and then pattern-based matching and correction. Without using @finnw's knowledge, there could be many possible corrections and no way for a hypothetical error tolerant parser to pick the right one.
@Stephen, that's precisely why pre-processing cannot be a blanket solution. One can use regexes and other schemes to discover these errors, but if they're random and could occur in any attribute value or text node, then what?
@Vineet - obviously. Truly random errors are hard to deal with. And any sufficiently broken XML cannot be fixed by any means. But an (ideal) human directed corrector will do better than an (ideal) corrector that takes no human direction ... because in practice errors are likely to be (somewhat) systematic rather than truly random. For instance, the error in the example is most likely caused by some broken software that doesn't escape special characters in values of certain attributes. Knowing this, you can design your regexes to match those attributes and correct their values.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.