I'm trying to parse an XML file from an external source that contains invalid UTF-8 bytes.

I'm using the following Java code:

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setValidating(false);
factory.setIgnoringComments(true);
factory.setNamespaceAware(false);
DocumentBuilder documentBuilder = factory.newDocumentBuilder();
try (InputStream in = getMyInputStream()) {
    Document doc = documentBuilder.parse(new InputSource(in));
    ...
}

and I'm getting the following exception:

Caused by: org.xml.sax.SAXParseException: Invalid byte 2 of 3-byte UTF-8 sequence.
    at java.xml/com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:262)
    at java.xml/com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
    ... 10 common frames omitted
Caused by: com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 2 of 3-byte UTF-8 sequence.
    at java.xml/com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(UTF8Reader.java:702)
    at java.xml/com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:409)
    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1904)
    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.peekChar(XMLEntityScanner.java:508)
    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2649)
    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:605)
    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:534)
    at java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:888)
    at java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:824)
    at java.xml/com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
    at java.xml/com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:246)

I realise that the XML contains an invalid UTF-8 byte sequence, but I'd like the XML parser to handle this gracefully rather than throw an exception.

  • Just generally: we as an industry should try to push against accepting broken XML (and other formats). I know that sometimes workarounds like this are necessary, but it could help us all lose a bit less sanity if we could just say "hey, your input is malformed, fix it". Please push as much as you can every time something like this comes up. It might not always work, but the pushing is useful work. Commented May 4, 2021 at 8:51
  • Agreed, and I've asked the client to fix it on their end. In the meantime I can push a fix to production quicker than the client can, hence the workaround. I've solved my problem and am just posting here since I couldn't find this problem/solution when I initially googled it. Commented May 4, 2021 at 9:04
  • @JoachimSauer I know exactly what you mean. But when dealing with legacy documents you can't push anything. It should be an option for the end user to gracefully ignore errors and still read/interpret as much as possible, instead of rejecting the entire content forever. Commented May 4, 2021 at 9:50

1 Answer

I solved this by passing a java.io.Reader to the DocumentBuilder instead of a java.io.InputStream. The DocumentBuilder now operates on a stream of characters rather than a stream of bytes, so it no longer decodes and validates the bytes itself and therefore does not throw. The byte-to-character conversion is done by the InputStreamReader instead, which substitutes the replacement character U+FFFD for malformed byte sequences rather than failing.

So I changed

try (InputStream in = getMyInputStream()) {
   Document doc = documentBuilder.parse(new InputSource(in));
   ...
}

To

try (Reader reader = new InputStreamReader(getMyInputStream(), StandardCharsets.UTF_8)) {
   Document doc = documentBuilder.parse(new InputSource(reader));
   ...
}
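
For anyone who wants to try this in isolation, here is a minimal self-contained sketch of the same idea. The class name LenientXmlDemo, the helper parseLenient and the sample bytes are made up for illustration and are not from my real code; the sample assumes a 3-byte UTF-8 lead byte (0xE2) followed by a byte that is not a valid continuation, which should be similar to what triggered the exception above.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class LenientXmlDemo {

    // Hypothetical helper: decode the bytes as UTF-8 ourselves, then hand the
    // parser a character stream so it never sees the raw bytes.
    static Document parseLenient(InputStream in) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            return builder.parse(new InputSource(reader));
        }
    }

    // Made-up sample document: <a>, then 0xE2 (starts a 3-byte sequence)
    // followed by 0x28 '(' (not a continuation byte), then </a>.
    static byte[] sampleBytes() {
        return new byte[] {'<', 'a', '>', (byte) 0xE2, 0x28, '<', '/', 'a', '>'};
    }

    public static void main(String[] args) throws Exception {
        Document doc = parseLenient(new ByteArrayInputStream(sampleBytes()));
        String text = doc.getDocumentElement().getTextContent();
        // Print each decoded character as a code point; the malformed byte
        // is expected to show up as the replacement character U+FFFD.
        for (char c : text.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

Two things to be aware of with this approach: the bad bytes are not dropped, they end up in the document as U+FFFD, and because the parser is given a character stream it disregards any encoding declared in the XML prolog, so the charset passed to InputStreamReader has to be the right one.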