I have a super simple XML document encoded in UTF-16 LE.

<?xml version="1.0" encoding="utf-16"?><X id="1" />

I'm loading it in as such (using jcabi-xml):

BOMInputStream bomIn = new BOMInputStream(Main.class.getResourceAsStream("resources/test.xml"), ByteOrderMark.UTF_16LE);
String firstNonBomCharacter = Character.toString((char)bomIn.read());
Reader reader = new InputStreamReader(bomIn, "UTF-16");
String xmlString = IOUtils.toString(reader);
xmlString = xmlString.trim();
xmlString = firstNonBomCharacter + xmlString;
bomIn.close();
reader.close();
final XML xml = new XMLDocument(xmlString);

I have checked that there are no extra BOM/junk symbols (leading or anywhere) by saving out the file and inspecting it with a hex editor. The XML is properly formatted.

However, I still get the following error:

[Fatal Error] :1:40: Content is not allowed in prolog.
Exception in thread "main" java.lang.IllegalArgumentException: Invalid XML: "<?xml version="1.0" encoding="utf-16"?><X id="1" />"
    at com.jcabi.xml.DomParser.document(DomParser.java:115)
    at com.jcabi.xml.XMLDocument.<init>(XMLDocument.java:155)
    at Main.getTransformedString(Main.java:47)
    at Main.main(Main.java:26)
Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 40; Content is not allowed in prolog.
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
    at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
    at com.jcabi.xml.DomParser.document(DomParser.java:105)
    ... 3 more

I have googled up and down for this error but they all say that it's the BOM's fault, which I have confirmed (to the best of my knowledge) to not be the case. What else could be wrong?

  • If the file is in UTF-16, then shouldn't the BOM be reserving the first two bytes of the file? What if you add another bomIn.read(); for discarding the second byte? Commented Mar 17, 2016 at 20:47
  • Actually, now that I had a look at the JavaDocs for BOMInputStream, you should remove the bomIn.read() call altogether because the stream discards the BOM for you. Commented Mar 17, 2016 at 20:53
  • @MickMnemonic That's what I thought too, but when I don't call bomIn.read() my string turns into nothing but question marks. Truthfully, I'm not sure exactly how to use BOMInputStream, but this answer (stackoverflow.com/questions/1835430/…) says that calling read skips to the first non-BOM character (which I forgot to include in my sample code). Commented Mar 17, 2016 at 20:57
  • If you're consuming the BOM, then the InputStreamReader should be told about the endianness: Reader reader = new InputStreamReader(bomIn, StandardCharsets.UTF_16LE); Commented Mar 17, 2016 at 21:06
  • @MickMnemonic Now that allows things to work without calling bomIn.read(), thanks! However the actual error itself persists. Commented Mar 17, 2016 at 21:10
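
To illustrate the endianness point from the comments, here is a minimal JDK-only sketch of how Java's UTF-16 and UTF-16LE charsets treat little-endian bytes with and without a BOM. The class name Utf16DecodeDemo and the helper littleEndianBytes are made up for this illustration; they are not part of the question's code or of Commons IO.

```java
import java.nio.charset.StandardCharsets;

class Utf16DecodeDemo {
    /** "<X/>" encoded as UTF-16LE, optionally prefixed with the FF FE byte order mark. */
    static byte[] littleEndianBytes(boolean withBom) {
        byte[] body = "<X/>".getBytes(StandardCharsets.UTF_16LE);
        if (!withBom) {
            return body;
        }
        byte[] out = new byte[body.length + 2];
        out[0] = (byte) 0xFF;
        out[1] = (byte) 0xFE;
        System.arraycopy(body, 0, out, 2, body.length);
        return out;
    }

    public static void main(String[] args) {
        byte[] noBom = littleEndianBytes(false);
        byte[] withBom = littleEndianBytes(true);

        // BOM already stripped + generic "UTF-16": the decoder assumes big-endian,
        // so every character comes out wrong.
        System.out.println("<X/>".equals(new String(noBom, StandardCharsets.UTF_16)));

        // BOM already stripped + explicit "UTF-16LE": decodes correctly.
        System.out.println("<X/>".equals(new String(noBom, StandardCharsets.UTF_16LE)));

        // BOM still present + generic "UTF-16": the decoder reads the BOM,
        // picks little-endian, and drops the BOM from the result.
        System.out.println("<X/>".equals(new String(withBom, StandardCharsets.UTF_16)));

        // BOM still present + explicit "UTF-16LE": the BOM is kept as a
        // leading U+FEFF character in the decoded string.
        System.out.println(new String(withBom, StandardCharsets.UTF_16LE).startsWith("\uFEFF"));
    }
}
```

This would explain why the string was unreadable once the BOM was stripped and the reader was still opened as plain "UTF-16". It does not address the parse error itself.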

1 Answer

The following works for me:

    try (InputStream stream = Test.class.getResourceAsStream("/Test.xml")) {
        StreamSource source = new StreamSource(stream);
        final XML xml = new XMLDocument(source);
    }

With the input file's hex dump:

FF FE 3C 00 3F 00 78 00 6D 00 6C 00 20 00 76 00 65 00 72 00 73 00 69 00  
6F 00 6E 00 3D 00 27 00 31 00 2E 00 30 00 27 00 20 00 65 00 6E 00 63 00 
6F 00 64 00 69 00 6E 00 67 00 3D 00 27 00 55 00 54 00 46 00 2D 00 31 00 
36 00 27 00 3F 00 3E 00 3C 00 58 00 20 00 69 00 64 00 3D 00 22 00 31 00 
22 00 2F 00 3E 00

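As far as I can tell, in your example you are converting the contents of the file to a string. This is problematic because a Java String carries no encoding: once the bytes have been decoded, the fact that the file was UTF-16 is lost. When the string is later converted back to a byte array for the SAX parser, it appears to be encoded as UTF-8, but the prolog still declares UTF-16. The parser reads the declaration, switches to UTF-16 for the rest of the input, and sees garbage, which would explain why the error is reported at column 40, immediately after the 39-character XML declaration.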

When I use the StreamSource instead, the parser reads the raw bytes itself and automatically detects from the BOM that the file is encoded in UTF-16 LE.
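
The mismatch can be reproduced with the JDK's built-in parser alone. The following is a sketch, not what jcabi-xml does internally: it assumes the string ends up as UTF-8 bytes before parsing, and the class and method names (EncodingMismatchDemo, rootOrNull, fromBytes, fromString) are invented for the illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class EncodingMismatchDemo {
    static final String XML = "<?xml version=\"1.0\" encoding=\"utf-16\"?><X id=\"1\" />";

    /** Byte input: the parser must work out the encoding from the BOM and the declaration. */
    static InputSource fromBytes(byte[] bytes) {
        return new InputSource(new ByteArrayInputStream(bytes));
    }

    /** Character input: the text is already decoded, so the declared encoding no longer matters. */
    static InputSource fromString(String xml) {
        return new InputSource(new StringReader(xml));
    }

    /** Returns the root element name, or null if the parser rejects the input. */
    static String rootOrNull(InputSource source) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(source);
            return doc.getDocumentElement().getTagName();
        } catch (SAXException | IOException e) {
            return null; // the parser could not read the document
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Bytes really are UTF-16 (getBytes writes a BOM), matching the declaration.
        System.out.println(rootOrNull(fromBytes(XML.getBytes(StandardCharsets.UTF_16))));
        // Bytes are UTF-8 but the declaration claims utf-16: the mismatch from the question.
        System.out.println(rootOrNull(fromBytes(XML.getBytes(StandardCharsets.UTF_8))));
        // Handing the parser characters instead of bytes sidesteps the problem.
        System.out.println(rootOrNull(fromString(XML)));
    }
}
```

If the XML has to stay in a String, parsing it through a character stream, as in the last call, avoids re-encoding it; otherwise, passing the original stream to the parser, as in the snippet above, is the simpler option.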

If you are not on Java 7 or later and cannot use try-with-resources, close the stream explicitly with stream.close(), ideally in a finally block.
