2

I have some code that reads in an XML file, formats it, and then writes it back to the same file. However, if no encoding is declared in the input, the output XML declares UTF-8.

For example:

<?xml version="1.0"?>

becomes:

<?xml version="1.0" encoding="UTF-8"?>

I was wondering if there was any way to preserve whatever encoding (or lack of encoding) that was there before?

Here is my current code:

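import java.io.StringWriter;
import java.io.Writer;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.apache.xml.serialize.OutputFormat;
import org.apache.xml.serialize.XMLSerializer;
import org.w3c.dom.Document;

// Parse the existing file into a DOM document
DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
Document document = docBuilder.parse(file);

// Configure pretty-printing
OutputFormat format = new OutputFormat(document);
format.setLineWidth(65);
format.setIndenting(true);
format.setIndent(2);

// Serialize the formatted document into a string
Writer out = new StringWriter();
XMLSerializer serializer = new XMLSerializer(out, format);
serializer.serialize(document);

// custom method to write the file
writeFile(filePath, out.toString());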

Any help is appreciated. Thanks.

3 Answers

4

OutputFormat has a setEncoding(String) method. Use it like this:

format.setEncoding(document.getXmlEncoding());

This will keep the original encoding of the document in the output document preamble. However, if the original document did not declare an encoding, document.getXmlEncoding() returns null, and the Javadoc for OutputFormat.setEncoding(String) doesn't specify how the method behaves when given null.
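To see what getXmlEncoding() reports in each case, here is a minimal sketch that uses only the JDK's built-in parser. The class name DeclaredEncoding and the helper of(...) are made up for illustration; they are not part of the asker's code or of Xerces.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical helper class, for illustration only.
final class DeclaredEncoding {

    /** Returns the encoding named in the XML declaration, or null if none was declared. */
    static String of(String xml) {
        try {
            // The sample inputs are ASCII-only, so these bytes are valid
            // in every encoding the samples declare.
            byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
            return doc.getXmlEncoding();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // No encoding in the declaration
        System.out.println(of("<?xml version=\"1.0\"?><root/>"));
        // Encoding declared explicitly
        System.out.println(of("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><root/>"));
    }
}
```

Because the first case yields null, it is worth checking for null explicitly before deciding what to pass on to the serializer.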

Of course, your custom file-writing method will need to take the encoding as a parameter, because it is an error to declare one encoding in the preamble and then write the file in another.
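The asker's writeFile method is not shown, so the following is only a sketch of what an encoding-aware version might look like. The class XmlFileWriter, the helper encode(...), and the three-argument writeFile(...) signature are assumptions. A null encoding falls back to UTF-8, which matches what a parser will assume when the declaration names no encoding.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical replacement for the asker's custom writeFile method.
final class XmlFileWriter {

    /** Encodes the XML text with the declared encoding, or UTF-8 if none was declared. */
    static byte[] encode(String xml, String declaredEncoding) {
        Charset charset = (declaredEncoding == null)
                ? StandardCharsets.UTF_8
                : Charset.forName(declaredEncoding); // throws if the name is unsupported
        return xml.getBytes(charset);
    }

    /** Writes the XML text to the given path using the matching byte encoding. */
    static void writeFile(Path path, String xml, String declaredEncoding) {
        try {
            Files.write(path, encode(xml, declaredEncoding));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this shape, the call site would pass document.getXmlEncoding() as the third argument so the declaration and the bytes on disk always agree.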

As a side note, UTF-8 is the default encoding in XML: a parser that finds no encoding declaration (and no byte order mark indicating UTF-16) assumes UTF-8. So omitting the encoding in the preamble or specifying UTF-8 has the same meaning.
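A quick way to check this equivalence is to parse the same UTF-8 bytes with and without the declaration and compare the text the parser returns. This is a sketch using the JDK's built-in parser; the class DefaultEncodingCheck and its rootText(...) helper are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical helper class, for illustration only.
final class DefaultEncodingCheck {

    /** Parses the XML from UTF-8 bytes and returns the root element's text. */
    static String rootText(String xml) {
        try {
            byte[] utf8 = xml.getBytes(StandardCharsets.UTF_8);
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(utf8));
            return doc.getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String implicit = rootText("<?xml version=\"1.0\"?><r>caf\u00e9</r>");
        String explicit = rootText("<?xml version=\"1.0\" encoding=\"UTF-8\"?><r>caf\u00e9</r>");
        // Both documents decode to the same text.
        System.out.println(implicit.equals(explicit));
    }
}
```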


1 Comment

This worked perfectly. If I define the encoding, it takes it and outputs that exact one; if I don't define it will not output the default. Thanks for the detailed explanation. It was a big help. Also, I will be using some custom interpreters for XML and I've encountered some places where it treats encoding-specified and encoding-ambiguous files differently. So while formatting I wanted to keep them as they were before. Thanks again!
1

You can use Document.getXmlEncoding() and pass the result as a constructor argument to one of the OutputFormat class's overloaded constructors.


-1

By default, StreamWriter is created using UTF-8 without a preamble. See details here

1 Comment

Not sure how to write this without sounding rude, but I'm using Java and StringWriter, not C# and StreamWriter. How are they related?
