How default is the default encoding (UTF-8) in the XML Declaration?

Question

I know that the default encoding of XML is UTF-8. All XML consumers MUST and so on and so forth. So this is not just a question whether or not XML has a default encoding.

I also know that the XML-Declaration <?xml version="1.0" ... ?> at the beginning of the document itself is optional. And that specifying the encoding therein is optional as well.

So I ask myself if the following two XML-Declarations are two expressions for the exact same thing:

<?xml version="1.0"?>
<?xml version="1.0" encoding="UTF-8"?>

From my own current understanding I would say those are equivalent but I do not know. Has the equivalence of these two declarations been specified somewhere?

(Consider these two example lines being each the first line of an XML document, preceded by any (zero) bytes and being UTF-8 encoded)

Fortunately UTF-8 is the default per se. When an XML document is read and written back in another encoding, this attribute is usually patched as well. That is entirely unproblematic, and I cannot imagine why one sees the encoding attribute so often. The version is important though; higher versions allow tag names like <café>.
– Joop Eggen, Commented May 3, 2013 at 15:08
I'm not asking this because I have a problem with character encoding here. I'm just wondering, since those look the same to me, whether this is specified or not, so that it's possible to test my software for conformance.
– hakre, Commented May 3, 2013 at 15:38

Peter O. · Accepted Answer · 2013-05-15 20:06:06Z

The Short Answer

Under the very specific circumstances of a UTF-8 encoded document with no external encoding information (which I understand from the comments is what you're interested in), there is no difference between the two declarations.
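As a quick illustration of that claim, here is a minimal sketch using Python's standard xml.etree.ElementTree (my choice of parser, not something from the question). It feeds the same UTF-8 bytes to the parser under both declarations and compares the decoded text. It shows how one parser behaves and is not a conformance proof.

```python
import xml.etree.ElementTree as ET

# The same UTF-8 encoded body, differing only in the XML declaration.
body = "<root>café</root>".encode("utf-8")
without_encoding = b'<?xml version="1.0"?>' + body
with_encoding = b'<?xml version="1.0" encoding="UTF-8"?>' + body

# Parse both byte strings and pull out the decoded element text.
text_a = ET.fromstring(without_encoding).text
text_b = ET.fromstring(with_encoding).text

print(text_a == text_b)
```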

The long answer is far more interesting though.

What The Spec Says

If you look at Appendix F.1 of the XML specification, it explains the process that should be followed to determine the encoding when there is no external encoding information.

If the document is encoded as one of the UTF variants, the parser should be able to detect the encoding within the first 4 bytes, either from the Byte Order Mark, or the start of the XML declaration.
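A rough sketch of that detection step, covering only a subset of the byte patterns listed in Appendix F.1 (the unusual UCS-4 byte orders and EBCDIC are left out). The function name sniff_xml_encoding and the returned labels are hypothetical, chosen for this example.

```python
def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding family from the first four bytes of an entity."""
    head = data[:4]
    # 4-byte BOMs are tested before the 2-byte UTF-16 BOMs,
    # because FF FE 00 00 also starts with the UTF-16LE BOM.
    if head == b"\x00\x00\xfe\xff":
        return "UTF-32BE"
    if head == b"\xff\xfe\x00\x00":
        return "UTF-32LE"
    if head.startswith(b"\xef\xbb\xbf"):
        return "UTF-8"
    if head.startswith(b"\xfe\xff"):
        return "UTF-16BE"
    if head.startswith(b"\xff\xfe"):
        return "UTF-16LE"
    # No BOM: look at how the leading "<?" is laid out in memory.
    if head == b"\x00<\x00?":
        return "UTF-16BE"
    if head == b"<\x00?\x00":
        return "UTF-16LE"
    if head == b"<?xm":
        # Any ASCII-compatible encoding; the declaration has to be read
        # to find out which one (UTF-8 if it names none).
        return "ASCII-compatible"
    # Neither a BOM nor an XML declaration: UTF-8 is the default.
    return "UTF-8"


print(sniff_xml_encoding(b'<?xml version="1.0"?><root/>'))
print(sniff_xml_encoding('<?xml version="1.0"?><root/>'.encode("utf-16-le")))
```

The "ASCII-compatible" result is the case the question is about: the first bytes alone cannot tell UTF-8 from, say, Latin1, so the encoding declaration (or its absence) decides.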

However, according to the spec, it should still read the encoding declaration.

In cases above which do not require reading the encoding declaration to determine the encoding, section 4.3.3 still requires that the encoding declaration, if present, be read and that the encoding name be checked to match the actual encoding of the entity.

If they don't match, according to section 4.3.3:

...it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration
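To make the rule concrete, here is a simplified sketch of such a check. It assumes only UTF-8 and UTF-16 BOMs need to be recognised and treats everything else as ASCII-compatible; the helper name check_encoding_declaration and the use of ValueError for the "fatal error" are my own choices, not anything the spec prescribes.

```python
import codecs
import re

# Matches an encoding declaration inside the XML declaration only
# ([^>]*? cannot run past the closing "?>").
DECL_RE = re.compile(
    r'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def check_encoding_declaration(data: bytes) -> str:
    """Return the encoding to use, or raise ValueError on a mismatch."""
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        actual, codec = "utf-16", "utf-16"      # codec consumes the BOM
    elif data.startswith(codecs.BOM_UTF8):
        actual, codec = "utf-8", "utf-8-sig"    # codec drops the BOM
    else:
        actual, codec = None, "latin-1"         # decodes any byte sequence

    # Only the start of the entity is needed to read the declaration.
    head = data[:200].decode(codec, errors="replace")
    match = DECL_RE.match(head)
    declared = codecs.lookup(match.group(1)).name if match else None

    if actual is None:
        # No BOM: the declaration decides, UTF-8 is the default.
        return declared or "utf-8"
    if declared is not None and declared != actual:
        raise ValueError(f"declared {declared!r}, but entity is {actual!r}")
    return actual


# A UTF-16 encoded document (with BOM) that claims to be UTF-8.
mismatched = '<?xml version="1.0" encoding="UTF-8"?><root/>'.encode("utf-16")
try:
    check_encoding_declaration(mismatched)
except ValueError as exc:
    print("fatal error:", exc)
```

For the two declarations from the question on a UTF-8 document, this check returns the same result either way.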

Encoded UTF-16, Declared UTF-8

Let's see what happens in reality when we create an XML document encoded as UTF-16 but with the encoding declaration set to UTF-8.

Opera, Firefox and Chrome all interpret the document as UTF-16, ignoring the encoding declaration. Internet Explorer (version 9 at least), displays a blank document, but no actual error.

So if you include a UTF-8 encoding declaration on your UTF-8 document and someone at a later stage converts it to UTF-16, it'll work in most browsers, but fail in IE (and, I assume, most Microsoft XML APIs). If you had left the encoding declaration off, you would have been fine.

Technically I think IE is the most accurate. The fact that it doesn't display an error as such might be explained by the fact that the error is occurring at the encoding level rather than the XML level. It is presumably doing its best to interpret the UTF-16 characters as UTF-8, failing to find any characters that decode, and ending up passing on an empty character sequence to the XML parser.

Encoded UTF-8, Declared Otherwise

You might now think that Firefox, Chrome and Opera are just ignoring the encoding declaration altogether, but that's not always the case.

If you encode a document as UTF-8 (with a byte order mark so it's unmistakable as anything else), but set the encoding declaration to Latin1, all of the browsers will successfully decode the content as Latin1, ignoring the UTF-8 BOM.

Again this seems right to me. The fact that the BOM characters aren't valid in Latin1 just means they are silently dropped at the character decoding level.

This doesn't work for all declared encodings on a UTF-8 document though. If the declared encoding is UTF-16, we're back with Opera, Firefox and Chrome ignoring the declared encoding, while Internet Explorer returns a blank document.

Essentially, anything that makes IE return a blank document is going to make other browsers ignore the declared encoding.

Other Inconsistencies

It's also worth mentioning the importance of the Byte Order Mark. According to section 4.3.3 of the spec:

Entities encoded in UTF-16 MUST [...] begin with the Byte Order Mark

However, if you try and read a UTF-16 encoded XML document without a BOM, most browsers will nevertheless accept it as valid. Only Firefox reports it as an XML Parsing Error.

External Encoding Information

Up to now, we've been considering what happens when there is no external encoding information, but, as others have mentioned, if the document is received via HTTP or enclosed in a MIME envelope of some sort, the encoding information from those sources should take precedence over the document encoding.

Most of the details for the various XML MIME types are described in RFC3023. However, the reality is somewhat different from what is specified.

First of all, text/xml with an omitted charset parameter should use a charset of US-ASCII, but that requirement has almost always been ignored. Browsers will typically use the value of the XML encoding declaration, or default to UTF-8 if there is none.

Second, if there is a UTF-8 BOM on the document, and the XML encoding declaration is either UTF-8 or not included, the document will be interpreted as UTF-8, regardless of the charset used in the Content-Type.

The only time the encoding from the Content-Type seems to take precedence is when there is no BOM and an explicit charset is specified in the Content-Type.

In any event, there are no cases (involving Content-Type) where including a UTF-8 XML encoding declaration on a UTF-8 document is any different from not having an encoding declaration at all.

The spec does say in section 4.3.3 what should happen if the encodings don't match: "In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration [...]" And later: "It is a fatal error if an XML entity is determined (via default, encoding declaration, or higher-level protocol) to be in a certain encoding but contains byte sequences that are not legal in that encoding."

Francis Avila · Accepted Answer · 2013-05-03 15:28:44Z

In isolation, both are equivalent. You have already cited the relevant parts of the specifications which show that both declarations are equivalent.

However XML can have an envelope, such as the HTTP Content-Type header. The W3C specifies that this envelope information has priority over any other declarations in the file. So for example, if you are retrieving XML via http, you could potentially get this:

HTTP/1.1 200 OK
Content-Type: text/xml

<root/>

In this case, the XML should be read as ASCII, because the default charset for text/* MIME types is US-ASCII. This is why you should use application/xml MIME types: these default to UTF-8. The "application" prefix means that the relevant application specifications define things like default encoding. (I.e. the XML spec takes over.) With text/* MIME types, the default is ASCII and the charset parameter must be included in the MIME type to change the charset.

Here's another case:

HTTP/1.1 200 OK
Content-Type: text/xml; charset=win-1252

<?xml version="1.0" encoding="utf-8"?>
<root/>

In this case, a conforming XML processor should read this file as win-1252, not utf-8.

Another case:

HTTP/1.1 200 OK
Content-Type: application/xml

<?xml version="1.0" encoding="win-1252"?>
<root/>

Here the encoding is win-1252.

HTTP/1.1 200 OK
Content-Type: application/xml; charset=ascii

<?xml version="1.0" encoding="win-1252"?>
<root/>

Here the encoding is ascii.
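The precedence described in these four cases can be sketched roughly as follows. This is an illustration of the rules as stated in this answer, not of what browsers actually do; the function name effective_encoding is made up for the example, charset names are passed through as written, and the regular expression only works when the body is in an ASCII-compatible encoding.

```python
import re
from email.message import Message

# Encoding declaration inside the XML declaration (ASCII-compatible input only).
DECL_RE = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def effective_encoding(content_type: str, body: bytes) -> str:
    """Pick the encoding a processor should use for an HTTP response."""
    header = Message()
    header["Content-Type"] = content_type

    # 1. An explicit charset parameter wins over everything in the file.
    charset = header.get_param("charset")
    if charset:
        return charset.lower()

    # 2. text/* without a charset defaults to US-ASCII.
    if header.get_content_maintype() == "text":
        return "us-ascii"

    # 3. application/xml defers to the XML spec: the declaration, else UTF-8.
    match = DECL_RE.match(body)
    return match.group(1).decode("ascii").lower() if match else "utf-8"


declared_1252 = b'<?xml version="1.0" encoding="win-1252"?>\n<root/>'
print(effective_encoding("text/xml", b"<root/>"))
print(effective_encoding("application/xml", declared_1252))
print(effective_encoding("application/xml; charset=ascii", declared_1252))
```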

18 Comments

Francis Avila Over a year ago

In other words, once you have a DOM, the encoding of the original document doesn't matter. encoding is what the declaration said (like standalone or version) and actualEncoding is how the parser parsed it, but all the strings are already converted from the document encoding to native strings.

Francis Avila Over a year ago

I'm not sure what you mean. There are two different DOM attributes as I said. One is data from the processing instruction, the other is from the parser. If the XML document does not have encoding="...." written in it, DOMDocument.encoding has no value, but DOMDocument.actualEncoding will have the encoding the parser used to parse the document. That's all that is occurring here. Remember that the XML spec is not DOM-centric, so you should not put much stock in exact DOM equality when comparing documents.

Francis Avila Over a year ago

Yes, but the DOM chose to distinguish between "encoding param in the xml declaration" and "encoding the parser used", since as you can see these may be different. The XML infoset (vs the DOM) does not contain the encoding param in the xml declaration and only contains "what the parser used" information. However, in both, standalone and version can have no value if there was no xml declaration. I'm not sure why this is so mind-blowing to you!

Francis Avila Over a year ago

This is not an xml parsing issue (the xml was already parsed) but the vagaries of your particular DOM library. Once again, xmlEncoding is not standard.

Francis Avila Over a year ago

You're right, it's xmlEncoding and inputEncoding. Not sure now where I was reading that it was encoding and actualEncoding. You can see in the infoset mapping that xmlEncoding/encoding='...' is not part of the XML infoset.
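For readers who want to see the distinction from this comment thread in practice, here is a small sketch using Python's xml.dom.minidom. It is an assumption on my part that this library is a fair stand-in for the DOM implementation discussed above; minidom calls the declared value encoding rather than xmlEncoding.

```python
from xml.dom import minidom

doc_a = minidom.parseString(b'<?xml version="1.0"?><root>text</root>')
doc_b = minidom.parseString(
    b'<?xml version="1.0" encoding="UTF-8"?><root>text</root>'
)

# The DOM records what the declaration said (or that it said nothing) ...
print(doc_a.encoding, doc_b.encoding)

# ... while the parsed content of the two documents is the same.
same_content = doc_a.documentElement.toxml() == doc_b.documentElement.toxml()
print(same_content)
```

So two documents can differ in this one DOM attribute and still carry identical content, which is why exact DOM equality is a poor test for whether the declarations are equivalent.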

Joe · Accepted Answer · 2013-05-03 15:48:36Z

It would not be unreasonable for the second declaration to be rejected if it arrived at the start of a document that had already been detected as having a non-UTF-8 compatible encoding (such as UTF-16). However, given your statement that the document is UTF-8 encoded, there is no difference between how they would be treated.

An externally-specified encoding would take precedence in both cases; both documents would still be treated identically.

2 Comments

hakre Over a year ago

Thanks for taking the time to answer, but I have two problems here: First, the section you link is not normative. Second, and more important, you write about the character encoding of the input string and guessing that. My question is not about that; it is about the XML encoding, the declared one, and to what extent a missing declaration is not a missing declaration at all.

Joe Over a year ago

The answer to your question is yes, they are the same. I'm writing about detecting encoding to provide more detail for a very similar case, even though it's more general than what you're asking. I think 4.3.3's normative "it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8" confirms this.

nwellnhof · Accepted Answer · 2013-05-05 21:23:19Z

The way I read the spec, UTF-8 is not the default encoding in an XML declaration. It is only the default encoding "for an entity which begins with neither a Byte Order Mark nor an encoding declaration". If a document is in UTF-16 and has a BOM, it may have an XML declaration without an encoding declaration or no XML declaration at all and still be valid XML.

Only for documents without a BOM should the two XML declarations you mentioned be equivalent.

2 Comments

hakre Over a year ago

Which is why you find at the end of the question "(Consider these two example lines being each the first line of an XML document, preceded by any (zero) bytes and being UTF-8 encoded)" :)

hakre Over a year ago

No problem, this can just happen.
