I ran into a problem reading an XML file. I found a workaround, but some questions remain. The malformed XML file is encoded in UTF-8 and declares that encoding in its header, but it also contains a character encoded in UTF-16: 'é'. This code was used to read the XML file in order to validate its content:

var xDoc = XDocument.Load(taxFile);

It throws an exception for the malformed XML file: "Invalid character in the given encoding. Line 59, position 104." The quick fix is as follows:

XDocument xDoc = null;
using (var oReader = new StreamReader(taxFile, Encoding.UTF8))
{
    xDoc = XDocument.Load(oReader);
}

This code doesn't throw an exception for the malformed file, but the 'é' character is loaded as �. My first question is: why does it work?

Another point: XmlReader doesn't throw an exception until the node containing 'é' is read.

XmlReader xmlTax = XmlReader.Create(filePath);

Again, the StreamReader workaround helps, which raises the same question. The fix doesn't seem good enough, because one day an XML file in another encoding may appear and be processed incorrectly. However, I tried processing a UTF-16 encoded XML file with the reader configured for UTF-8, and it worked fine.

My final question: are there any options that can be passed to XDocument/XmlReader to ignore the character encoding, or something like that?

Looking forward to your replies. Thanks in advance.

  • How exactly is the 'é' encoded in the XML? Post that part of the file. And to be sure, the header. Commented Jan 11, 2013 at 15:35
  • > file is in encoded in UTF-8 and has appropriate mark in its header. But it also includes a char encoded in UTF-16 - 'é'. So this file is not a valid XML file after all. Why does it contain that character? Commented Jan 11, 2013 at 15:37
  • <?xml version="1.0" encoding="UTF-8"?> <Tag ... name="Away From Home: Eating in Restaurant/Café" /> Commented Jan 11, 2013 at 15:39
  • > Migol: Yes, sure, it's not valid XML. I have no idea why, and no way to find out myself. Commented Jan 11, 2013 at 15:41

1 Answer

The first thing to note is that the XML file is in fact flawed - mixing text encodings in the same file like this should not be done. The error is even more obvious when the file actually has an explicit encoding embedded.

As for why it can be read without an exception using StreamReader: Encoding contains settings that control what happens when incompatible data is encountered.

Encoding.UTF8 is documented to use fallback characters. From http://msdn.microsoft.com/en-us/library/system.text.encoding.utf8.aspx:

The UTF8Encoding object that is returned by this property may not have the appropriate behavior for your application. It uses replacement fallback to replace each string that it cannot encode and each byte that it cannot decode with a question mark ("?") character.

You can instantiate the encoding yourself to get different settings. This is most probably what XDocument.Load() does, as it would generally be bad to hide errors by default. http://msdn.microsoft.com/en-us/library/system.text.utf8encoding.aspx
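To make the difference concrete, here is a minimal sketch. The class and method names (Utf8DecodingDemo, DecodeLenient, DecodeStrict, LoadStrict) are made up for illustration, and the sample bytes are an assumption about what the broken attribute looks like on disk, not taken from the actual file. In practice the lenient decoder substitutes U+FFFD for the bad byte, which matches the � seen in the question, while a UTF8Encoding created with throwOnInvalidBytes set to true raises DecoderFallbackException.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.Linq;

public static class Utf8DecodingDemo
{
    // "Caf" + 0xE9 + '"': a lone 0xE9 byte is not a valid UTF-8 sequence.
    // (Assumed sample data, standing in for the broken attribute value.)
    public static readonly byte[] Sample = { 0x43, 0x61, 0x66, 0xE9, 0x22 };

    // Encoding.UTF8 uses a replacement fallback: invalid bytes are
    // silently substituted instead of causing an error.
    public static string DecodeLenient(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    // A UTF8Encoding built with throwOnInvalidBytes: true uses an
    // exception fallback, so invalid bytes raise DecoderFallbackException.
    public static string DecodeStrict(byte[] bytes)
    {
        var strict = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        return strict.GetString(bytes);
    }

    // Same idea applied to the StreamReader approach from the question:
    // with a strict encoding the reader no longer hides the bad byte.
    public static XDocument LoadStrict(string path)
    {
        var strict = new UTF8Encoding(false, true);
        using (var reader = new StreamReader(path, strict))
        {
            return XDocument.Load(reader);
        }
    }
}
```

Calling DecodeLenient(Sample) returns a string containing the replacement character, whereas DecodeStrict(Sample) throws. Which behaviour you want depends on whether silently losing the 'é' is acceptable for your validation.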

If you are being sent such broken XML files, step 1 is to complain (loudly) about it. There is no valid reason for such behavior. If you then absolutely must process them anyway, I suggest having a look at the UTF8Encoding class and its DecoderFallback property. It seems you should be able to implement a custom DecoderFallback and DecoderFallbackBuffer to add logic that will understand the UTF-16 byte sequence.
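A rough sketch of such a custom fallback might look like the following. Note the assumptions: it treats every undecodable byte as a single-byte Latin-1 (ISO-8859-1) character, which fits an 'é' stored as the byte 0xE9 but is only a guess about how the file was actually produced. It does not decode real UTF-16 sequences. All type names here (Latin1DecoderFallback, Latin1DecoderFallbackBuffer, Latin1Fallback) are hypothetical.

```csharp
using System;
using System.Text;

// Hypothetical fallback: reinterprets each undecodable byte as Latin-1.
public sealed class Latin1DecoderFallback : DecoderFallback
{
    // Conservative upper bound on chars produced per fallback call
    // (assumption: one char per byte of an invalid UTF-8 subsequence).
    public override int MaxCharCount { get { return 4; } }

    public override DecoderFallbackBuffer CreateFallbackBuffer()
    {
        return new Latin1DecoderFallbackBuffer();
    }
}

public sealed class Latin1DecoderFallbackBuffer : DecoderFallbackBuffer
{
    private char[] _chars = new char[0];
    private int _position;

    public override int Remaining
    {
        get { return _chars.Length - _position; }
    }

    // Called by the decoder with the bytes it could not decode.
    public override bool Fallback(byte[] bytesUnknown, int index)
    {
        _chars = new char[bytesUnknown.Length];
        for (int i = 0; i < bytesUnknown.Length; i++)
        {
            // Latin-1 byte values equal the Unicode code points U+0000..U+00FF.
            _chars[i] = (char)bytesUnknown[i];
        }
        _position = 0;
        return _chars.Length > 0;
    }

    // Returns the next replacement char, or '\0' when none are left.
    public override char GetNextChar()
    {
        return _position < _chars.Length ? _chars[_position++] : '\0';
    }

    public override bool MovePrevious()
    {
        if (_position <= 0) return false;
        _position--;
        return true;
    }

    public override void Reset()
    {
        _chars = new char[0];
        _position = 0;
    }
}

public static class Latin1Fallback
{
    // UTF-8 decoder that falls back to Latin-1 for invalid bytes.
    public static Encoding CreateEncoding()
    {
        return Encoding.GetEncoding(
            "utf-8",
            EncoderFallback.ExceptionFallback,
            new Latin1DecoderFallback());
    }

    public static string Decode(byte[] bytes)
    {
        return CreateEncoding().GetString(bytes);
    }
}
```

The encoding returned by CreateEncoding() can be passed to the StreamReader constructor in place of Encoding.UTF8. Keep in mind that this guesses at the intended character, so it can quietly turn genuinely corrupt data into plausible-looking text.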
