I am trying to read an XML file in which one element contains a nested XML document with its own XML declaration. As expected, parsing fails with the exception: "Unexpected XML declaration. The XML declaration must be the first node in the document, and no white space characters are allowed to appear before it."

How can I read that specific element as text and parse it as a separate XML document for later deserialization?

<?xml version="1.0" encoding="UTF-8"?>
<Data>
  <Items>
    <Item>
      <Target type="System.String">Some target</Target>
      <Content type="System.String"><?xml version="1.0" encoding="utf-8"?><Data><Items><Item><surname type="System.String">Some Surname</surname><name type="System.String">Some Name</name></Item></Items></Data></Content>
    </Item>
  </Items>
</Data>

Every approach I have tried fails with the declaration exception:

    var xml = System.IO.File.ReadAllText("Info.xml");

    var xDoc = XDocument.Parse(xml); // Exception

    var xmlDoc = new XmlDocument();
    xmlDoc.LoadXml(xml); // Exception

    var xmlReader = XmlReader.Create(new StringReader(xml));
    xmlReader.ReadToFollowing("Content"); // Exception

I have no control over how the XML is created.

  • 3
    Your XML is not 'valid', as the start tag and the end tag are not the same: <Target type="System.String">Some target</SPHTarget> Commented Apr 1, 2019 at 10:59
  • 2
    OK, but your XML is still not valid. Are you obliged to deal with invalid XML? Commented Apr 1, 2019 at 11:05
  • 3
    You have more than one <?xml version="1.0" encoding="utf-8"?> declaration in the XML. That's why you are getting this error. If you can avoid that, you can read the complete XML from <Content>. Is it mandatory to have another XML declaration inside <Content>? Commented Apr 1, 2019 at 11:05
  • 3
    If you "have to" deal with invalid XML, you could 'hack' it into shape by treating the XML as a string and removing the invalid element with string replacement. Commented Apr 1, 2019 at 11:08
  • 2
    @jdweng It's definitely not a white space issue; it's a nested declaration problem. Commented Apr 1, 2019 at 11:29

2 Answers

1

The only way I know of is to get rid of the illegal second <?xml> declaration. I wrote a sample that simply looks for the second <?xml> declaration and discards it. After that, the string is well-formed XML and can be parsed. You may need to tweak it a bit to make it work for your exact scenario.

Code:

using System;
using System.Xml;

public class Program
{
    public static void Main()
    {
        var badXML = @"<?xml version=""1.0"" encoding=""UTF-8""?>
<Data>
  <Items>
    <Item>
      <Target type=""System.String"">Some target</Target>
      <Content type=""System.String""><?xml version=""1.0"" encoding=""utf-8""?><Data><Items><Item><surname type=""System.String"">Some Surname</surname><name type=""System.String"">Some Name</name></Item></Items></Data></Content>
    </Item>
  </Items>
</Data>";

        var goodXML = badXML.Replace(@"<Content type=""System.String""><?xml version=""1.0"" encoding=""utf-8""?>"
                                   , @"<Content type=""System.String"">");

        var xmlDoc = new XmlDocument();
        xmlDoc.LoadXml(goodXML);

        XmlNodeList itemRefList = xmlDoc.GetElementsByTagName("Content");
        foreach (XmlNode xn in itemRefList)
        {
            Console.WriteLine(xn.InnerXml);
        }
    }
}

Output:

<Data><Items><Item><surname type="System.String">Some Surname</surname><name type="System.String">Some Name</name></Item></Items></Data>

Working DotNetFiddle: https://dotnetfiddle.net/ShmZCy
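
To get from the cleaned string to a separate document per <Content> element (the "later deserialization" part of the question), something along these lines might work. This is only a sketch: the class and method names are hypothetical, and it assumes the nested declaration has already been removed (as with goodXML above) and that each <Content> holds exactly one child element.

```csharp
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of the original sample.
public static class NestedXml
{
    // Expects XML whose nested <?xml ...?> declaration has already been removed.
    // Returns the first child element of every <Content> as a standalone XDocument.
    public static XDocument[] ExtractContentDocuments(string cleanedXml)
    {
        return XDocument.Parse(cleanedXml)
            .Descendants("Content")
            // Take(1): assume one nested root element per <Content>.
            .SelectMany(content => content.Elements().Take(1))
            // Copy the element so the new document is detached from the outer tree.
            .Select(root => new XDocument(new XElement(root)))
            .ToArray();
    }
}
```

Each returned XDocument can then be handed to a serializer, for example through its CreateReader() method.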

Perhaps needless to say: none of this would have been needed if whatever created this invalid XML had followed the common practice of wrapping the nested XML in a <![CDATA[ .... ]]> block.


1

The <?xml ...?> XML declaration is only allowed at the very start of an XML document, so the XML that you've been given isn't well-formed. That makes it quite difficult to parse as is without either changing the source document (which you've indicated isn't possible) or preprocessing the source.

You could try:

  1. Stripping out the <?xml ?> instruction with regex or string manipulation, but the cure there may be worse than the disease.
  2. Using the Html Agility Pack, which implements a more forgiving parser and may cope with an XML document like this one.
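
For option 1, a minimal regex-based sketch could look like the following. The class and method names are made up for illustration, and the approach is blunt: it would also strip text that merely looks like a declaration inside a comment or CDATA section, which is part of why the cure may be worse than the disease.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class XmlDeclarationCleaner
{
    // Matches an XML declaration such as <?xml version="1.0" encoding="utf-8"?>.
    // The \s after "xml" keeps instructions like <?xml-stylesheet ...?> untouched.
    private static readonly Regex Declaration = new Regex(@"<\?xml\s[^?]*\?>");

    public static string StripNestedDeclarations(string xml)
    {
        // Keep a declaration only when it sits at index 0, the one place XML allows it.
        return Declaration.Replace(xml, m => m.Index == 0 ? m.Value : string.Empty);
    }
}
```

The result can then be passed to XDocument.Parse or XmlDocument.LoadXml as usual.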

Other than that, the producer of the document should look to produce well-formed XML:

  1. CDATA sections can help with this, but be aware that a CDATA section can't contain the ]]> end marker.
  2. XML escaping the XML text can work fine; that is, use the standard routines to turn < into &lt; and so forth.
  3. XML namespaces can also help here, but they can be daunting in the beginning.
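
As a rough illustration of the first two producer-side options, here is how a producer using LINQ to XML might embed the inner document as text, and how a consumer could read it back. The names are hypothetical and the element layout simply mirrors the sample in the question; the ]]> caveat for CDATA still applies.

```csharp
using System.Linq;
using System.Xml.Linq;

// Hypothetical producer/consumer pair for illustration only.
public static class WellFormedEmbedding
{
    // Producer side: store the inner document as text, never as raw markup.
    public static string Embed(string innerXml, bool useCData)
    {
        // XCData is written as <![CDATA[...]]>; a plain string is escaped (&lt;, &amp;, ...).
        object payload = useCData ? new XCData(innerXml) : (object)innerXml;
        var outer = new XElement("Data",
            new XElement("Items",
                new XElement("Item",
                    new XElement("Content", payload))));
        return outer.ToString(SaveOptions.DisableFormatting);
    }

    // Consumer side: Value yields the original text for both variants,
    // including the nested <?xml ...?> declaration, so it parses on its own.
    public static XDocument ReadEmbedded(string outerXml)
    {
        var content = XDocument.Parse(outerXml).Descendants("Content").First();
        return XDocument.Parse(content.Value);
    }
}
```

Either way the outer document stays well-formed, and the consumer needs no string surgery before parsing.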

1 Comment

Thanks for the extended advice.
