How to parse XML with nested XML text

Question

I am trying to read an XML file that contains a nested XML document with its own XML declaration. As expected, I get an exception: Unexpected XML declaration. The XML declaration must be the first node in the document, and no white space characters are allowed to appear before it.

How can I read that specific element as text and parse it as a separate XML document for later deserialization?

<?xml version="1.0" encoding="UTF-8"?>
<Data>
  <Items>
    <Item>
      <Target type="System.String">Some target</Target>
      <Content type="System.String"><?xml version="1.0" encoding="utf-8"?><Data><Items><Item><surname type="System.String">Some Surname</surname><name type="System.String">Some Name</name></Item></Items></Data></Content>
    </Item>
  </Items>
</Data>

Every approach I try fails with the declaration exception.

    var xml = System.IO.File.ReadAllText("Info.xml");

    var xDoc = XDocument.Parse(xml); // Exception

    var xmlDoc = new XmlDocument();
    xmlDoc.LoadXml(xml); // Exception

    var xmlReader = XmlReader.Create(new StringReader(xml));
    xmlReader.ReadToFollowing("Content"); // Exception

I have no control over XML creation.

Your XML is not 'valid' because the start tag and the end tag are not the same: <Target type="System.String">Some target</SPHTarget>
– user10864482, Commented Apr 1, 2019 at 10:59
OK, but your XML is still not valid. Are you obliged to deal with invalid XML?
– user10864482, Commented Apr 1, 2019 at 11:05
You have multiple <?xml version="1.0" encoding="utf-8"?> declarations in the XML. That's why you are getting this error. If you can avoid the second one, you can read the complete XML from <Content>. Is it mandatory to have another XML declaration inside <Content>?
– Chetan, Commented Apr 1, 2019 at 11:05
If you "have to" deal with invalid XML, you could 'hack' it to make it valid by treating the XML as a string and removing the invalid element with string replacement.
– user10864482, Commented Apr 1, 2019 at 11:08
@jdweng It's definitely not a white space issue; it's a nested declaration problem.
– Arvis, Commented Apr 1, 2019 at 11:29

Peter B · Accepted Answer · 2019-04-01 11:53:49Z

The only way I know of is to get rid of the illegal second <?xml> declaration. I wrote a sample that simply looks for the second <?xml> declaration and discards it. After that, the string is well-formed XML and can be parsed. You may need to tweak it a bit to make it work for your exact scenario.

Code:

using System;
using System.Xml;

public class Program
{
    public static void Main()
    {
        var badXML = @"<?xml version=""1.0"" encoding=""UTF-8""?>
<Data>
  <Items>
    <Item>
      <Target type=""System.String"">Some target</Target>
      <Content type=""System.String""><?xml version=""1.0"" encoding=""utf-8""?><Data><Items><Item><surname type=""System.String"">Some Surname</surname><name type=""System.String"">Some Name</name></Item></Items></Data></Content>
    </Item>
  </Items>
</Data>";

        var goodXML = badXML.Replace(@"<Content type=""System.String""><?xml version=""1.0"" encoding=""utf-8""?>"
                                   , @"<Content type=""System.String"">");

        var xmlDoc = new XmlDocument();
        xmlDoc.LoadXml(goodXML);

        XmlNodeList itemRefList = xmlDoc.GetElementsByTagName("Content");
        foreach (XmlNode xn in itemRefList)
        {
            Console.WriteLine(xn.InnerXml);
        }
    }
}

Output:

<Data><Items><Item><surname type="System.String">Some Surname</surname><name type="System.String">Some Name</name></Item></Items></Data>

Working DotNetFiddle: https://dotnetfiddle.net/ShmZCy

Perhaps needless to say: none of this would have been needed if whatever created this invalid XML had followed the common practice of wrapping the nested XML in a <![CDATA[ .... ]]> block.
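
If the exact text of the nested declaration is not known in advance, the same idea can be generalized with a regular expression. The following is only a sketch under some assumptions: the outer declaration sits at index 0 of the string (no leading white space or BOM), and no <?xml ...?> text appears anywhere it should be kept, such as inside a CDATA section. The names NestedXmlDemo and StripNestedDeclarations are made up for this example. It also shows one way to turn the content of <Content> into a separate document for later deserialization.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class NestedXmlDemo
{
    // Matches an XML declaration such as <?xml version="1.0" encoding="utf-8"?>.
    // The \s after "xml" keeps it from matching e.g. <?xml-stylesheet ...?>.
    private static readonly Regex XmlDeclaration =
        new Regex(@"<\?xml\s[^?]*\?>", RegexOptions.IgnoreCase);

    // Hypothetical helper: removes every declaration except one at index 0.
    public static string StripNestedDeclarations(string xml)
    {
        return XmlDeclaration.Replace(
            xml, m => m.Index == 0 ? m.Value : string.Empty);
    }

    public static void Main()
    {
        var badXml =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" +
            "<Data><Items><Item>" +
            "<Target type=\"System.String\">Some target</Target>" +
            "<Content type=\"System.String\">" +
            "<?xml version=\"1.0\" encoding=\"utf-8\"?>" +
            "<Data><Items><Item>" +
            "<name type=\"System.String\">Some Name</name>" +
            "</Item></Items></Data>" +
            "</Content></Item></Items></Data>";

        var outer = XDocument.Parse(StripNestedDeclarations(badXml));

        // After stripping, the nested document is the single child element
        // of <Content>; copying it into a new XDocument makes it standalone.
        var content = outer.Descendants("Content").First();
        var inner = new XDocument(content.Elements().First());

        Console.WriteLine(inner.Root.Descendants("name").First().Value);
        // prints: Some Name
    }
}
```

The inner XDocument can then be passed to whatever deserialization step comes next, for example through inner.CreateReader().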

JT. · Accepted Answer · 2019-04-01 12:28:40Z

The <?xml ...?> declaration is only valid at the very start of an XML document, so the XML that you've been given isn't well-formed. That makes it quite difficult to parse as is without either changing the source document (which you've indicated is not possible) or preprocessing the source.

You could try:

Stripping out the nested <?xml ?> declaration with regex or string manipulation, but the cure there may be worse than the disease.
Using the Html Agility Pack, which implements a more forgiving parser and may work with an XML document.

Other than that, the producer of the document should look to produce well-formed XML:

CDATA sections can help with this, but be aware that a CDATA section can't contain the ]]> end marker.
XML-escaping the nested XML text can work fine; that is, use the standard routines to turn < into &lt; and so forth.
XML namespaces can also help here, but they can be daunting in the beginning.
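
For illustration, here is a rough sketch of what the first two producer-side options could look like with LINQ to XML. The class name EmbedXmlDemo and the sample payload are invented for this example; the point is only that the nested document is stored as text, so the consumer reads it back as a string and parses it separately.

```csharp
using System;
using System.Xml.Linq;

public static class EmbedXmlDemo
{
    public static void Main()
    {
        // Made-up nested document, including its own declaration.
        var innerXml =
            "<?xml version=\"1.0\" encoding=\"utf-8\"?>" +
            "<Data><name>Some Name</name></Data>";

        // Option 1: store it as plain text; the markup characters are
        // escaped (e.g. < becomes &lt;) when the element is serialized.
        var escaped = new XElement("Content", innerXml);

        // Option 2: store it in a CDATA section, written out between
        // <![CDATA[ and ]]> without escaping.
        var cdata = new XElement("Content", new XCData(innerXml));

        Console.WriteLine(escaped);
        Console.WriteLine(cdata);

        // Consumer side: re-read the serialized element, take its text
        // value, and parse that text as a separate document.
        var received = XElement.Parse(cdata.ToString());
        var nested = XDocument.Parse(received.Value);
        Console.WriteLine(nested.Root.Element("name").Value);
    }
}
```

With either option the outer file stays well-formed, and the nested declaration is harmless because it is just character data until the consumer parses it on its own.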

Arvis Over a year ago

Tnx for extended advice
