How do I read individual XML nodes from a node that contains both CDATA and XML?

Question

I have several XML files in which some nodes contain both CDATA sections and regular XML nodes. I need to read the contents of these nodes, but I'm unsure how to determine whether a node is a normal XML node, a CDATA node, or a mix of both, where the CDATA portions at the beginning and end could contain anything. (I'm using XPath to reference my nodes, if that helps.)

Line used to retrieve the text content of the node:

contentObj.text = contentNode.selectSingleNode("./text").text;

Example of the xml causing the problem:

<text>
     <![CDATA[<P align=center>&nbsp;</P>
          <P align=center>]]>
     <media identifier="005896523">
          <label>
               <![CDATA[NOTE]]>
          </label>
          <description>
               <![CDATA[Image for NOTE]]>
          </description>
          <comments>Update Required</comments>
     </media>
    <![CDATA[</P>
       <P>&nbsp;</P>
       <P align=left>&nbsp;</P>]]>
</text>

CDATA is just another way to quote text. It is always part of a text node.
– choroba, Commented Oct 8, 2012 at 23:22
I understand that the CDATA is transparent when I call node.text. However, in the above XML, if I call node.text I get not only the first two lines that are contained in CDATA but also the text value of every non-CDATA node. I need to be able to separate the CDATA / XML / CDATA mixture in the example, or at least be able to identify that the node contains CDATA, as it might not in other iterations of the full XML structure.
– Reahreic, Commented Oct 8, 2012 at 23:32
You cannot tell apart a CDATA section and its surrounding text. If there is something (an element) between them, you can. For what node do you call node.text? Note that XPath can return a node list if there are several text nodes.
– choroba, Commented Oct 8, 2012 at 23:37
I use the following line to read the contents of the <text> node: contentObj.text = contentNode.selectSingleNode("./text").text; It retrieves the text encapsulated within the CDATA sections of the <text> node. However, for some unknown reason there are other XML nodes within the same node, and they aren't inside a CDATA section.
– Reahreic, Commented Oct 8, 2012 at 23:43
What do (./text/text())[1], (./text/text())[2], etc. return?
– choroba, Commented Oct 8, 2012 at 23:48

LarsH · Accepted Answer · 2012-10-14 03:55:52Z


When you say

contentNode.selectSingleNode("./text")

this returns of course the <text> element node; but when you then ask for the

.text

property of it, you are asking for the text content of the whole <text> element, which is the concatenation of the values of all its descendant text nodes.

If you want to select a single text node, try

contentNode.selectSingleNode("./text/text()[1]").text;

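That is, select the first text node child of the <text> element, then retrieve its text property. That should give you "<P align=center>&nbsp;</P> <P align=center>" (as unparsed text, not as an XML tree) in your example. Note that "&nbsp;" comes back as those literal six characters, because entity references are not expanded inside CDATA.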

In order to distinguish between CDATA and not-CDATA, you'll have to work around XPath, which is not designed to be able to distinguish between them. XML DOM on the other hand can, at least in some implementations. So you can try

var children = contentNode.selectNodes("./text/node()");

which will select a nodeList of all the children of the <text> element, including text nodes, element nodes, and possibly CDATA nodes. Iterate through the nodes in children and check their nodeType property to see whether it's NODE_CDATA_SECTION, NODE_TEXT, or something else.
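
The iteration described above could look roughly like the following sketch. It assumes a DOM-style node list exposing length and item(i), and nodes exposing nodeType, nodeName and nodeValue. The numeric codes 1, 3 and 4 are the standard DOM values for element, text and CDATA section nodes (MSXML calls them NODE_ELEMENT, NODE_TEXT and NODE_CDATA_SECTION). The helper name classifyChildren and the shape of the objects it returns are invented for illustration.

```javascript
// Standard DOM node type codes (MSXML: NODE_ELEMENT, NODE_TEXT, NODE_CDATA_SECTION).
var NODE_ELEMENT = 1;
var NODE_TEXT = 3;
var NODE_CDATA_SECTION = 4;

// Hypothetical helper: walk a node list and label each child by kind.
// "children" is assumed to expose .length and .item(i), as DOM node lists do.
function classifyChildren(children) {
  var result = [];
  for (var i = 0; i < children.length; i++) {
    var node = children.item(i);
    if (node.nodeType === NODE_CDATA_SECTION) {
      // Escaped markup that was wrapped in <![CDATA[ ... ]]>
      result.push({ kind: "cdata", value: node.nodeValue });
    } else if (node.nodeType === NODE_TEXT) {
      // Ordinary character data (may be whitespace only)
      result.push({ kind: "text", value: node.nodeValue });
    } else if (node.nodeType === NODE_ELEMENT) {
      // A real child element such as <media>
      result.push({ kind: "element", name: node.nodeName });
    } else {
      // Comments, processing instructions, and so on
      result.push({ kind: "other", type: node.nodeType });
    }
  }
  return result;
}
```

Called as classifyChildren(children) on the node list selected above, this would be expected to report a CDATA entry, then the media element, then another CDATA entry for the XML in the question, possibly interleaved with whitespace-only text entries depending on the parser's whitespace settings. It only works if the parser actually exposes CDATA sections as separate nodes.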

Let us know how it goes, and whether you need further help.

Edit

I assume from the fact that you accepted this answer that you were able to get things working, and I'm glad you were able to.

However, I don't want to let this go without emphasizing the caveat that @choroba was alluding to: a CDATA wrapper (around a chunk of text) is invisible to most XML tools (though the text content is visible). The XML data model doesn't know anything about CDATA sections, and the XML Infoset standard explicitly omits information about the boundaries of CDATA marked sections.

So, while you "got lucky" this time, in that you were using XML DOM which does provide information about CDATA sections, it is against the spirit of XML (and therefore unwise) to rely on that information to encode significant data in XML. For that reason, you would be well-served to encode that information some other way. Otherwise, if you ever need to use other XML tools on the data, you could get stuck.

I think the significant information you're trying to extract here is the fact that the text in the CDATA sections is escaped markup. E.g. it's pieces of HTML tags that are not supposed to be (or can't be) part of the XML tree. So you could encode that identification by surrounding each one with a custom element:

<text>
     <escaped><![CDATA[<P align=center>&nbsp;</P>
          <P align=center>]]></escaped>
     <media identifier="005896523">
     ...

Then in order to find these sections in the future, all you have to do is look for elements named <escaped>, which is a simple and natural task for any XML tool.
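
As a rough sketch of that lookup, assuming the files were changed to use the <escaped> wrapper: the hypothetical helper findEscaped below walks DOM-style nodes (assumed to expose nodeType, nodeName and a childNodes list with length and item(i)) and collects every descendant element named escaped. In an XPath-capable DOM such as the one used above, a call like contentNode.selectNodes(".//escaped") should give the same result without any manual recursion.

```javascript
// Hypothetical helper: collect all descendant elements named "escaped".
// Assumes DOM-style nodes: nodeType, nodeName, childNodes (.length, .item(i)).
function findEscaped(node, found) {
  found = found || [];
  var kids = node.childNodes;
  for (var i = 0; kids && i < kids.length; i++) {
    var child = kids.item(i);
    if (child.nodeType === 1) {            // 1 = element node
      if (child.nodeName === "escaped") {
        found.push(child);                 // wrapper around escaped markup
      }
      findEscaped(child, found);           // keep searching below this element
    }
  }
  return found;
}
```

Because the marker is now an element name rather than a CDATA boundary, this works the same way whether the escaped markup is written as a CDATA section or with &lt; and &amp; escapes.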

I don't know whether the design of these XML files is under your control or not. If not, you at least should have the option of sending feedback to the designer. If a designer who is not well-versed in XML things makes a design mistake, it's in their best interests to know about it, so that they might be able to correct it, or at least avoid the same mistake in future designs. If you're working under a chain of command, and the designer of the XML is in a different department, the appropriate route for feedback might be through your supervisor. It's in the department's best interest to know if they're producing non-portable XML designs.

edited Oct 14, 2012 at 3:55

answered Oct 9, 2012 at 11:11


4 Comments

Reahreic Over a year ago

In every XML document I've created, the CDATA doesn't share the same node with any other nodes. Got to love working with what someone else created. Sadly, the software that created the original data export I'm parsing (OpenXML specification) is from a well-established organization that, as I've come to find out, "messes with" the way data exports in order to generate additional income from consulting.

Reahreic Over a year ago

Thank you for your guidance. As I'm still learning certain programming practices, it's posts like yours that ensure I don't learn the bad ones. I only hope that when I become better I will be able to give back to the community by guiding others as you have.

LarsH Over a year ago

@Reahreic: thanks for your comments. Can you give me a pointer to the relevant OpenXML specification? I would like to verify that the CDATA sections really are significant, vs. whether just finding text nodes would do the trick.

Reahreic Over a year ago

I couldn't tell you which Open XML spec they're exporting to, only that they refer to the ability to import and export data to and from your applications using Open XML.
