I am working on modifying the contents of an XML file generated by some other library. I'm making some DOM modifications with PHP (5.3.10) and reinserting a replacement node.

The XML data I'm working with has &quot; entities before I do the manipulation, and I want to keep those entities as per http://www.w3.org/TR/REC-xml/ when I'm done with the modifications.

However, I'm having problems with PHP changing the &quot; entities into literal " characters. See my example.

$temp = 'Hello &quot;XML&quot;.';
$doc = new DOMDocument('1.0', 'utf-8');
$newelement = $doc->createElement('description', $temp);
$doc->appendChild($newelement);
echo $doc->saveXML() . PHP_EOL; // shows " instead of the &quot; entity
$node = $doc->getElementsByTagName('description')->item(0);
echo $node->nodeValue . PHP_EOL; // also shows "

Output

<?xml version="1.0" encoding="utf-8"?> 
<description>Hello "XML".</description>

Hello "XML".

Is this a PHP error or am I doing something wrong? I hope it isn't necessary to use createEntityReference in every char location.

Similar Question: PHP XML Entity Encoding issue


EDIT: Here is an example showing that saveXML() converts the &quot; entities but leaves &amp; intact, which is the behavior I expect for both. I want this $temp string to be output by saveXML() exactly as it was entered, entities included.

$temp = 'Hello &quot;XML&quot; &amp;.';
$doc = new DOMDocument('1.0', 'utf-8');
$newelement = $doc->createElement('description', $temp);
$doc->appendChild($newelement);
echo $doc->saveXML() . PHP_EOL; // shows " instead of the &quot; entity, while &amp; is kept
$node = $doc->getElementsByTagName('description')->item(0);
echo $node->nodeValue . PHP_EOL; // also shows " &

Output

<?xml version="1.0" encoding="utf-8"?>
<description>Hello "XML" &amp;.</description>

Hello "XML" &.
  • Maybe this is of some use? Interesting - I created a new DOMText($temp); as a text node, then appended that to $newelement (an empty <description> node), and the result I got was almost right: <description>Hello &amp;quot;XML&amp;quot;.</description> Commented Feb 8, 2015 at 21:48
  • @MichaelBerkowski That is interesting. If you used my string $temp, which was already encoded, then your method double encoded it, but it did keep the encoding properly during saveXML. Can you describe more about what you're doing? I get an 'Invalid Character Error' when I try the DOMText. Commented Feb 9, 2015 at 4:21
  • I don't see what's wrong with having double quotes unencoded in an element's node value? They get escaped only when inside attribute values. Commented Feb 9, 2015 at 4:21
  • @Ja͢ck the XML spec is for double quotes to be encoded inside any text node. Commented Feb 9, 2015 at 4:27
  • Well, the spec only mentions & and < to require escaping in the contents; escaping of single and double quotes is only applicable in attributes. Commented Feb 9, 2015 at 4:30

1 Answer

The answer is that a double quote in element content doesn't actually need any escaping according to the spec (skipping the mentions of CDATA):

The ampersand character (&) and the left angle bracket (<) must not appear in their literal form (...) If they are needed elsewhere, they must be escaped using either numeric character references or the strings " &amp; " and " &lt; " respectively. The right angle bracket (>) may be represented using the string " &gt; " (...)

To allow attribute values to contain both single and double quotes, the apostrophe or single-quote character (') may be represented as " &apos; ", and the double-quote character (") as " &quot; ".

You can verify this easily by using createTextNode() to perform the correct escaping:

$dom = new DOMDocument;
$e = $dom->createElement('description');
$content = 'single quote: \', double quote: ", opening tag: <, ampersand: &, closing tag: >';
$t = $dom->createTextNode($content);
$e->appendChild($t);
$dom->appendChild($e);

echo $dom->saveXML();

Output:

<?xml version="1.0"?>
<description>single quote: ', double quote: ", opening tag: &lt;, ampersand: &amp;, closing tag: &gt;</description>
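
As a complementary check, here is a minimal sketch showing that an XML parser treats the two spellings as the same text. It assumes PHP's DOM extension is available; the variable names ($withEntity, $withLiteral, $a, $b) are only illustrative and do not come from the question.

```php
<?php
// Parse one document that spells the quote as &quot; ...
$withEntity = new DOMDocument();
$withEntity->loadXML('<description>Hello &quot;XML&quot;.</description>');

// ... and one that uses the literal character in element content.
$withLiteral = new DOMDocument();
$withLiteral->loadXML('<description>Hello "XML".</description>');

// Read the text content of each root element.
$a = $withEntity->documentElement->nodeValue;
$b = $withLiteral->documentElement->nodeValue;

var_dump($a === $b); // bool(true)
echo $a . PHP_EOL;   // Hello "XML".
```

Because both forms yield identical text once parsed, a conforming consumer should not depend on which one saveXML() emits.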

3 Comments

This is interesting because I assumed that the XML internals in WordPress were encoding correctly, and I was trying to duplicate their XML when returning it for processing by their code. I've run into enough bugs with PHP and entities that I assumed something was amiss on the PHP side and WordPress had it right. I will have to look into their code to see how they are producing these entities in their XML. For whatever reason, it is causing issues with how WP parses the XML I've modified.
One quick question. When you mention that encoding is 'only applicable in attributes', does this include embedded HTML attributes? For example, if the text node has HTML tags with attributes in it?
A text node can't have html tags, and as such the opening tag must be escaped.
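
To illustrate the last comment, here is a small sketch of what happens when a string that looks like HTML is stored in a text node. The sample markup and the names $dom, $e, and $xml are assumptions made up for this example. The angle brackets are escaped, so the "attribute" is just ordinary text and its quotes stay literal.

```php
<?php
$dom = new DOMDocument();
$e = $dom->createElement('description');

// The whole string becomes a single text node; no <a> element is created.
$e->appendChild($dom->createTextNode('<a href="http://example.com/">link</a>'));
$dom->appendChild($e);

// Serialize only the element, without the XML declaration.
$xml = $dom->saveXML($e);
echo $xml . PHP_EOL;
// <description>&lt;a href="http://example.com/"&gt;link&lt;/a&gt;</description>
```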
