As a follow-up to my last question: if an XML file contains a malformed string, you can use preg_replace_callback() to pull out the contents of the elements that break parsing.
The point of this function is not to parse the XML with regex (a bad idea), but to find XML that doesn't parse, and where it fails, so that we can flag articles that aren't correctly formatted before they are sent out. This is part of a set of tools to clean content before delivery. I am testing it on known malformed public RSS URLs as well as internal ones to see whether it copes with a range of situations. The callback returns an integer index for the node that failed. If the XML parses after that, we can report the index of the article and then use DOMDocument to try to correct the HTML and parse again. If that fails, we report it as critical; otherwise, we return the parsed article description and content to the database, marking it as modified before delivery.
You can then run the broken elements through DOMDocument to clean them up before returning them to the XML file.
However, I'm stuck on how to make the example below return something other than false:
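As a rough illustration of the "find where it fails, then try to recover" idea described above, here is a minimal sketch using libxml's internal error handling. The sample feed, the unescaped ampersand and the variable names are all assumptions made up for the example; the real feeds may fail for other reasons.

```php
<?php
// Sketch only: $xml is a hypothetical broken feed (unescaped "&").
$xml = "<rss><channel><item><description>bad & text</description></item></channel></rss>";

// Collect libxml errors instead of having them emitted as PHP warnings.
libxml_use_internal_errors(true);

// Strict attempt: simplexml_load_string() returns false on malformed XML.
$strict = simplexml_load_string($xml);
$errors = libxml_get_errors();
libxml_clear_errors();

$failedLines = [];
foreach ($errors as $error) {
    // Each LibXMLError exposes the line number and a message.
    $failedLines[] = $error->line;
    echo "line {$error->line}: " . trim($error->message) . "\n";
}

// Second attempt: DOMDocument in recovery mode tries to build a tree anyway.
$dom = new DOMDocument();
$dom->recover = true;
$recovered = $dom->loadXML($xml);
libxml_clear_errors();
```

The line numbers collected in $failedLines could be used to decide which article to flag; whether the recovered tree is good enough to send on is a separate judgement.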
Sample XML:
<item>
<content:encoded><![CDATA[
This is the text with odd characters that are killing
simplexml_load_string() (doesn't recover) and breaking
(although recoverable) DOMDocument
]]></content:encoded>
</item>
If I use the following PHP, I can extract a description node and convert it from:
<description><![CDATA[
This is some description text with the same problem
]]></description>
to
<description>0</description>
PHP:
preg_replace_callback(
    '/<description>(.*)<\/description>/', // add msU modifiers to fix below
    'node_tidy::callback_description',
    $xml
);
...
private function callback_description($matches=false) {
    if(false !== $matches) {
        $this->arrDescriptions[] = $matches[1];
        return '<description>'.$this->indexDescriptions++.'</description>';
    } else {
        return false;
    }
}
However, when I try to do the same with content:encoded nodes, it returns false. Here's the related function:
private function callback_content_encoded($matches=false) {
    if(false !== $matches) {
        $this->arrContentEncoded[] = $matches[1];
        return '<content:encoded>'.$this->indexContentEncoded++.'</content:encoded>';
    } else {
        return false;
    }
}
To test whether the colon is the problem, I tried a plain regex:
<?php
$string = '<content:encoded>this is some text</content:encoded>';
preg_match('/<content\:encoded>(.*)<\/content\:encoded>/',$string,$matches);
echo '<pre>';
print_r($matches);
echo '</pre>';
?>
However, that did not print the expected array with or without adding \:. Could someone point me in the right direction for the misunderstanding here?
Many thanks!
UPDATE: Here's a sample snippet of the real xml that fails, as indicated by @Florent.
UPDATE: This regex matches the required content:
preg_match('/<content\:encoded>(.*)<\/content\:encoded>/msU',$string,$matches);
The m, s and U modifiers are explained here: http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php
I neglected to consider these modifiers.
The results are now brought back by this regex, including the original problem, so this can now be resolved.
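For anyone hitting the same thing, here is a small sketch of why the modifiers matter. The input string and the $bodies array are assumptions for illustration, not the real feed or the class above. The s modifier lets "." match newlines, and U makes ".*" ungreedy so it stops at the first closing tag. The m modifier only changes how ^ and $ behave, so it has no effect on this particular pattern, and the colon does not need escaping.

```php
<?php
// Hypothetical multi-line input modelled on the sample item above.
$xml = "<item>\n<content:encoded><![CDATA[\nline one\nline two\n]]></content:encoded>\n</item>";

// No modifiers: "." does not match "\n", so a multi-line body never matches.
$without = preg_match('/<content:encoded>(.*)<\/content:encoded>/', $xml);

// With "s" and "U": the body is captured and swapped for its index.
$bodies = [];
$result = preg_replace_callback(
    '/<content:encoded>(.*)<\/content:encoded>/sU',
    function (array $matches) use (&$bodies) {
        $bodies[] = $matches[1]; // keep the original body for later clean-up
        return '<content:encoded>' . (count($bodies) - 1) . '</content:encoded>';
    },
    $xml
);

var_dump($without);
echo $result, "\n";
```

The closure is used here only to keep the sketch self-contained; inside a class the same pattern should work with a method callback.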