Regex for colon in an xml tag when parsing fails with php and simplexml_load_string

Question

In a follow up to my last question, if you have a string that is malformed in an xml file, you can extract the contents using preg_replace_callback() to remove the elements that break.

The point of this function is not to parse the XML with regex (a bad idea), but to find XML that doesn't parse, and where it fails, so that we can flag articles that aren't correctly formatted before they are sent out. This is part of a set of tools to clean content before delivery. I am testing it on known malformed public RSS URLs as well as internal ones to see whether it copes with a number of situations. The callback returns an integer index for the node that failed. If the XML parses once that node has been swapped out, we can report the index of the article, use DOMDocument to correct the HTML, and try again. If that fails, we report it as critical; otherwise, we return the parsed article description and content to the database, marking it as modified before delivery.

You can then take the broken elements and run them through DOMDocument to format them better to return to the XML file.
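For the "find where it fails" step, a minimal sketch (assuming PHP 7 or later; `find_xml_errors` is a hypothetical helper name, not part of the original tool) is to let libxml collect its own errors, which carry line and column numbers:

```php
<?php
// Hypothetical helper: returns libxml's error descriptions for $xml,
// or an empty array if simplexml_load_string() can parse it.
function find_xml_errors(string $xml): array
{
    // Collect errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $parsed = simplexml_load_string($xml);

    $errors = [];
    if ($parsed === false) {
        foreach (libxml_get_errors() as $error) {
            $errors[] = sprintf(
                'line %d, column %d: %s',
                $error->line,
                $error->column,
                trim($error->message)
            );
        }
    }

    // Restore the previous error-handling mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $errors;
}

// Made-up samples: the second contains an unescaped ampersand.
$good = '<rss><item><description>fine</description></item></rss>';
$bad  = '<rss><item><description>broken & unescaped</description></item></rss>';

echo count(find_xml_errors($good)), "\n"; // 0
print_r(find_xml_errors($bad));
```

The reported line numbers can then be mapped back to the article that contains them.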

However, I'm stuck on how to make this example below return other than false:

Sample XML:

<item>
    <content:encoded><![CDATA[
        This is the text with odd characters that are killing 
        simplexml_load_string() (doesn't recover) and breaking 
        (although recoverable) DOMDocument
    ]]></content:encoded>
</item>

If I use the following PHP, I can extract a description node and convert it from:

<description><![CDATA[
    This is some description text with the same problem
]]></description>

to

<description>0</description>

PHP:

preg_replace_callback(
    '/<description>(.*)<\/description>/', // add the msU modifiers to fix this (see the update below)
    'node_tidy::callback_description',
    $xml
);

...

private function callback_description($matches=false) {
    if(false !== $matches) {
        $this->arrDescriptions[] = $matches[1];
        return '<description>'.$this->indexDescriptions++.'</description>';
    } else {
        return false;
    }
}

However, when I try to do the same with content:encoded nodes, it returns false. Here's the related function:

private function callback_content_encoded($matches=false) {
    if(false !== $matches) {
        $this->arrContentEncoded[] = $matches[1];
        return '<content:encoded>'.$this->indexContentEncoded++.'</content:encoded>';
    } else {
        return false;
    }
}

Using a straight regex, to test if it's the colon, I used this:

<?php

$string = '<content:encoded>this is some text</content:encoded>';
preg_match('/<content\:encoded>(.*)<\/content\:encoded>/',$string,$matches);

echo '<pre>';
print_r($matches);
echo '</pre>';

?>

However, that did not print the expected array with or without adding \:. Could someone point me in the right direction for the misunderstanding here?

Many thanks!

UPDATE: Here's a sample snippet of the real xml that fails, as indicated by @Florent.

http://pastebin.com/7z0f3MJP

UPDATE: This regex matches the required content:

preg_match('/<content\:encoded>(.*)<\/content\:encoded>/msU',$string,$matches);

The m and s and U modifiers are explained better here: http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php

I neglected to consider these modifiers.

The results are now brought back by this regex, including the original problem, so this can now be resolved.
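Putting the pieces together, the extract-and-replace step might look like the sketch below. It assumes PHP 7.4 or later; `NodeStash` and `stash` are hypothetical names standing in for the `node_tidy` class, and it uses a lazy `.*?` with only the `s` modifier rather than `msU`.

```php
<?php
// Hypothetical stand-in for the node_tidy class in the question.
final class NodeStash
{
    /** @var string[] Extracted node bodies; the array index is the placeholder. */
    public array $bodies = [];

    // Replace the body of every <$tag>...</$tag> node with its index.
    public function stash(string $xml, string $tag): string
    {
        $quoted = preg_quote($tag, '/');
        // s lets the dot match newlines; .*? stops at the first closing tag.
        $pattern = '/<' . $quoted . '>(.*?)<\/' . $quoted . '>/s';

        $result = preg_replace_callback(
            $pattern,
            function (array $m) use ($tag): string {
                $this->bodies[] = $m[1];
                $index = count($this->bodies) - 1;
                return "<{$tag}>{$index}</{$tag}>";
            },
            $xml
        );

        // preg_replace_callback() returns null on a PCRE error;
        // this sketch simply falls back to the untouched input.
        return $result ?? $xml;
    }
}

// Made-up sample with a multi-line CDATA body.
$xml = "<item>\n<content:encoded><![CDATA[\nline one\nline two\n]]></content:encoded>\n</item>";

$stash = new NodeStash();
echo $stash->stash($xml, 'content:encoded'), "\n";
```

Passing a closure avoids the string callback `'node_tidy::callback_description'`, which refers to the method statically even though the method relies on `$this`.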

Second downvote without any explanation. Care to explain why so I can adjust the question accordingly?
– MyStream, Commented Jul 9, 2012 at 15:12
Why do you say that your regex did not print the expected array? I tried it and got the node content.
– Florent, Commented Jul 9, 2012 at 15:18
Perhaps my use-case is too narrow - I'll post the exact string I'm attempting to parse, and perhaps the answer lies in there. Please check again?
– MyStream, Commented Jul 9, 2012 at 15:21
Maybe your content is multiline. Did you try to add the m flag to your pattern?
– Florent, Commented Jul 9, 2012 at 15:26
Ah - multiline - I think in most cases it would be multiline content to match against. I've updated the regex in the question to multiline and added s for the dot, as described here: php.net/manual/en/reference.pcre.pattern.modifiers.php
– MyStream, Commented Jul 9, 2012 at 15:54

Florent · Accepted Answer · 2012-07-09 16:01:44Z

1

You should add the following flags to your regex:

m to enable multiline strings
u to enable UTF8 strings (if necessary)
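
A quick way to check which flag changes the result is to run the same pattern under each one. This is only a sketch against a made-up sample string (not the pastebin content). On this sample, only the patterns that include `s` match, because `s` is what lets `.` cross newlines, while `m` only changes how `^` and `$` behave:

```php
<?php
// Made-up multi-line sample; the node body spans several lines.
$string = "<content:encoded><![CDATA[\nfirst line\nsecond line\n]]></content:encoded>";
$base   = '<content:encoded>(.*)</content:encoded>';

// Try the same pattern with different modifier combinations.
$results = [];
foreach (['', 'm', 'u', 's', 'msU'] as $flags) {
    // preg_match() returns 1 on a match and 0 on no match.
    $results[$flags] = preg_match('~' . $base . '~' . $flags, $string);
}

print_r($results);
```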


1 Comment

MyStream Over a year ago

Thank you :)- in particular, I found the m and u to both be essential in getting the xml content out from malformed xml nodes with CDATA wrappers. Spot on.

user557597 · Accepted Answer · 2012-07-09 18:17:05Z

0

The multi-line modifier (/m) has no effect here because the pattern contains no ^ or $ anchors, so it's not needed. Only the /s (dot-all) modifier is necessary. The /U (un-greedy) modifier should never be used (in my opinion). The /u (unicode) modifier should be used.

If you are looking to unwrap HTML inside a CDATA structure, it's better to follow the W3C specification for CDATA sections, even though your XML uses namespace names for its tags. This only applies if the CDATA section is the only thing inside the XML element, and it assumes the XML is well formed.

In the real world, comments could wrap a CDATA section and vice versa, as well as hiding many other things. So, the reality is that regex might be able to parse through malformed XML and then recover, but it's not reliable and it is certainly more complicated.

That being said, this will extract the CDATA from your example and only in its literal sense.

if (preg_match(
   '~<content:encoded\s*>
       \s*
       <!\[CDATA\[ (.*?) \]\]>
       \s*
     </content:encoded\s*>~xsu',
    $string,
    $matches) )
{
 print ( $matches[1] );
}
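
The answer doesn't say why it avoids /U. One common argument, offered here as an assumption about the reasoning, is that /U silently inverts the meaning of every quantifier in the pattern, so `.*` becomes lazy and `.*?` becomes greedy. A small sketch on a made-up string shows the effect:

```php
<?php
// Made-up sample with two sibling nodes on one line.
$string = '<d>one</d><d>two</d>';

preg_match('~<d>(.*)</d>~s',   $string, $greedy);   // default: .* is greedy
preg_match('~<d>(.*?)</d>~s',  $string, $lazy);     // explicit lazy quantifier
preg_match('~<d>(.*)</d>~sU',  $string, $flagged);  // U flips .* to lazy
preg_match('~<d>(.*?)</d>~sU', $string, $inverted); // U flips .*? back to greedy

echo $greedy[1], "\n";   // one</d><d>two
echo $lazy[1], "\n";     // one
echo $flagged[1], "\n";  // one
echo $inverted[1], "\n"; // one</d><d>two
```

Writing `.*?` where laziness is wanted keeps that intent visible in the pattern itself instead of in a trailing flag.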


1 Comment

MyStream Over a year ago

Hi, what we found was that some XML was well formed without the content:encoded and description nodes, but wouldn't parse with those nodes in place, even though their content was wrapped in CDATA correctly. By removing just those elements (all content) and then trying to reparse, we could continue. Then we can run the extracted content through a number of checks, including making it clean HTML with DOMDocument first, making sure entities within it are escaped, and putting it back. It's far from ideal, but seems to be helping correct the bulk of the issues. Why would you not use /U, specifically?
