2

I want to strip comments and other garbage or tags from the <body> section of an HTML document using PHP and regex, but my code does not work:

$str=preg_replace_callback('/<body>(.*?)<\/body>/s', 
    function($matches){
        return '<body>'.preg_replace(array(
            '/<!--(.|\s)*?-->/',
        ),
        array(
            '',
        ), $matches[1]).'</body>';
    }, $str);

The problem is that nothing happens: the comments stay where they are, and none of the other cleanup is applied either. Can you help? Thanks!

EDIT:

Thanks to @mhall I figured out that my regex did not work because of the attributes in the <body> tag. I used his code and updated it to this:

$str = preg_replace_callback('/(?=<body(.*?)>)(.*?)(?<=<\/body>)/s',
    function($matches) {
        return preg_replace('/<!--.*?-->/s', '', $matches[2]);
    }, $str);

This works perfectly!

Thanks people!

  • stackoverflow.com/a/1732454/3044080 Commented May 1, 2015 at 15:17
  • Why do you want to clean out the comments? You could use DOMDocument or another document parser to do this more easily. Commented May 1, 2015 at 15:25
  • Without getting into how unsuitable regex is for this, I think your problem comes from the <body>(.*?)<\/body> part. By default, "." does not match line breaks. You might want to replace it with [\s\S]. EDIT: Never mind, I didn't see the "s" flag. Commented May 1, 2015 at 15:47
  • @ExplosionPills I want a simple way to clean up some things from HTML. Commented May 1, 2015 at 15:59
  • Works for me (PHP 5.5.14), but it drops the <body>/</body> tags as well. What string are you trying with? Commented May 1, 2015 at 16:01
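
For reference, the DOMDocument route suggested in the comments could look roughly like the sketch below. The function name removeBodyComments and the sample markup are made up for illustration, and it assumes PHP's DOM extension is available. Note that saveHTML() re-serializes the whole document, so whitespace and the doctype may differ from the input.

```php
<?php
// Sketch only: remove comment nodes that sit inside <body>, using a parser
// instead of regex. removeBodyComments is a hypothetical helper name.
function removeBodyComments(string $html): string
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of emitting them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);

    // Copy the node list to an array before removing nodes from the tree.
    $comments = iterator_to_array($xpath->query('//body//comment()'));
    foreach ($comments as $comment) {
        $comment->parentNode->removeChild($comment);
    }

    return $doc->saveHTML();
}

$html = '<html><head><!-- keep me --><title>t</title></head>'
      . '<body class="home"><p>Hi <!-- drop me --> there</p></body></html>';

echo removeBodyComments($html), PHP_EOL;
```

Because the XPath query is limited to descendants of <body>, comments in <head> are left alone.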

2 Answers

2

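Try this. I changed the preg_replace_callback pattern to use a lookahead and a lookbehind, so the match already contains the <body> and </body> tags and they do not have to be added back in the callback. In preg_replace I replaced (.|\s)*? with .*?, dropped the array syntax, and added the /s modifier: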

$str = <<<EOS
<html>
    <body>
        <p>
             Here is some <!-- One comment --> text
             with a few <!--
                Another comment
             -->
             Comments in it
        </p>
    </body>
</html>
EOS;

$str = preg_replace_callback('/(?=<body>)(.*?)(?<=<\/body>)/s',
    function($matches) {
        return preg_replace('/<!--.*?-->/s', '', $matches[1]);
    }, $str);

echo $str, PHP_EOL;

Output:

<html>
    <body>
        <p>
             Here is some  text
             with a few 
             Comments in it
        </p>
    </body>
</html>
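
The pattern above only matches a bare <body> tag, and the asker's edit reports that attributes on the tag were the actual cause. A minimal sketch of a variant that tolerates attributes, assuming a single well-formed body element (the function name stripBodyComments and the sample markup are illustrative, not from the original post):

```php
<?php
// Sketch: strip HTML comments only inside <body>...</body>, allowing
// attributes on the opening tag. stripBodyComments is a hypothetical name.
function stripBodyComments(string $html): string
{
    $result = preg_replace_callback(
        // \b[^>]* accepts any attributes; /i ignores case; /s lets . span newlines.
        '/<body\b[^>]*>.*?<\/body>/is',
        function (array $m): string {
            // $m[0] is the whole body element, tags included.
            return preg_replace('/<!--.*?-->/s', '', $m[0]);
        },
        $html
    );

    // preg_replace_callback returns null on a PCRE error; fall back to the input.
    return $result ?? $html;
}

$html = '<html><head><!-- keep me --></head>'
      . '<body class="home"><p>Hi <!-- drop me --> there</p></body></html>';

echo stripBodyComments($html), PHP_EOL;
// prints: <html><head><!-- keep me --></head><body class="home"><p>Hi  there</p></body></html>
```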

1 Comment

Does your body tag have any class declarations or such, or is it just a plain <body>?

0

Aren't you making it too complicated? You don't need to jump in and out via a callback, since preg_replace will make replacements at every match:

$parts = explode("<body", $str, 2);
$clean = preg_replace('/<!--.*?-->/s', '', $parts[1]);
$str = $parts[0]."<body".$clean;

Splitting the string into head and body excludes the head from substitution without a lot of messy regexps. Note the s after the pattern: '/.../s'. This makes the dot in the regexp match embedded newlines along with other characters.
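
The same idea wrapped in a function, as a sketch with one added guard for input that has no <body tag at all (in that case $parts[1] would not exist). The function name stripCommentsAfterBody is made up for illustration. Two caveats under this approach: explode is case-sensitive, so <BODY> would not be found, and any comments after </body> are removed as well.

```php
<?php
// Sketch: split once at "<body", clean only the second part, and reassemble.
// stripCommentsAfterBody is a hypothetical helper name.
function stripCommentsAfterBody(string $html): string
{
    $parts = explode('<body', $html, 2);

    // No "<body" found: nothing to clean, return the input unchanged.
    if (count($parts) < 2) {
        return $html;
    }

    // /s makes the dot match newlines, so multi-line comments are removed too.
    $clean = preg_replace('/<!--.*?-->/s', '', $parts[1]);

    return $parts[0] . '<body' . $clean;
}

$html = "<head><!-- keep me --></head><body class=\"c\">a<!--\ndrop me\n-->b</body>";

echo stripCommentsAfterBody($html), PHP_EOL;
// prints: <head><!-- keep me --></head><body class="c">ab</body>
```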

2 Comments

No, because I need to keep some comments in the <head> tag for the browser switcher.
Oh, I see. But still it would be much cleaner to split the string in two with $parts = explode("<body", $str, 2);, substitute in $parts[1], and reassemble with $str = $parts[0]."<body".$parts[1];
