
Need a regex for preg_replace.

This question wasn't answered in "another question" because not all the tags I want to remove are empty.

I need to remove not only empty tags from an HTML structure, but also tags that contain only line breaks, whitespace, and/or whitespace HTML entities.

Possible codes are:

<br /> &nbsp; &thinsp; &ensp; &emsp; &#8201; &#8194; &#8195;

BEFORE removing matching tags:

<div> 
  <h1>This is a html structure.</h1> 
  <p>This is not empty.</p> 
  <p></p> 
  <p><br /></p>
  <p> <br /> &thinsp;</p>
  <p>&nbsp;</p> 
  <p> &nbsp; </p> 
</div>

AFTER removing matching tags:

<div> 
  <h1>This is a html structure.</h1> 
  <p>This is not empty.</p> 
</div>

3 Answers


You can use the following:

<([^>\s]+)[^>]*>(?:\s*(?:<br \/>|&nbsp;|&thinsp;|&ensp;|&emsp;|&#8201;|&#8194;|&#8195;)\s*)*<\/\1>

And replace with '' (empty string)

See DEMO

Note: This will also work for empty html tags with attributes.
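
As a minimal sketch of how the pattern above could be applied with preg_replace: the `~` delimiters and the variable names are choices made for this illustration, and the sample input is the one from the question.

```php
<?php
// Pattern from the answer above, wrapped in ~ delimiters for preg_replace.
$pattern = '~<([^>\s]+)[^>]*>(?:\s*(?:<br \/>|&nbsp;|&thinsp;|&ensp;|&emsp;|&#8201;|&#8194;|&#8195;)\s*)*<\/\1>~';

// Sample input taken from the question.
$html = <<<HTML
<div>
  <h1>This is a html structure.</h1>
  <p>This is not empty.</p>
  <p></p>
  <p><br /></p>
  <p> <br /> &thinsp;</p>
  <p>&nbsp;</p>
  <p> &nbsp; </p>
</div>
HTML;

// Replace every matching tag with an empty string.
$clean = preg_replace($pattern, '', $html);
echo $clean;
```

Only the tags themselves are removed, so the indentation and line breaks that surrounded them stay behind as blank lines. If that matters, the output probably needs a separate whitespace cleanup pass.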


3 Comments

Hi, I updated your test and it fails when a tag contains only spaces. Here's the link with the error, and this other link has the adjustment. You just missed this before closing the capture group: |&#8195;)\s*|\s*)
This is a great answer, but it should be updated to also handle these two cases: <br> and <br/>
To exclude tags like iframe, canvas, etc preg_replace('~<((?!iframe|canvas)\w+)[^>]*>(?:\s*(?:<br \/>|&nbsp;|&thinsp;|&ensp;|&emsp;|&#8201;|&#8194;|&#8195;)\s*)*<\/\1>~iu', "", $html)
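
The adjustments suggested in these comments (whitespace-only tags, `<br>` and `<br/>`, an exclusion list) could be combined roughly as follows. This is a sketch under assumptions, not a tested drop-in: the function name `removeBlankTags` is hypothetical, and the keep-list goes beyond the `iframe` and `canvas` mentioned above because tags such as `script`, `textarea`, `td` and `th` are often legitimately empty. The loop is an extra step not taken from the comments. It repeats the replacement so that a wrapper which becomes empty after its children are removed is dropped as well.

```php
<?php
// Hypothetical helper; the keep-list is an assumption, adjust it to your markup.
function removeBlankTags(string $html): string
{
    // Tags that must survive even when they have no content.
    $keep = 'iframe|canvas|script|textarea|td|th';
    // Anything that counts as "blank" content inside a tag.
    $blank = '\s|<br\s*/?>|&nbsp;|&thinsp;|&ensp;|&emsp;|&#8201;|&#8194;|&#8195;';
    $pattern = '~<((?!(?:' . $keep . ')\b)\w+)[^>]*>(?:' . $blank . ')*</\1>~iu';

    // Repeat until a pass removes nothing, so newly emptied parents go too.
    do {
        $result = preg_replace($pattern, '', $html, -1, $count);
        if ($result === null) {
            // PCRE error (for example invalid UTF-8): return what we have so far.
            return $html;
        }
        $html = $result;
    } while ($count > 0);

    return $html;
}

echo removeBlankTags('<div><p> </p><p><br></p><p><br/>&nbsp;</p><p>Text</p><canvas></canvas></div>');
```

Like any regex over HTML, this only looks at the text between one opening tag and its closing tag, so markup with comments, CDATA or unusual attribute quoting may need a real parser instead.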

Use tidy, for example through the following function:

function cleaning($string, $tidyConfig = null) {
    $out = array();
    // Default tidy options, used when the caller passes no configuration.
    $config = array(
        'indent' => true,
        'show-body-only' => false,
        'clean' => true,
        'output-xhtml' => true,
        'preserve-entities' => true
    );
    if ($tidyConfig == null) {
        $tidyConfig = $config;
    }
    $tidy = new tidy();
    // Repair the markup and keep the full document as a string.
    $out['full'] = $tidy->repairString($string, $tidyConfig, 'UTF8');
    // Keep only what sits between <body> and </body>.
    $out['body'] = preg_replace("/.*<body[^>]*>|<\/body>.*/si", "", $out['full']);
    // Keep only what sits between <style> and </style>, wrapped in a fresh style tag.
    $out['style'] = '<style type="text/css">' . preg_replace("/.*<style[^>]*>|<\/style>.*/si", "", $out['full']) . '</style>';
    return $out;
}


I'm not so good with regex, but try this:

\<.*\>\s*\&.*sp;\s*\<\/.*\>|\<.*\>\s*\<\s*br\s*\/\>\s*\&.*sp;\s*\<\/.*\>|\<.*\>\s*\&.*sp;\s*\<\s*br\s*\/\>\<\/.*\>

Basically, it matches:

  • Tags with HTML space entities in them, OR
  • Tags with breaks occurring before HTML space entities in them, OR
  • Tags with breaks occurring after HTML space entities in them
