1

I have a wysiwyg on a site. The problem is that the users are copy pasting a lot of data in to it leaving a lot of unclosed and improperly formatted div tags that are breaking the site layout.

Is there an easy an easy way to strip all occurrences of <div> and </div>?

str_replace won't work because some of the divs have styling and other things in them so it would need to account for <div style="some styling"> <div align="center"> etc.

I'm guessing this could be done with a regular expression but I am total a total beginner when it comes to those.

2
  • 1
    So you need to remove all the div tags but not the content between the div. Am I right? Commented Mar 7, 2012 at 18:34
  • Replace the XPath with //div[not[@*]] to remove all div elements (incl. content) without attributes. Commented Mar 7, 2012 at 19:58

3 Answers 3

6

Better to use DOM for HTML parser but if you have no choice but to use RegEx then you can use it like this:

$patterns = array();
$patterns[0] = '/<div[^>]*>/';
$patterns[1] = '/<\/div>/';
$replacements = array();
$replacements[2] = '';
$replacements[1] = '';
echo preg_replace($patterns, $replacements, $html);
Sign up to request clarification or add additional context in comments.

2 Comments

this will also match any div elements with attributes. OP asked to exclude those.
My interpretation of the OP's question is that he wants to remove div elements with attributes in addition to the simple <div>.
0

No. You do NOT ever parse/manipulate HTML with regexes.

Regexes cannot be bargained with. They can't be reasoned with. They don't understand html, they don't grok xml. And they absolute will NOT stop until your DOM tree is dead.

You use htmlpurifier and/or DOM to manipulate the tree.

5 Comments

I disagree about the absoluteness of your opening statement. While it's true most of the time - even the famous question suggests that there are some situations where regex can be appropriate stackoverflow.com/a/1733489/46675
Yes, "known limited". OP's working with user-submitted arbitrary html, so that exception definitely doesn't apply.
+1. go naked. Something about the field generated by a living organism. Nothing dead will go.
Thanks, htmlpurifier was exactly what I needed :)
Nice Terminator reference
0

Here's a simplified example of how you could do it with PHP

    <?php
    /**
     * Removes the divs because why not
     */
    function strip_divs(&$text, $id = 'html') {
      $replacements = array();
      worker($text, $replacements, $id);

      foreach ($replacements as $key => $val) {
        $text = mb_str_replace($key, $val, $text);
      }

      return $text;
    }

    function worker(&$body, &$replacements, $id) {
      static $call_count;
      if (empty($call_count)) {
        $call_count = array();
      }
      if (empty($call_count[$id])) {
        $call_count[$id] = 0;
      }

      if (mb_strpos($body, '</div>')) {
        $body = mb_str_replace('</div>', '', $body);
      }

      if (mb_strpos($body, '<di') !== FALSE) {
        $call_count[$id] ++;
        // Gets the important junk
        $rm               = '<di' . xml_get($body, '<di', '>') . '>';
        // Builds the replacements HTML
        $replacement_html = '';

        $next_id                       = count($replacements);
        $replacement_id                = "[[div-$next_id]]";
        $replacements[$replacement_id] = $replacement_html;

        $body = mb_str_replace($rm, $replacement_id, $body);

        if (mb_strpos($body, '<di') !== FALSE && $call_count[$id] < 200) {
          worker($body, $replacements, $id);
        }
      }
    }


    /**
     * Returns text by specifying a start and end point
     *
     * @param str $str
     *   The text to search
     * @param str $start
     *   The beginning identifier
     * @param str $end
     *   The ending identifier
     */
    function xml_get($str, $start, $end) {
      $str = "|" . $str . "|";
      $len = mb_strlen($start);
      if (mb_strpos($str, $start) > 0) {
        $int_start = mb_strpos($str, $start) + $len;
        $temp      = right($str, (mb_strlen($str) - $int_start));
        $int_end   = mb_strpos($temp, $end);
        $return    = trim(left($temp, $int_end));
        return $return;
      }
      else {
        return FALSE;
      }
    }

    function right($str, $count) {
      return mb_substr($str, ($count * -1));
    }

    function left($str, $count) {
      return mb_substr($str, 0, $count);
    }

    /**
     * Multibyte str replace
     */
    if (!function_exists('mb_str_replace')) {

      function mb_str_replace($search, $replace, $subject, &$count = 0) {
        if (!is_array($subject)) {
          $searches     = is_array($search) ? array_values($search) : array($search);
          $replacements = is_array($replace) ? array_values($replace) : array($replace);
          $replacements = array_pad($replacements, count($searches), '');
          foreach ($searches as $key => $search) {
            $parts   = mb_split(preg_quote($search), $subject);
            $count += count($parts) - 1;
            $subject = implode($replacements[$key], $parts);
          }
        }
        else {
          foreach ($subject as $key => $value) {
            $subject[$key] = mb_str_replace($search, $replace, $value, $count);
          }
        }
        return $subject;
      }

    }

    $html = <<<HTML
    <table>
        <tbody>
            <tr>
                <td class="votecell">
                    <div class="vote">
                        <input type="hidden" name="_id_" value="9607101">
                        <a class="vote-up-off" title="This question shows research effort; it is useful and clear">up vote</a>
                        <span itemprop="upvoteCount" class="vote-count-post ">0</span>
                        <a class="vote-down-off" title="This question does not show any research effort; it is unclear or not useful">down vote</a>
                        <a class="star-off" href="#">favorite</a>
                        <div class="favoritecount"><b></b></div>
                    </div>
                </td>
                <td class="postcell">
                    <div>
                        <div class="post-text" itemprop="text">
                            <p>I have a wysiwyg on a site. The problem is that the users are copy pasting a lot of data in to it leaving a lot of unclosed and improperly formatted div tags that are breaking the site layout. </p>
                            <p>Is there an easy an easy way to strip all occurrences of <code>&lt;div&gt;</code> and <code>&lt;/div&gt;</code>?</p>
                            <p>str_replace won't work because some of the divs have styling and other things in them so it would need to account for <code>&lt;div style="some styling"&gt; &lt;div align="center"&gt;</code> etc</p>
                            <p>I'm guessing this could be done with a regular expression but I am total a total beginner when it comes to those. </p>
                            <p>Thanks a lot,
                                Martin
                            </p>
                        </div>
                        <div class="post-taglist">
                            <a href="/questions/tagged/php" class="post-tag js-gps-track" title="show questions tagged 'php'" rel="tag">php</a> <a href="/questions/tagged/regex" class="post-tag js-gps-track" title="show questions tagged 'regex'" rel="tag">regex</a> <a href="/questions/tagged/replace" class="post-tag js-gps-track" title="show questions tagged 'replace'" rel="tag">replace</a> <a href="/questions/tagged/str-replace" class="post-tag js-gps-track" title="" rel="tag">str-replace</a> <a href="/questions/tagged/strip-tags" class="post-tag js-gps-track" title="show questions tagged 'strip-tags'" rel="tag">strip-tags</a>
                        </div>
                        <table class="fw">
                            <tbody>
                                <tr>
                                    <td class="vt">
                                        <div class="post-menu"><a href="/q/9607101" title="short permalink to this question" class="short-link" id="link-post-9607101">share</a><span class="lsep">|</span><a href="/posts/9607101/edit" class="suggest-edit-post" title="">improve this question</a></div>
                                    </td>
                                    <td align="right" class="post-signature">
                                        <div class="user-info ">
                                            <div class="user-action-time">
                                                <a href="/posts/9607101/revisions" title="show all edits to this post">edited <span title="2012-03-07 18:32:29Z" class="relativetime">Mar 7 '12 at 18:32</span></a>
                                            </div>
                                            <div class="user-gravatar32">
                                            </div>
                                            <div class="user-details">
                                                <div class="-flair">
                                                </div>
                                            </div>
                                        </div>
                                    </td>
                                    <td class="post-signature owner">
                                        <div class="user-info ">
                                            <div class="user-action-time">
                                                asked <span title="2012-03-07 18:31:11Z" class="relativetime">Mar 7 '12 at 18:31</span>
                                            </div>
                                            <div class="user-gravatar32">
                                                <a href="/users/702826/martin-hunt">
                                                    <div class="gravatar-wrapper-32"><img src="https://www.gravatar.com/avatar/a578c3eae229c86dbe46d4b1603e071b?s=32&amp;d=identicon&amp;r=PG" alt="" width="32" height="32"></div>
                                                </a>
                                            </div>
                                            <div class="user-details">
                                                <a href="/users/702826/martin-hunt">Martin Hunt</a>
                                                <div class="-flair">
                                                    <span class="reputation-score" title="reputation score " dir="ltr">313</span><span title="7 silver badges"><span class="badge2"></span><span class="badgecount">7</span></span><span title="20 bronze badges"><span class="badge3"></span><span class="badgecount">20</span></span>
                                                </div>
                                            </div>
                                        </div>
                                    </td>
                                </tr>
                            </tbody>
                        </table>
                    </div>
                </td>
            </tr>
            <tr>
                <td class="votecell"></td>
                <td>
                    <div id="comments-9607101" class="comments ">
                        <table>
                            <tbody data-remaining-comments-count="0" data-canpost="false" data-cansee="true" data-comments-unavailable="false" data-addlink-disabled="true">
                                <tr id="comment-12187969" class="comment ">
                                    <td class="comment-actions">
                                        <table>
                                            <tbody>
                                                <tr>
                                                    <td class=" comment-score">
                                                        <span title="number of 'useful comment' votes received" class="cool">1</span>
                                                    </td>
                                                    <td>
                                                        &nbsp;
                                                    </td>
                                                </tr>
                                            </tbody>
                                        </table>
                                    </td>
                                    <td class="comment-text">
                                        <div style="display: block;" class="comment-body">
                                            <span class="comment-copy">So you need to remove all the div tags but not the content between the div. Am I right?</span>
                                            –&nbsp;<a href="/users/500725/siva-charan" title="14,075 reputation" class="comment-user">Siva Charan</a>
                                            <span class="comment-date" dir="ltr"><a class="comment-link" href="#comment12187969_9607101"><span title="2012-03-07 18:34:11Z" class="relativetime-clean">Mar 7 '12 at 18:34</span></a></span>
                                        </div>
                                    </td>
                                </tr>
                                <tr id="comment-12189778" class="comment ">
                                    <td>
                                        <table>
                                            <tbody>
                                                <tr>
                                                    <td class=" comment-score">
                                                        &nbsp;&nbsp;
                                                    </td>
                                                    <td>
                                                        &nbsp;
                                                    </td>
                                                </tr>
                                            </tbody>
                                        </table>
                                    </td>
                                    <td class="comment-text">
                                        <div style="display: block;" class="comment-body">
                                            <span class="comment-copy"><a href="http://stackoverflow.com/a/4667535/208809">Replace the XPath with <code>//div[not[@*]]</code></a> to remove all div elements (incl. content) without attributes.</span>
                                            –&nbsp;<a href="/users/208809/gordon" title="225,421 reputation" class="comment-user">Gordon</a>
                                            <span class="comment-date" dir="ltr"><a class="comment-link" href="#comment12189778_9607101"><span title="2012-03-07 19:58:21Z" class="relativetime-clean">Mar 7 '12 at 19:58</span></a></span>
                                            <span class="edited-yes" title="this comment was edited 2 times"></span>
                                        </div>
                                    </td>
                                </tr>
                            </tbody>
                        </table>
                    </div>
                    <div id="comments-link-9607101" data-rep="50" data-anon="true">
                        <a class="js-add-link comments-link disabled-link " title="Use comments to ask for more information or suggest improvements. Avoid answering questions in comments.">add a comment</a><span class="js-link-separator dno">&nbsp;|&nbsp;</span>
                        <a class="js-show-link comments-link dno" title="expand to show all comments on this post" href="#" onclick=""></a>
                    </div>
                </td>
            </tr>
        </tbody>
    </table>
    HTML;

    echo strip_divs($html);

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.