3

I researched this quite a bit but couldn't find a working example of how to match nested HTML tags with attributes. I know it is possible to match balanced/nested innermost tags without attributes (for example, a regex for <div> and </div> would be #<div\b[^>]*>(?:(?> [^<]+ ) |<(?!div\b[^>]*>))*?</div>#x).

However, I would like to see a regex pattern that finds an html tag pair with attributes.

Example: It basically should match

<div class="aaa"> **<div class="aaa">** <div> <div> </div> **</div>** </div>

and not

<div class="aaa"> **<div class="aaa">** <div> <div> **</div>** </div> </div>

Does anybody have any ideas?

For testing purposes we could use: http://www.lumadis.be/regex/test_regex.php


P.S. Steven mentioned a solution on his blog (actually in a comment), but it doesn't work:

http://blog.stevenlevithan.com/archives/match-innermost-html-element

$regex = '/<div\b[^>]+?\bid\s*=\s*"MyID"[^>]*>(?:((?:[^<]++|<(?!\/?div\b[^>]*>))+)|(<div\b[^>]*>(?>(?1)|(?2))*<\/div>))?<\/div>/i';
  • It is usually not a good idea to try and parse html/xml with regex. If you could tell us specifically what you are trying to do, we may be able to point you in a more appropriate direction :o) Commented Jun 19, 2010 at 16:32
  • Just to clarify: this is more of a theoretical discussion, just for fun. Of course, in real life I would use XPath or similar. I understand that "finite state" or "true" regexes are not able to do that, but what about the PHP/PCRE flavor of regex (which is not really "classical" regex anymore; for example, it even supports recursive patterns via (?R))? – Dave Commented Jun 20, 2010 at 0:44

3 Answers

6

To match the innermost pairs of <div> and </div> tags, including their attributes and content:

#<div(?:(?!(<div|</div>)).)*</div>#s

The key here is that (?:(?!STRING).)* is to strings as [^CHAR]* is to characters.

Credit: https://stackoverflow.com/a/6996274
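The analogy can be checked on small inputs. The following is only a sketch in JavaScript (whose regex engine also supports negative lookahead and the `s` flag); the sample strings and variable names are made up for illustration and are not part of the original answer.

```javascript
// [^x]* consumes characters until the first "x".
const charStop = /a[^x]*/.exec('abcxdef')[0];
// → "abc"

// (?:(?!xyz).)* consumes characters until the first occurrence of the
// string "xyz": the lookahead is re-checked before every single character.
const stringStop = /a(?:(?!xyz).)*/s.exec('abxcxyzdef')[0];
// → "abxc"

// Same idea applied to tags: the body may contain neither "<div" nor
// "</div>", so only an innermost pair can match.
const sample = '<div id="1"><div id="2">in 2</div></div>';
const innermost = sample.match(/<div(?:(?!<div|<\/div>).)*<\/div>/gs);
// → ['<div id="2">in 2</div>']

console.log(charStop, stringStop, innermost);
```

Note that a lone "x" in the second input does not stop the match; only the full string "xyz" does.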


Example in PHP:

<?php

$text = <<<'EOD'
<div id="1">
  in 1
  <div id="2">
    in 2
    <div id="3">
      in 3
    </div>
  </div>
</div>
<div id="4">
  in 4
  <div id="5">
    in 5
  </div>
</div>
EOD;

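// Collect every innermost <div>...</div>; the "s" modifier lets "." match newlines.
$matches = array();
preg_match_all('#<div(?:(?!(<div|</div>)).)*</div>#s', $text, $matches);

// $matches[0] holds the full-pattern matches.
foreach ($matches[0] as $match) {
  echo "************" . "\n" . $match . "\n";
}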

Outputs:

************
<div id="3">
      in 3
    </div>
************
<div id="5">
    in 5
  </div>

2

RegEx match open tags except XHTML self-contained tags

And indeed, it is absolutely impossible. HTML has something unique, something magical, which is immune to RegEx.

3 Comments

"something magical, which is immune to RegEx" == XML, HTML, and friends are not regular languages
It's bad enough having to see links to The Rant in every other question; copying it is going too far. It isn't that funny, and more to the point, it isn't helpful.
Just to clarify: this is more of a theoretical discussion, just for fun. Of course, in real life I would use XPath or similar. I understand that "finite state" or "true" regexes are not able to do that, but what about the PHP/PCRE flavor of regex (which is not really "classical" regex anymore; for example, it even supports recursive patterns via (?R))?
0

You can do it iteratively: apply the same regex repeatedly until the string stops changing. Like this:

function htmlToPlainText(html) {
    let text = html || ''

    // Some attribute values contain HTML themselves, so one pass is not enough:
    // keep stripping innermost tags until a pass leaves the text unchanged.
    while (text !== (text = text.replace(/<[^<>]*>/g, '')));

    return text
}

This works with cases like:

<p data-attr="<span>Oh!</span>">Lorem Ipsum</p>
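To see why more than one pass is needed for that input, here is a small tracing sketch. The helper name `stripTagsPasses` is hypothetical and only added for illustration; it applies the same `/<[^<>]*>/g` replacement and records the text after each pass that changed something.

```javascript
// Hypothetical helper: returns the intermediate strings produced by
// repeatedly removing every "<...>" run that contains no "<" or ">".
function stripTagsPasses(html) {
  const passes = [];
  let text = html || '';
  for (;;) {
    const next = text.replace(/<[^<>]*>/g, '');
    if (next === text) break; // nothing left to strip
    passes.push(next);
    text = next;
  }
  return passes;
}

const passes = stripTagsPasses('<p data-attr="<span>Oh!</span>">Lorem Ipsum</p>');
console.log(passes);
// → [ '<p data-attr="Oh!">Lorem Ipsum', 'Lorem Ipsum' ]
```

The first pass cannot remove the opening `<p ...>` because it still contains `<` and `>` characters inside the attribute; once the inner `<span>` tags are gone, the second pass removes it. The attribute text "Oh!" is discarded along with the tag.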

I found this script here: http://blog.stevenlevithan.com/archives/reverse-recursive-pattern
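The loop above strips tags; it does not find the closing tag that balances a given opening tag, which is what the question asks for. JavaScript regexes have no recursion like PCRE's (?R), so one alternative (not taken from any of the answers) is to use a regex only as a tokenizer and count nesting depth by hand. The function name `matchBalancedDiv`, its parameters, and the sample input are assumptions made for this sketch. It also assumes that attribute values contain no `>` and ignores comments, scripts, and malformed markup.

```javascript
// Hypothetical helper: returns the substring from the first <div ...> whose
// opening tag satisfies `openRe` up to the </div> that balances it, or null.
// `openRe` must not use the "g" flag, since test() would then keep state.
function matchBalancedDiv(html, openRe) {
  const tagRe = /<div\b[^>]*>|<\/div\s*>/gi; // any opening or closing div tag
  let depth = 0;
  let start = -1;
  let m;
  while ((m = tagRe.exec(html)) !== null) {
    const isClose = m[0][1] === '/';
    if (start === -1) {
      // Still looking for the opening tag with the wanted attributes.
      if (!isClose && openRe.test(m[0])) {
        start = m.index;
        depth = 1;
      }
    } else if (isClose) {
      depth -= 1;
      if (depth === 0) return html.slice(start, tagRe.lastIndex);
    } else {
      depth += 1;
    }
  }
  return null; // no such tag, or it is never closed
}

const nested = '<div id="outer"> <div class="aaa"> <div> <div> </div> </div> </div> </div>';
const balanced = matchBalancedDiv(nested, /\bclass\s*=\s*"aaa"/);
console.log(balanced);
// → <div class="aaa"> <div> <div> </div> </div> </div>
```

For real documents a DOM or XPath query remains the more robust choice, as the question itself acknowledges.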
