4

I'm trying to find everything inside a div using regexp. I'm aware that there is probably a smarter way to do this, but I've chosen regexp.

Currently my regexp pattern looks like this:

$gallery_pattern = '/<div class="gallery">([\s\S]*)<\/div>/';  

And it does the trick - somewhat.

The problem is if I have two divs one after the other, like this:

<div class="gallery">text to extract here</div>
<div class="gallery">text to extract from here as well</div>

I want to extract the information from both divs, but when testing I'm not getting the text of each div as a separate result. Instead I get:

"text to extract here </div>  
<div class="gallery">text to extract from here as well"

So to sum up: it skips the first closing </div> and continues on to the next one. The text inside the div can contain <, / and line breaks, just so you know!

Does anyone have a simple solution to this problem? I'm still a regexp novice.

1
  • I discussed the same thing with a friend a few weeks ago. The problem is that when you have nested tags like "<div class="gallery">some text<div>other text</div></div>", it is hard to make the expression not stop at the first </div>. Commented Aug 29, 2009 at 18:41

2 Answers

12

You shouldn't be using regex to parse HTML when there's a convenient DOM library:

$str = '
<div class="gallery">text to extract here</div>
<div class="gallery">text to extract from here as well</div>
';

$doc = new DOMDocument();
$doc->loadHTML($str);
$divs = $doc->getElementsByTagName('div');

// DOMNodeList exposes its size as ->length (count() only works on it from PHP 7.2)
if ( $divs->length ) {
    foreach ( $divs as $div ) {
        // nodeValue holds the text content of the element
        echo $div->nodeValue . '<br>';
    }
}
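
The loop above visits every div in the document, not only the ones with class="gallery". A minimal sketch of one way to narrow it down, assuming DOMXPath is acceptable here; the extra "other" div and the nested div are made-up test data, and $query and $texts are names introduced only for this illustration:

```php
<?php
// Sketch: select only <div class="gallery"> elements with DOMXPath.
// The second and third divs are hypothetical test data, not from the question.
$str = '
<div class="gallery">text to extract here</div>
<div class="other">not wanted</div>
<div class="gallery">outer <div>inner text</div></div>
';

$doc = new DOMDocument();
// Collect libxml warnings internally instead of printing them
libxml_use_internal_errors(true);
$doc->loadHTML($str);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Match divs whose class attribute contains the token "gallery"
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " gallery ")]';

$texts = [];
foreach ($xpath->query($query) as $div) {
    // textContent includes the text of nested elements as well
    $texts[] = trim($div->textContent);
}

print_r($texts);
```

Because the parser tracks nesting itself, the inner div does not cut the outer gallery div short, which is the case the regex approach struggles with.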

11

What about something like this:

$str = <<<HTML
<div class="gallery">text to extract here</div>
<div class="gallery">text to extract from here as well</div>
HTML;

$matches = array();
preg_match_all('#<div[^>]*>(.*?)</div>#s', $str, $matches);

var_dump($matches[1]);

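Note the '?' after '.*' in the regex: it makes the quantifier "not greedy", so the match stops at the first </div> instead of running on to the last one.

Which will get you: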

array
  0 => string 'text to extract here' (length=20)
  1 => string 'text to extract from here as well' (length=33)

This should work fine, as long as you don't have nested divs. If you do... well, are you really sure you want to use regular expressions to parse HTML, which is not all that regular itself?
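
To make the difference concrete, here is a small sketch that runs a greedy and a non-greedy version of the pattern on the same input, and then shows where the non-greedy one falls short on nested divs. The sample strings and the variable names ($flat, $nested, $greedy, $lazy, $cut) are made up for this illustration:

```php
<?php
// Sketch: greedy vs. non-greedy quantifier, plus the nested-div limitation.
$flat = '<div class="gallery">one</div>' . "\n" . '<div class="gallery">two</div>';

// Greedy: (.*) runs to the LAST </div>, so both divs collapse into one match
preg_match_all('#<div[^>]*>(.*)</div>#s', $flat, $greedy);

// Non-greedy: (.*?) stops at the FIRST </div> after each opening tag
preg_match_all('#<div[^>]*>(.*?)</div>#s', $flat, $lazy);

// Nested divs: the non-greedy match ends at the inner </div>,
// so the captured content is cut short
$nested = '<div class="gallery">outer <div>inner</div> tail</div>';
preg_match_all('#<div[^>]*>(.*?)</div>#s', $nested, $cut);

var_dump($greedy[1], $lazy[1], $cut[1]);
```

The greedy pattern yields a single capture spanning both divs, the non-greedy one yields 'one' and 'two', and on the nested input the capture stops at 'outer <div>inner' and loses ' tail'.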

1 Comment

@Filip: I would recommend using DOM and loadHTML too, actually. I have done so several times in other answers (see stackoverflow.com/questions/1274020/… for instance): HTML is not something that can be properly parsed with regexes... not regular enough, I suppose ^^
