I have a simple question for regex gurus. And yes... I did try several different variations of the regex before posting here. Forgive my regex ignorance. This is targeting PHP.

I have the following HTML:

<div>
    <h4>
        <a href="somelink.html">some text blah</a>
    </h4>
    I need this text<br />I need this text too.<br />
</div>
<div>
    <h4>
        <a href="somelink.html">some text blah</a>
    </h4>
    I need this text<br />I need this text too.<br />
</div>
<div>
    <h4>
        <a href="somelink.html">some text blah</a>
    </h4>
    I need this text<br />I need this text too.<br />
</div>

What I tried that seemed most likely to work:

 preg_match_all('/<div><h4><a href=".*">.*<\/a><\/h4>(.*)<br \/>(.*)<br \/>/', $haystack, $result);

The above returns nothing.

So then I tried this and I got the first group to match, but I have not been able to get the second.

preg_match_all('/<div><h4><a href=".*">.*<\/a><\/h4>(.*)<br \/>/', $haystack, $result);

Thank you!

3 Answers

Regex is great, but some things are best tackled with a parser. Markup is one such example.

Instead of using regex, I'd use an HTML parser, like http://simplehtmldom.sourceforge.net/

However, if you insist on using regex for this specific case, you can use this pattern:

if (preg_match('%</h4>(\\r?\\n)\\s+(.*?)(<br />)(.*?)(<br />)%', $subject, $regs)) {
    $first_text_string = $regs[2];
    $second_text_string = $regs[4];
} else {
    //pattern not found
}
2 Comments

A comparative list of alternatives to simplehtmldom (which can be quite slow and cumbersome) can be found here
FYI, I also recommend RegexBuddy, as I've mentioned previously in this post: stackoverflow.com/a/18132398/278976
I highly recommend using DOM and XPath for this.

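$doc = new DOMDocument;
@$doc->loadHTML($html); // @ suppresses warnings about imperfect markup

$xp = new DOMXPath($doc);

// <br /> is parsed as an element, so each piece of text is its own text node.
// The normalize-space() predicate skips the whitespace-only nodes.
foreach ($xp->query('//div/text()[normalize-space()]') as $n) {
    echo trim($n->nodeValue) . "\n";
}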

But if you still decide to take the regex route, this will work for you.

preg_match_all('#</h4>\s*([^<]+)<br />([^<]+)#', $str, $matches);
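To show how the captured groups come back, here is a minimal sketch that runs the pattern above against markup copied from the question. The variable name `$pairs` and the use of the `PREG_SET_ORDER` flag are assumptions added for illustration; they are not part of the original answer.

```php
<?php
// Sample markup copied from the question (two of the three blocks, for brevity).
$str = <<<'HTML'
<div>
    <h4>
        <a href="somelink.html">some text blah</a>
    </h4>
    I need this text<br />I need this text too.<br />
</div>
<div>
    <h4>
        <a href="somelink.html">some text blah</a>
    </h4>
    I need this text<br />I need this text too.<br />
</div>
HTML;

// PREG_SET_ORDER groups the captures per match instead of per capture group.
preg_match_all('#</h4>\s*([^<]+)<br />([^<]+)#', $str, $matches, PREG_SET_ORDER);

$pairs = [];
foreach ($matches as $m) {
    // $m[1] is the text before the first <br />, $m[2] the text after it.
    $pairs[] = [trim($m[1]), trim($m[2])];
}

print_r($pairs);
```

Without `PREG_SET_ORDER`, the same data arrives as `$matches[1]` (all first strings) and `$matches[2]` (all second strings), which may be more convenient if the two columns are processed separately.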

1 Comment

This worked as advertised. The others would not catch repeating groups. Thanks!
This will do what you want given the exact input you provided. If you need something more generic, please let me know.

(.*)<br\s*\/>(.*)<br\s*\/>

See here for a live demo http://www.phpliveregex.com/p/1i3
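As a rough sketch of how this pattern might be used from PHP: because `.` does not match a newline by default, each line is matched on its own, and the first group picks up the line's leading indentation, so the captures are trimmed here. The second input line (with `<br/>` and no space) is a made-up example to exercise the `\s*`, and the names `$haystack` and `$rows` are assumptions for illustration.

```php
<?php
// Hypothetical input: one line in the question's style, one using <br/> without the space.
$haystack = "    I need this text<br />I need this text too.<br />\n"
          . "    Another line<br/>And its second part<br/>\n";

preg_match_all('/(.*)<br\s*\/>(.*)<br\s*\/>/', $haystack, $result, PREG_SET_ORDER);

$rows = [];
foreach ($result as $m) {
    // Group 1 starts at the beginning of the line, so it carries the indentation; trim() removes it.
    $rows[] = [trim($m[1]), trim($m[2])];
}

print_r($rows);
```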
