HTML Regex to Extract Data

Question

I have a simple question for regex gurus. And yes... I did try several different variations of the regex before posting here. Forgive my regex ignorance. This is targeting PHP.

I have the following HTML:

<div>
    <h4>
        <a href="somelink.html">some text blah</a>
    </h4>
    I need this text<br />I need this text too.<br />
</div>
<div>
    <h4>
        <a href="somelink.html">some text blah</a>
    </h4>
    I need this text<br />I need this text too.<br />
</div>
<div>
    <h4>
        <a href="somelink.html">some text blah</a>
    </h4>
    I need this text<br />I need this text too.<br />
</div>

What I tried that seemed most likely to work:

 preg_match_all('/<div><h4><a href=".*">.*<\/a><\/h4>(.*)<br \/>(.*)<br \/>/', $haystack, $result);

The above returns nothing.

So then I tried this and I got the first group to match, but I have not been able to get the second.

preg_match_all('/<div><h4><a href=".*">.*<\/a><\/h4>(.*)<br \/>/', $haystack, $result);

Thank you!

Possible duplicate of RegEx match open tags except XHTML self-contained tags
– Dai, Commented Sep 24, 2013 at 0:41

Homer6 · Accepted Answer · 2013-09-24 00:45:47Z

2

Regex is great, but some things are best tackled with a parser. Markup is one such example.

Instead of using regex, I'd use an HTML parser, like http://simplehtmldom.sourceforge.net/

However, if you insist on using regex for this specific case, you can use this pattern:

if (preg_match('%</h4>(\\r?\\n)\\s+(.*?)(<br />)(.*?)(<br />)%', $subject, $regs)) {
    $first_text_string = $regs[2];
    $second_text_string = $regs[4];
} else {
    //pattern not found
}
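
Since preg_match stops at the first match, here is a minimal sketch of how the same idea could be applied to every block with preg_match_all. The simplified pattern (only two capture groups, \s* instead of the explicit newline) and the $pairs variable are my own assumptions, not part of the answer above; the sample input is copied from the question.

```php
<?php
// Sample input copied from the question (two of the three blocks, for brevity).
$subject = <<<'HTML'
<div>
    <h4>
        <a href="somelink.html">some text blah</a>
    </h4>
    I need this text<br />I need this text too.<br />
</div>
<div>
    <h4>
        <a href="somelink.html">some text blah</a>
    </h4>
    I need this text<br />I need this text too.<br />
</div>
HTML;

// Assumed simplification of the pattern above: capture only the two text runs.
$pattern = '%</h4>\s*(.*?)<br />(.*?)<br />%';

$pairs = [];
// PREG_SET_ORDER groups the captures per match instead of per group.
if (preg_match_all($pattern, $subject, $sets, PREG_SET_ORDER)) {
    foreach ($sets as $set) {
        $pairs[] = [$set[1], $set[2]];
    }
}

print_r($pairs);
```

Each entry of $pairs then holds the two strings for one div, in document order.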

answered Sep 24, 2013 at 0:45


2 Comments

Wrikken Over a year ago

A comparative list of alternatives to simplehtmldom (which can be quite slow and cumbersome) can be found here

Homer6 Over a year ago

FYI, I also recommend RegexBuddy, as I've mentioned previously in this post: stackoverflow.com/a/18132398/278976

hwnd · Accepted Answer · 2015-08-03 19:33:49Z

0

I highly recommend using DOM and XPath for this.

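$doc = new DOMDocument;
@$doc->loadHTML($html); // @ suppresses warnings about imperfect markup

$xp = new DOMXPath($doc);

// Each <br /> is an element, so it splits the div's text into separate
// text nodes. Select the non-blank ones directly instead of exploding.
foreach ($xp->query('//div/text()[normalize-space()]') as $n) {
   echo trim($n->nodeValue) . "\n";
}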

But if you still decide to take the regex route, this will work for you.

preg_match_all('#</h4>\s*([^<]+)<br />([^<]+)#', $str, $matches);
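
To make it clearer how to read the result, here is a rough sketch that wraps the pattern above in a helper. The function name extract_text_pairs and the sample $str are hypothetical, made up for illustration; only the regex comes from the answer.

```php
<?php
/**
 * Hypothetical helper (not from the answer): returns one [first, second]
 * pair per block, using the pattern shown above.
 */
function extract_text_pairs(string $str): array
{
    preg_match_all('#</h4>\s*([^<]+)<br />([^<]+)#', $str, $matches);

    // With the default PREG_PATTERN_ORDER, $matches[1] holds every first
    // capture and $matches[2] every second capture, in document order.
    return array_map(
        function ($first, $second) {
            return [trim($first), trim($second)];
        },
        $matches[1],
        $matches[2]
    );
}

// Sample input: the block from the question, repeated three times.
$block = "<div>\n    <h4>\n        <a href=\"somelink.html\">some text blah</a>\n    </h4>\n"
       . "    I need this text<br />I need this text too.<br />\n</div>\n";
$str = str_repeat($block, 3);

foreach (extract_text_pairs($str) as $pair) {
    echo $pair[0] . ' | ' . $pair[1] . "\n";
}
```

Because [^<]+ cannot cross a tag, each capture stops at the next <br />, which is why this pattern picks up all three blocks.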

edited Aug 3, 2015 at 19:33

answered Sep 24, 2013 at 2:13


1 Comment

a432511 Over a year ago

This worked as advertised. The others would not catch repeating groups. Thanks!

Timothy Huertas · Accepted Answer · 2013-09-24 01:00:10Z

0

This will do what you want given the exact input you provided. If you need something more generic please let me know.

(.*)<br\s*\/>(.*)<br\s*\/>

See here for a live demo http://www.phpliveregex.com/p/1i3
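
In case the demo link stops working, here is a small sketch of how that pattern might be used from PHP. The second input line (written with <br/>) is a made-up variant to show what \s* allows; the variable names are assumptions as well.

```php
<?php
// One line from the question, plus a hypothetical variant written with <br/>.
$lines = "    I need this text<br />I need this text too.<br />\n"
       . "    Another line<br/>and its second half<br/>\n";

// "." does not match a newline by default, so each line is matched on its own.
preg_match_all('/(.*)<br\s*\/>(.*)<br\s*\/>/', $lines, $m, PREG_SET_ORDER);

$rows = [];
foreach ($m as $set) {
    // Group 1 starts at the beginning of the line, so it still carries the
    // indentation; trim() removes it.
    $rows[] = [trim($set[1]), trim($set[2])];
}

print_r($rows);
```

Note that the pattern is not anchored to the h4 block, so it will match any line containing two br tags, which is fine for the exact input in the question but may be too loose elsewhere.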

answered Sep 24, 2013 at 1:00

