PHP preg_match() ungreedy match issue

Question

I use the PHP pattern modifier "U" to invert the default greedy behavior of preg_match(). However, it doesn't work the way I want. My code:

$str = '<p>
<div><a aaa
    <a href="a.mov"></a>
  </div>
</p>';

$needle = "a.mov";

$pattern = "/\<a.*".preg_quote($needle, "/").".*\<\/a\>/sU";

preg_match($pattern, $str, $matches);
print_r($matches);

I'm trying to match

<a href="a.mov"></a>

But this code returns

<a aaa
    <a href="a.mov"></a>

Can someone shed some light on what I did wrong?

Your $matches variable doesn't equal anything, does it? How do you print it when it's not initialized?
– Grigor, Commented Oct 14, 2011 at 20:33
Check this out: stackoverflow.com/questions/1732348/… and then rewrite this to use DOM operations instead of regexes. Your broken <a aaa tag demonstrates why regexes cannot be used reliably on HTML: HTML is not a regular language.
– Marc B, Commented Oct 14, 2011 at 20:34

Fluffy · Accepted Answer · 2011-10-14 20:53:56Z

2

Well, in a more general sense, your mistake was trying to parse HTML with regexps. Regarding the snippet you provided, the problem is that the ungreedy modifier tells *, + and {n,} to stop as soon as they are satisfied instead of going as far as they can.

So it essentially affects where the match ends, not where it begins. The engine still starts at the leftmost position where a match is possible, which here is the first <a. "Ungreedy" does not mean "give me the shortest possible match".

You can more or less fix this particular example by using the mU modifiers instead of sU, so that . doesn't match newlines.
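To make the difference concrete, here is a minimal sketch that runs the question's pattern against the question's string in three ways. The variable names ($lazyDotAll, $lazyNoDotAll, $negated) and the third pattern are my own illustration, not part of the original code. The third pattern assumes the target tag contains no < or > before its closing bracket, and uses a negated character class so the match cannot run across the start of another tag.

```php
<?php
// The question's input, written with explicit "\n" so line endings are unambiguous.
$str = "<p>\n<div><a aaa\n    <a href=\"a.mov\"></a>\n  </div>\n</p>";
$needle = preg_quote("a.mov", "/");

// 1) Original pattern: lazy (U) plus dotall (s).
//    The match still begins at the FIRST "<a"; laziness only shortens the tail.
preg_match('/<a.*' . $needle . '.*<\/a>/sU', $str, $lazyDotAll);

// 2) Same pattern without s: "." no longer crosses the newline after "<a aaa",
//    so the attempt starting at the first "<a" fails and the second "<a" is tried.
preg_match('/<a.*' . $needle . '.*<\/a>/U', $str, $lazyNoDotAll);

// 3) Illustrative alternative: [^<>]* cannot step over "<" or ">",
//    so the opening tag has to be a single well-delimited tag.
preg_match('/<a\b[^<>]*' . $needle . '[^<>]*>.*?<\/a>/s', $str, $negated);

var_dump($lazyDotAll[0]);   // "<a aaa\n    <a href=\"a.mov\"></a>"
var_dump($lazyNoDotAll[0]); // "<a href=\"a.mov\"></a>"
var_dump($negated[0]);      // "<a href=\"a.mov\"></a>"
```

Variants 2 and 3 only work because of how this particular input is laid out; neither makes regex a general-purpose HTML parser.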

1 Comment

Alan Moore Over a year ago

+1. "greedy" and "non-greedy" are misnomers. If we called them "eager" and "reluctant" instead, we might prevent some of this confusion. It seems like everybody has to learn this lesson the hard way. (FYI, there's no need to add the m modifier; just remove the s.)

Len · Accepted Answer · 2011-10-14 20:44:25Z

0

My array is turning up empty as well. You have to be careful about linebreaks when you try to use Regex with HTML. There may be an issue with single line mode.

See: http://www.regular-expressions.info/dot.html

I've successfully parsed HTML with regex but I wouldn't do it going forward. Look into

http://simplehtmldom.sourceforge.net/

You will never look back.
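As a rough sketch of the parser-based approach, here is the same lookup done with PHP's bundled DOM extension (DOMDocument and DOMXPath) rather than the Simple HTML DOM library linked above; that swap is my choice, purely so the example needs no extra download. How a parser recovers from the malformed <a aaa in the question depends on the parser, so this sketch assumes a well-formed fragment, and the $html, $needle and $found names are hypothetical.

```php
<?php
// Assumption: a well-formed variant of the question's markup, with a second
// link added so the filter has something to reject.
$html = '<div><a href="b.mov">other</a><a href="a.mov"></a></div>';
$needle = 'a.mov';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$found = [];
// Visit every <a> that has an href attribute and keep exact matches only.
foreach ($xpath->query('//a[@href]') as $a) {
    if ($a->getAttribute('href') === $needle) {
        $found[] = $doc->saveHTML($a); // serialize just this element
    }
}

print_r($found);
```

Comparing the attribute value directly means no quoting of the needle is needed, and the result is an element rather than a substring, so line breaks inside the markup stop mattering.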
