I use the PHP pattern modifier "U" to invert the default greedy behavior in preg_match(). However, it doesn't work the way I expect. My code:

$str = '<p>
<div><a aaa
    <a href="a.mov"></a>
  </div>
</p>';

$needle = "a.mov";

$pattern = "/\<a.*".preg_quote($needle, "/").".*\<\/a\>/sU";

preg_match($pattern, $str, $matches);
print_r($matches);

I'm trying to match

<a href="a.mov"></a>

But this code returns

<a aaa
    <a href="a.mov"></a>

Can someone shed some light on what I did wrong?

  • Your $matches variable doesn't equal anything, does it? How do you print it when it's not initialized? Commented Oct 14, 2011 at 20:33
  • Check this out: stackoverflow.com/questions/1732348/… and then rewrite this to use DOM operations instead of regexes. Your broken <a aaa tag demonstrates why regexes cannot be used reliably on HTML: HTML is NOT a regular language. Commented Oct 14, 2011 at 20:34
  • @Grigor: it's initialized/populated by preg_match Commented Oct 14, 2011 at 20:34

2 Answers

Well, in a more general sense, your mistake was trying to parse HTML with regexps. As for the snippet you provided, the problem is that the ungreedy modifier only tells *, + and {n,} to stop as soon as the rest of the pattern can match, instead of consuming as much as they can.

So it affects where the match ends, not where it begins. The engine still starts at the leftmost position where a match is possible, which here is the first <a. "Ungreedy" does not mean "give me the shortest possible match".
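To see that effect in isolation, here is a minimal sketch on a made-up subject string (the string and variable names are only for illustration, they are not from the question). Both patterns start at the first "a"; only the end of the match moves:

```php
<?php
// Hypothetical subject string, chosen so there are two possible start
// points ("a") and two possible end points ("c").
$subject = 'a a b c c';

// Greedy: .* consumes as much as it can, so the match ends at the last "c".
preg_match('/a.*c/', $subject, $greedy);

// Ungreedy (U): .* stops at the first "c", but the match still begins
// at the leftmost "a", not at the "a" closest to that "c".
preg_match('/a.*c/U', $subject, $ungreedy);

echo $greedy[0], "\n";   // a a b c c
echo $ungreedy[0], "\n"; // a a b c
```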

You can more or less fix this particular example by using the mU modifiers instead of sU, so that . doesn't match newlines.
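A sketch of that fix against the string from the question. Since m only changes how ^ and $ behave, this version simply drops s and keeps U. The second pattern ($stricter) is not from the question or this answer. It is an assumed alternative that keeps the match inside a single tag with [^<>]*, so it does not depend on where the line breaks happen to fall:

```php
<?php
$str = '<p>
<div><a aaa
    <a href="a.mov"></a>
  </div>
</p>';

$needle = 'a.mov';

// No "s" modifier: "." cannot cross a newline, so the attempt starting at
// "<a aaa" fails and the engine retries from the next "<a".
$pattern = '/<a.*' . preg_quote($needle, '/') . '.*<\/a>/U';
preg_match($pattern, $str, $matches);
print_r($matches);

// Assumed alternative: [^<>]* cannot run past the end of the opening tag,
// so the needle has to appear inside the same <a ...> tag.
$stricter = '/<a\b[^<>]*' . preg_quote($needle, '/') . '[^<>]*>.*?<\/a>/s';
preg_match($stricter, $str, $strictMatches);
print_r($strictMatches);
```

Both calls should leave <a href="a.mov"></a> in element 0. This is still a regex workaround for one known input, not a general way to read HTML.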

1 Comment

+1. "greedy" and "non-greedy" are misnomers. If we called them "eager" and "reluctant" instead, we might prevent some of this confusion. It seems like everybody has to learn this lesson the hard way. (FYI, there's no need to add the m modifier; just remove the s.)

My array is turning up empty as well. You have to be careful about line breaks when you use regex with HTML. There may be an issue with single-line mode (the s modifier).

See: http://www.regular-expressions.info/dot.html

I've successfully parsed HTML with regex but I wouldn't do it going forward. Look into

http://simplehtmldom.sourceforge.net/

You will never look back.
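If you would rather not add a library, PHP's built-in DOMDocument can do the same kind of lookup. This is a rough sketch, not the simplehtmldom API. The input here is a made-up, well-formed fragment, because how a parser recovers from the broken <a aaa tag in the question is not something to rely on:

```php
<?php
// Hypothetical well-formed input, for illustration only.
$html = '<div><a href="b.mov"></a><a href="a.mov"></a></div>';
$needle = 'a.mov';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

// Collect the href of every <a> whose href contains the needle.
$found = [];
foreach ($doc->getElementsByTagName('a') as $link) {
    $href = $link->getAttribute('href');
    if (strpos($href, $needle) !== false) {
        $found[] = $href;
    }
}

print_r($found);
```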
