I am having trouble trying to write a non-greedy regex statement.

Here is my string:

<strong>name</strong><strong>address</strong>mailto:[email protected]

Here is my regex query:

<strong>(.*?)</strong>.*?([A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4})

The problem is that I need the address, not the name, from the string. So I need the regex query to be non-greedy and take the closest <strong></strong> pair instead of the one farthest away.

There are also multiple instances of this in my search string, so it would have to match multiple instances at a time instead of just adding a .* (greedy) thing in front of it.

So it would have to match all the instances of this, and pull the addresses, not names:

   <strong>name</strong><strong>address1</strong>mailto:[email protected]
   <strong>name</strong><strong>address2</strong>mailto:[email protected]
   <strong>name</strong><strong>address3</strong>mailto:[email protected]
   <strong>name</strong><strong>address4</strong>mailto:[email protected]

Thanks in advance!

  • HTML+RegEx=Sh*t-storm brewin'. Prepare for "don't use regex" answers/comments. -- On the other hand, I'm not sure I 100% understand the question. Can you provide example captures too? (Maybe adding $ to the end of your regex is what you're looking for?) Commented Mar 21, 2011 at 19:57
  • yeh i was just looking for a good post on the topic of that brad, lol. Snowman, suggest you use phpquery code.google.com/p/phpquery as it allows one to traverse the DOM much like jquery does. Commented Mar 21, 2011 at 19:59
  • @Jason: PHP already has the DOMDocument. ;-) Commented Mar 21, 2011 at 20:00
  • @Brad, the "phpquery" library provides numerous selector shortcuts missing from the plain vanilla DOM. Commented Mar 21, 2011 at 20:06
  • I think the problem with the attempted solution is that it tries to apply the non-greedy matching 'in reverse', forcing the <strong>/</strong> pair to match as far to the right as possible before the mail address. But that's not how non-greedy matches work. For a quick-and-dirty solution, I'd just use [^<]* instead of .*? -- since < is illegal in HTML except to start a tag, that will match any legal content of the address field, but keep it from matching the tags. Commented Mar 21, 2011 at 22:14

3 Answers

2

First, regular expressions are a suboptimal tool for matching HTML, and this question is a good example of why. You'll be happier with a parser if you know how to use one (maybe one of the PHP gurus can recommend one).

Having said that, a better way with regexes would probably be to match (and discard) the first <strong> tag explicitly:

<strong>.*?</strong><strong>(.*?)</strong>.*?([A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4})

This is by no means a good, reliable, bulletproof solution, but at least it works for your sample data.
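As a rough check, here is a minimal PHP sketch that runs the pattern above over several lines. The e-mail addresses (user1@example.com and so on) are hypothetical placeholders, since the ones in the question are obscured, and the `i` flag is an assumption added because the character classes only list upper-case letters.

```php
<?php
// Hypothetical sample data; the e-mail addresses are placeholders.
$text = <<<HTML
<strong>name</strong><strong>address1</strong>mailto:user1@example.com
<strong>name</strong><strong>address2</strong>mailto:user2@example.com
HTML;

// Match and discard the first <strong> pair, then capture the second pair
// and the e-mail. "~" is the delimiter because the pattern contains "/" and "%".
$re = '~<strong>.*?</strong><strong>(.*?)</strong>.*?([A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4})~i';

// PREG_SET_ORDER groups the result per match: [full match, address, e-mail].
preg_match_all($re, $text, $matches, PREG_SET_ORDER);
foreach ($matches as $m) {
    printf("%s => %s\n", $m[1], $m[2]);
}
```

Because `.` does not match a newline by default, each match stays on its own line of the input.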

Or, if you can be more specific about what's allowed between/after the relevant tag, how about this:

<strong>([^<>]*)</strong>(?:mailto:)?([A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4})
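A small PHP sketch of how this stricter pattern might be used; the sample e-mail addresses are hypothetical placeholders and the `i` flag is an assumption, as the character classes are upper-case only. Since `[^<>]*` cannot cross a tag boundary, the attempt that starts at the name's <strong> fails and only the address pair can match.

```php
<?php
// Hypothetical sample data; the e-mail addresses are placeholders.
$text = '<strong>name</strong><strong>address1</strong>mailto:user1@example.com' . "\n"
      . '<strong>name</strong><strong>address2</strong>mailto:user2@example.com';

// [^<>]* stops at the first angle bracket, so group 1 can never span two tags.
$re = '~<strong>([^<>]*)</strong>(?:mailto:)?([A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4})~i';

$count = preg_match_all($re, $text, $matches);

// Pair each captured address (group 1) with its e-mail (group 2).
$pairs = array_combine($matches[1], $matches[2]);
print_r($pairs);
```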

1 Comment

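The actual problem is that the .*? just before the email capture also consumes the <strong> element that's actually desired: the match starts at the first <strong>, so the first group captures the name and the .*? skips over everything up to the email. (Just thought I would mention it for OP so they know where they went wrong.)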
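To make this visible, the sketch below wraps the middle `.*?` of the question's pattern in an extra capture group so you can see what it consumes. The extra group, the placeholder address user@example.com, and the `i` flag are all additions for illustration only.

```php
<?php
// Hypothetical sample line; the e-mail address is a placeholder.
$text = '<strong>name</strong><strong>address</strong>mailto:user@example.com';

// Same as the question's pattern, except the middle .*? is now group 2.
$re = '~<strong>(.*?)</strong>(.*?)([A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4})~i';

preg_match($re, $text, $m);
printf("group 1: %s\n", $m[1]);  // what the first group captured
printf("skipped: %s\n", $m[2]);  // what the middle .*? consumed
printf("email:   %s\n", $m[3]);
```

Group 1 comes out as "name" and the skipped part is "<strong>address</strong>mailto:". The regex engine always prefers the leftmost starting position, and non-greedy quantifiers only shorten a match from the right, never move its start.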
0

Looking at your test data, here are the rules I infer: If...

  1. Name and Address are both wrapped in STRONG elements and the email follows immediately, AND
  2. The STRONG elements' attributes, the name and the addresses all have no angle brackets, AND
  3. The email address component always begins with mailto:, AND
  4. There are no other HTML elements within the two STRONG elements,

Then this tested code should do the trick:

$re = '%
    # Capture name and address in <strong> element then email.
    <strong[^>]*>\s*([^<>]+)</strong\s*>\s*  # $1: Name.
    <strong[^>]*>\s*([^<>]+)</strong\s*>\s*  # $2: Address.
    (mailto:\S+)                             # $3: Email.
    %ix';
$count = preg_match_all($re, $text, $matches);
if ($count) {
    printf("%d matches found:\n", $count);
    print_r($matches);
    for ($i = 0; $i < $count; ++$i) {
        printf("Match %d: Name: \"%s\", Address: \"%s\", Email: \"%s\"\n",
            $i + 1, $matches[1][$i], $matches[2][$i], $matches[3][$i]);
    }
} else {
    printf("No matches found.\n");
}


0

Don't use regular expressions for parsing HTML.

See http://htmlparsing.com/php.html
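For example, a parser-based version could look roughly like the sketch below, which uses PHP's built-in DOMDocument and DOMXPath. The sample data and placeholder e-mail addresses are hypothetical, and the XPath expression assumes that the address is always the <strong> element directly followed by text starting with "mailto:".

```php
<?php
// Hypothetical sample data; the e-mail addresses are placeholders.
$html = '<strong>name</strong><strong>address1</strong>mailto:user1@example.com' . "\n"
      . '<strong>name</strong><strong>address2</strong>mailto:user2@example.com';

// Collect parser warnings internally instead of printing them
// (the input is a fragment, not a complete HTML document).
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Select each <strong> whose immediate next sibling is a text node
// beginning with "mailto:".
$query = '//strong[starts-with(following-sibling::node()[1][self::text()], "mailto:")]';

$result = [];
foreach ($xpath->query($query) as $strong) {
    // Strip the "mailto:" prefix and surrounding whitespace from the text node.
    $email = trim(substr($strong->nextSibling->nodeValue, strlen('mailto:')));
    $result[trim($strong->textContent)] = $email;
}
print_r($result);
```

The name element is skipped because its next sibling is another <strong> element rather than a text node.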
