I am trying to extract URLs that contain www.domain.com from a database column that contains HTML. The regex has to filter out www2.domain.com instances and external URLs like www.domainxyz.com. It should only match properly coded anchor links.

Here is what I have so far:

<?php
    $content = '<html>
    <title>Random Website</title>
    <body>
        Click <a href="http://domainxyz.com">here</a> for foobar
        Another site is http://www.domain.com
        <a href="http://www.domain.com/test">Test 1</a>
        <a href="http://www2.domain.com/test">Test 2</a>
        <Strong>NOT A LINK</strong>
    </body>
    </html>';

    $regex = "((https?)\:\/\/)?";
    $regex .= "([a-z0-9-.]*)\.([a-z]{2,4})"; 
    $regex .= "(\/([a-z0-9+\$_-]\.?)+)*\/?";
    $regex .= "(\?[a-z+&\$_.-][a-z0-9;:@&%=+\/\$_.-]*)?";
    $regex .= "(#[a-z_.-][a-z0-9+\$_.-]*)?"; 
    $regex .= "([www\.domain\.com])";

    $matches = array(); //create array
    $pattern = "/$regex/";

    preg_match_all($pattern, $content, $matches); 

    print_r(array_values(array_unique($matches[0])));
    echo "<br><br>";
    echo implode("<br>", array_values(array_unique($matches[0])));
?>

I am looking for this to find and output only http://www.domain.com/test.

How can I modify my regex to accomplish this?

  • What about a DOMDocument and DOMXPath based solution? I see you just extract the href attribute values, right? Commented Sep 8, 2015 at 21:30
  • Thanks, I considered this but would such a solution be possible if grabbing the html from a database query? Commented Sep 8, 2015 at 21:32
  • Please check this code. I'd suggest using regex here only as a means of last resort. Commented Sep 8, 2015 at 21:37

1 Answer

Here is a much safer way to extract the href attribute values of <a> elements that contain www.domain.com. The key is the XPath expression '//a[contains(@href, "www.domain.com")]':

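$html = "YOUR_HTML_STRING"; // Your HTML string, e.g. a value read from the database
$dom = new DOMDocument;
libxml_use_internal_errors(true); // Collect warnings about malformed HTML instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);
$arr = array();
// Select every <a> element whose href attribute contains "www.domain.com"
$links = $xpath->query('//a[contains(@href, "www.domain.com")]');

foreach ($links as $link) {
    $arr[] = $link->getAttribute("href");
}

print_r($arr);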

See IDEONE demo, result:

Array
(
    [0] => http://www.domain.com/test
)

As you see, you can use the DOMDocument and DOMXPath with a string, too.
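Since the HTML in the question comes from a database column, here is a minimal sketch of running the same XPath over several rows. The `$rows` array is a hypothetical stand-in for whatever your query returns, and the `id` and `body` keys are assumptions made for illustration only.

```php
<?php
// Hypothetical stand-in for rows fetched from a database; the keys
// "id" and "body" are assumptions, not names from the question.
$rows = [
    ['id' => 1, 'body' => '<p><a href="http://www.domain.com/test">Test 1</a></p>'],
    ['id' => 2, 'body' => '<p><a href="http://www2.domain.com/test">Test 2</a></p>'],
    ['id' => 3, 'body' => '<p>Plain text http://www.domain.com with no anchor</p>'],
];

$found = [];
libxml_use_internal_errors(true); // keep parse warnings out of the output

foreach ($rows as $row) {
    // Parse each row's HTML string separately.
    $dom = new DOMDocument();
    $dom->loadHTML($row['body']);
    $xpath = new DOMXPath($dom);

    // Same XPath as above: <a> elements whose href contains the host.
    foreach ($xpath->query('//a[contains(@href, "www.domain.com")]') as $link) {
        $found[$row['id']][] = $link->getAttribute('href');
    }
}

libxml_clear_errors();

print_r($found);
```

Only row 1 should end up in `$found`: row 2 points at www2.domain.com, and row 3 mentions the URL in plain text rather than in an anchor.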

The code is self-explanatory: the XPath expression simply finds all <a> tags that have an href attribute containing www.domain.com.
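One caveat: contains() is a plain substring test, so it would also accept an href such as http://www.domain.com.evil.example/ or a URL that merely carries www.domain.com in its query string. If that matters for your data, a stricter variant could compare the parsed host instead. The sketch below is one possible way to do that; the function name `extractLinksForHost` is hypothetical and not part of the original answer.

```php
<?php
// Hypothetical helper: return the href of every <a> element whose host is
// exactly $host (case-insensitive), with duplicates removed.
function extractLinksForHost(string $html, string $host): array
{
    if ($html === '') {
        return []; // DOMDocument::loadHTML() rejects an empty string
    }

    $dom = new DOMDocument();
    // Collect libxml parse warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = [];
    foreach ($dom->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        // parse_url() yields null or false when no host can be extracted.
        $linkHost = parse_url($href, PHP_URL_HOST);
        if (is_string($linkHost) && strcasecmp($linkHost, $host) === 0) {
            $urls[] = $href;
        }
    }

    return array_values(array_unique($urls));
}

$content = '<html><body>
    Click <a href="http://domainxyz.com">here</a> for foobar
    Another site is http://www.domain.com
    <a href="http://www.domain.com/test">Test 1</a>
    <a href="http://www2.domain.com/test">Test 2</a>
    <a href="http://www.domain.com.evil.example/x">Lookalike</a>
</body></html>';

print_r(extractLinksForHost($content, 'www.domain.com'));
```

The trade-off is that relative links (which have no host) are skipped, which seems to match the question's goal of finding only absolute www.domain.com URLs.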

1 Comment

Thank you for sharing a how to on the DOMDocument and DOMXPath approach. Since this is a better solution than regex, I ended up going this route :)