How do I extract links with a specific domain name using PHP and Regex?

Question

I am trying to extract URLs that contain www.domain.com from a database column that contains HTML. The regex has to filter out www2.domain.com instances and external URLs like www.domainxyz.com. It should only match properly coded anchor links.

Here is what I have so far:

<?php
    $content = '<html>
    <title>Random Website</title>
    <body>
        Click <a href="http://domainxyz.com">here</a> for foobar
        Another site is http://www.domain.com
        <a href="http://www.domain.com/test">Test 1</a>
        <a href="http://www2.domain.com/test">Test 2</a>
        <Strong>NOT A LINK</strong>
    </body>
    </html>';

    $regex = "((https?)\:\/\/)?";
    $regex .= "([a-z0-9-.]*)\.([a-z]{2,4})"; 
    $regex .= "(\/([a-z0-9+\$_-]\.?)+)*\/?";
    $regex .= "(\?[a-z+&\$_.-][a-z0-9;:@&%=+\/\$_.-]*)?";
    $regex .= "(#[a-z_.-][a-z0-9+\$_.-]*)?"; 
    $regex .= "([www\.domain\.com])";

    $matches = array(); //create array
    $pattern = "/$regex/";

    preg_match_all($pattern, $content, $matches); 

    print_r(array_values(array_unique($matches[0])));
    echo "<br><br>";
    echo implode("<br>", array_values(array_unique($matches[0])));
?>

I am looking for this to find and output only http://www.domain.com/test.

How can I modify my Regex to accomplish this?

What about a DOMDocument and DOMXPath based solution? I see you just extract the href attribute values, right?
– Wiktor Stribiżew, Commented Sep 8, 2015 at 21:30
Thanks, I considered this, but would such a solution be possible when grabbing the HTML from a database query?
– andyy15, Commented Sep 8, 2015 at 21:32
Please check this code. I'd suggest using regex here only as a means of last resort.
– Wiktor Stribiżew, Commented Sep 8, 2015 at 21:37

Wiktor Stribiżew · Accepted Answer · 2015-09-08 22:28:04Z

4

Here is a much safer way to extract the href attribute values of <a> elements that contain www.domain.com. The key is the XPath expression '//a[contains(@href, "www.domain.com")]':

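$html = "YOUR_HTML_STRING"; // Your HTML string
libxml_use_internal_errors(true); // collect parser warnings about sloppy markup instead of printing them
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$arr = array();
// Select every <a> element whose href attribute contains "www.domain.com"
$links = $xpath->query('//a[contains(@href, "www.domain.com")]');

foreach ($links as $link) {
    $arr[] = $link->getAttribute("href");
}

print_r($arr);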

See the IDEONE demo. Result:

Array
(
    [0] => http://www.domain.com/test
)

As you can see, DOMDocument and DOMXPath work on a plain string, too, so the HTML can just as well come from a database query.
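Because the input only has to be a string, the same steps can be wrapped in a function and applied to each row a query returns. The following is a minimal sketch, not the answer's original code: the function name `extractLinksContaining` and the `$rows` array are hypothetical, with `$rows` standing in for HTML values already fetched from the database.

```php
<?php
// Hypothetical helper: returns the href values of <a> elements whose href contains $needle.
function extractLinksContaining(string $html, string $needle): array
{
    if (trim($html) === '') {
        return []; // nothing to parse; loadHTML() does not accept an empty string
    }

    $previous = libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $hrefs = [];
    // Assumption: $needle contains no double quote, so it can be embedded in the XPath literal.
    foreach ($xpath->query('//a[contains(@href, "' . $needle . '")]') as $link) {
        $hrefs[] = $link->getAttribute('href');
    }
    return $hrefs;
}

// Stand-in for rows returned by a database query (assumed sample data).
$rows = [
    '<p><a href="http://www.domain.com/test">Test 1</a></p>',
    '<p><a href="http://www2.domain.com/test">Test 2</a> and http://www.domain.com</p>',
];

$found = [];
foreach ($rows as $html) {
    $found = array_merge($found, extractLinksContaining($html, 'www.domain.com'));
}
print_r(array_values(array_unique($found)));
```

Bare URLs in the text (such as the one in the second row) are ignored, since only the href attributes of <a> elements are inspected.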

The XPath expression '//a[contains(@href, "www.domain.com")]' simply means: find all <a> elements that have an href attribute containing www.domain.com.
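One caveat: contains() is a plain substring test, so it would also accept an href such as http://www.domain.com.example.org/ or http://example.org/?ref=www.domain.com. If that matters for your data, a stricter variant could select every <a> with an href and compare the parsed host exactly. The sketch below uses parse_url() for that; the function name `extractLinksForHost` and the sample HTML are assumptions made for illustration.

```php
<?php
// Hypothetical helper: keep only <a href> values whose host is exactly $host.
function extractLinksForHost(string $html, string $host): array
{
    $previous = libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $hrefs = [];
    foreach ($xpath->query('//a[@href]') as $link) {
        $href = $link->getAttribute('href');
        // parse_url() yields null or false when the URL has no usable host (e.g. relative links)
        $linkHost = parse_url($href, PHP_URL_HOST);
        if (is_string($linkHost) && strcasecmp($linkHost, $host) === 0) {
            $hrefs[] = $href;
        }
    }
    return array_values(array_unique($hrefs));
}

// Assumed sample input with two lookalike links that a substring test would accept.
$sample = '<body>
    <a href="http://www.domain.com/test">Test 1</a>
    <a href="http://www2.domain.com/test">Test 2</a>
    <a href="http://www.domain.com.example.org/">Lookalike host</a>
    <a href="http://example.org/?ref=www.domain.com">Domain in query string</a>
</body>';

print_r(extractLinksForHost($sample, 'www.domain.com'));
```

Relative links have no host component and are skipped by this variant, which may or may not be what you want.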

1 Comment

andyy15 Over a year ago

Thank you for sharing a how-to on the DOMDocument and DOMXPath approach. Since this is a better solution than regex, I ended up going this route :)
