I'm trying to find the a tags that contain a specific domain, but the link may appear with or without www, and with either http or https:

$a = '  <a href="https://example.com"></a>
        <a href="http://www.example.com"></a>
        <a href="http://example.com"></a>
        <a href="https://www.example.com"></a>
        <a href="http://example.com"></a>
        ';
$reg_exUrl = "/(http|https)\:\/\/(www.)?example+\.com(\/\S*)?/";

preg_match($reg_exUrl, $a, $url);
var_dump($url);

But I don't get all the links. This is the output:

array:2 [
  0 => "https://example.com"
  1 => "https"
]

Also, I'm not sure how to include href in the pattern so that it only searches inside href attributes.

3 Answers

Use an HTML parser to extract the links, then a URL parser to get each host. From there, run a regex against that limited string:

$a = '  <a href="https://example.com"></a>
        <a href="http://www.example.com"></a>
        <a href="http://example.com"></a>
        <a href="https://www.example.com"></a>
        <a href="http://example.com"></a>
        ';
$dom = new DOMDocument;
$dom->loadHTML($a);
$links = $dom->getElementsByTagName('a');
foreach ($links as $link) {
    // PHP_URL_HOST yields null when the href has no host (e.g. a relative link),
    // which avoids an undefined-index notice on ['host'].
    $host = parse_url($link->getAttribute('href'), PHP_URL_HOST);
    // (^|\.) accepts example.com itself or any subdomain, but not otherexample.com.
    if (!empty($host) && preg_match('/(^|\.)example\.com$/', $host)) {
        echo 'Expected domain';
    }
}

To explain your current output: preg_match returns only the first match it finds. Index 0 holds the full match, and each later index holds one capture group.

 $reg_exUrl = "/(http|https)\:\/\/(www.)?example+\.com(\/\S*)?/";
                 ^^^^^^^^^^        ^^^^                ^^^^^

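As marked above, you have three capture groups. Put ?: at the start of a group to make it non-capturing. Your http|https can also be simplified to https? (the ? makes the s optional). Note as well that example+ matches one or more trailing e characters, and the unescaped . in www. matches any character.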
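
A minimal sketch of those suggestions applied to the original pattern, assuming the same sample markup as in the question. The variable names and the # delimiter (chosen so the slashes need no escaping) are my own choices here, and the sketch also drops the stray + and escapes the dot in www. It contrasts preg_match, which stops at the first match, with preg_match_all.

```php
<?php
// Sample markup, assumed to mirror the question's input.
$html = '<a href="https://example.com"></a>
<a href="http://www.example.com"></a>
<a href="http://example.com"></a>
<a href="https://www.example.com"></a>
<a href="http://example.com"></a>';

// (?:...) groups without capturing; https? makes the "s" optional.
// "#" is the delimiter, so "/" needs no escaping.
$pattern = '#https?://(?:www\.)?example\.com(?:/\S*)?#';

// preg_match stops at the first match.
preg_match($pattern, $html, $first);

// preg_match_all collects every match; $all[0] holds the full matches.
$count = preg_match_all($pattern, $html, $all);

var_dump($first);  // a single element: the first URL only
var_dump($all[0]); // every URL matched in the string
```

Because no group captures, both result arrays contain only full matches. Note that this still searches the whole string, not just href attributes.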

3 Comments

Thanks, I knew about that, but this might match otherexample.com/script.php?link=example.com or something similar, which is not what I want.
@hretic Actually, (^|\.) is what you'd want. See 3v4l.org/Ihs6s; I think that covers all the edge cases.
Good answer. You can also move all the test part into a function and use an XPath query, see 3v4l.org/4WZPi
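
For readers who cannot open the linked snippets, here is a rough sketch of what "a function plus an XPath query" could look like. It is not the code behind those links: isExpectedDomain is a hypothetical helper name, and the sample markup (including the look-alike otherexample.com link from the comment above) is assumed for illustration.

```php
<?php
// Hypothetical helper: true when $url's host is $domain or a subdomain of it.
function isExpectedDomain(string $url, string $domain = 'example.com'): bool
{
    // parse_url returns null when there is no host and false on a malformed URL.
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return false;
    }
    // (^|\.) anchors the domain at the start of the host or after a dot.
    $pattern = '/(^|\.)' . preg_quote($domain, '/') . '$/i';
    return preg_match($pattern, $host) === 1;
}

// Assumed sample markup, including a look-alike domain and an anchor without href.
$html = '<a href="https://example.com"></a>
<a href="http://www.example.com/page"></a>
<a href="http://otherexample.com/script.php?link=example.com"></a>
<a name="no-href"></a>';

$dom = new DOMDocument;
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$matches = [];
// //a/@href selects only the href attributes of <a> elements.
foreach ($xpath->query('//a/@href') as $attr) {
    if (isExpectedDomain($attr->nodeValue)) {
        $matches[] = $attr->nodeValue;
    }
}

print_r($matches);
```

The look-alike link is rejected because its host is otherexample.com; the example.com inside its query string is never examined.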

Here you go:

$a = '  <a href="https://example.com"></a>
        <a href="http://www.example.com"></a>
        <a href="http://example.com"></a>
        <a href="https://www.example.com"></a>
        <a href="http://example.com"></a>
        ';
// href=\" ... \" restricts the match to href attributes; (?:www\.)? is non-capturing.
$reg_exUrl = "/href=\"https?:\/\/(?:www\.)?example\.com\"/";

// preg_match_all returns every match, not just the first.
preg_match_all($reg_exUrl, $a, $url);
var_dump($url);
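
One limitation worth noting: the pattern above only matches when the closing quote follows .com directly, so a link such as https://example.com/page is skipped. A possible variation is sketched below, under the assumption that links may carry a path or query string and may use either quote style. The sample markup and the ~ delimiter are my own choices, and a regex over raw HTML remains more fragile than a parser.

```php
<?php
// Assumed sample markup: a path, single quotes, and two look-alike domains.
$html = '<a href="https://example.com"></a>
<a href=\'http://www.example.com/page?x=1\'></a>
<a href="http://otherexample.com/?link=example.com"></a>
<a href="http://example.com.evil.test/"></a>';

// (["\']) captures the opening quote and \1 requires the same closing quote.
// (?:[/?#][^"\']*)? allows an optional path, query string, or fragment.
$pattern = '~href=(["\'])(https?://(?:www\.)?example\.com(?:[/?#][^"\']*)?)\1~i';

preg_match_all($pattern, $html, $m);

// Group 2 holds each URL without the surrounding href=... wrapper.
print_r($m[2]);
```

The two look-alike links fail because the text right after example.com must be the closing quote or one of /, ?, #.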

2 Comments

Thanks. How do you add href in there, so that it only searches inside href and not the whole string?
I've updated the answer. The only extra step is to escape the double quotes: " becomes \".

Instead of preg_match, use preg_match_all.

Update: a regex that captures every href URL on the page:

$regex = '/href="(.*?)"/';

1 Comment

This will match more than example.com.
