19

I need to get the domain name from a URL. The following examples should all return google.com:

google.com
images.google.com
new.images.google.com
www.google.com

Similarly the following URLs should all return google.co.uk.

google.co.uk
images.google.co.uk
new.images.google.co.uk
http://www.google.co.uk

I'm hesitant to use Regular Expressions, because something like domain.com/google.com could return incorrect results.

How can I get the top-level domain, using PHP? This needs to work on all platforms and hosts.

4
  • 1
    This is tricky. For google.com, you're interested in the TLD and second-level domain name. For google.co.uk, you want the TLD and second and third level domain names. There's no defined "base name", what you mean by "base name" is different for different registrars/TLDs. Commented Jul 9, 2010 at 9:42
  • 1
    I'm pretty sure you have to get a bit long winded here, what you are asking for is eating your cake and having it too. Without a list of TLD's there is no way to differentiate between co.uk and google.com, they're both the host name. Commented Jul 9, 2010 at 9:43
  • I guess you guys are right, it doesn't look like anything is gonna work without lots of code Commented Jul 9, 2010 at 9:46
  • Try gist.github.com/praisedpk/64bdb80d28144aa78d58469324432277 Commented Sep 18, 2016 at 20:26

8 Answers 8

19

You could do this:

$urlData = parse_url($url);

$host = $urlData['host'];

** Update **

The best way I can think of is to have a mapping of all the TLDs that you want to handle, since certain TLDs can be tricky (co.uk).

// you can add more to it if you want
$urlMap = array('com', 'co.uk');

$host = "";
$url = "http://www.google.co.uk";

$urlData = parse_url($url);
$hostData = explode('.', $urlData['host']);
$hostData = array_reverse($hostData);

if(array_search($hostData[1] . '.' . $hostData[0], $urlMap) !== FALSE) {
  $host = $hostData[2] . '.' . $hostData[1] . '.' . $hostData[0];
} elseif(array_search($hostData[0], $urlMap) !== FALSE) {
  $host = $hostData[1] . '.' . $hostData[0];
}

echo $host;

Comments

7

top-level domains and second-level domains may be 2 characters long but a registered subdomain must be at least 3 characters long.

EDIT: because of pjv's comment, I learned that Australian domain names are an exception, because they use 5 TLD-like labels as second-level domains (com, net, org, asn, id), for example somedomain.com.au. I'm guessing com.au is a nationally controlled domain name that is shared among registrants. So, technically, "com.au" would still be the "base domain", but that's not useful.

EDIT: there are 47,952 possible three-character domain names (pattern: [a-z0-9][a-z0-9-][a-z0-9], or 36 * 37 * 36, since domain names are case-insensitive). Combined with just 8 of the most common TLDs (com, org, etc.), that gives 383,616 possibilities, without even adding in the entire scope of TLDs. 1-letter and 2-letter domain names still exist, but are not valid going forward.

in google.com -- "google" is a subdomain of "com"

in google.co.uk -- "google" is a subdomain of "co", which in turn is a subdomain of "uk", or a second-level domain really, since "co" is also a valid top-level domain

in www.google.com -- "www" is a subdomain of "google" which is a subdomain of "com"

"co.uk" is NOT a valid host because there is no valid domain name

going with that assumption this function will return the proper "basedomain" in almost all cases, without requiring a "url map".

if you happen to be one of the rare cases, perhaps you can modify this to fulfill particular needs...

EDIT: you must pass the domain string as a URL with its protocol (http://, ftp://, etc.), or parse_url() will not report a host for it (unless you want to modify the code to behave differently)

function basedomain( $str = '' )
{
    // $str must be passed WITH protocol. ex: http://domain.com
    $url = @parse_url( $str );
    if ( empty( $url['host'] ) ) return;
    $parts = explode( '.', $url['host'] );
    // reset() takes its argument by reference, so the slice needs its own variable
    $second = array_slice( $parts, -2, 1 );
    // a 2-character second-to-last label (e.g. "co" in co.uk) means we keep 3 labels
    $slice = ( strlen( reset( $second ) ) == 2 ) && ( count( $parts ) > 2 ) ? 3 : 2;
    return implode( '.', array_slice( $parts, ( 0 - $slice ), $slice ) );
}
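
Since the inputs in the question come both with and without a scheme, one possible way to get a host out of either form is to add a placeholder scheme before calling parse_url(). This is only a sketch: the helper name hostFromInput is made up for illustration, and it assumes http:// is an acceptable stand-in when no scheme is given.

```php
<?php
// Hypothetical helper (not part of the answer above): extract the host from
// input that may or may not start with a scheme such as http://.
function hostFromInput( $str )
{
    $str = trim( $str );
    if ( $str === '' ) return null;
    // parse_url() only reports 'host' when a scheme (or a leading //) is present,
    // so prepend a placeholder scheme when the input does not start with one.
    if ( !preg_match( '#^[a-z][a-z0-9+.-]*://#i', $str ) ) {
        $str = 'http://' . $str;
    }
    $url = parse_url( $str );
    if ( !is_array( $url ) || empty( $url['host'] ) ) return null;
    return strtolower( $url['host'] );
}
```

With this, hostFromInput('domain.com/google.com') gives domain.com rather than google.com, which is the case the question was worried about with regular expressions.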

if you need to be accurate use fopen or curl to open this URL: http://data.iana.org/TLD/tlds-alpha-by-domain.txt

then read the lines into an array and use that to compare the domain parts
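
A rough sketch of that idea, assuming the file contents have already been fetched into a string (by fopen, curl, or a cached copy). The function names parseTldList and hasKnownTld are hypothetical. Note that this list contains only top-level domains, so it can confirm that "uk" is a TLD but not that "co.uk" is a registry-controlled suffix.

```php
<?php
// Turn the contents of tlds-alpha-by-domain.txt into a lookup array.
// The file holds one upper-case TLD per line after a leading "# Version ..." comment.
function parseTldList( $contents )
{
    $tlds = array();
    foreach ( preg_split( '/\R/', $contents ) as $line ) {
        $line = strtolower( trim( $line ) );
        // Skip blank lines and the comment line
        if ( $line === '' || $line[0] === '#' ) continue;
        $tlds[$line] = true;
    }
    return $tlds;
}

// True when the last label of $host is in the list built by parseTldList().
function hasKnownTld( $host, array $tlds )
{
    $parts = explode( '.', strtolower( $host ) );
    return isset( $tlds[ end( $parts ) ] );
}
```

Keying the array by TLD keeps each lookup to a single isset() call instead of a scan over the whole list.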

EDIT: to allow for Australian domains:

function au_basedomain( $str = '' )
{
    // $str must be passed WITH protocol. ex: http://domain.com
    $url = @parse_url( $str );
    if ( empty( $url['host'] ) ) return;
    $parts = explode( '.', $url['host'] );
    // reset() takes its argument by reference, so the slice needs its own variable
    $second = array_slice( $parts, -2, 1 );
    $slice = ( strlen( reset( $second ) ) == 2 ) && ( count( $parts ) > 2 ) ? 3 : 2;
    // Australian second-level labels are 3 characters (or "id"), so match them explicitly
    if ( preg_match( '/\.(com|net|asn|org|id)\.au$/i', $url['host'] ) ) $slice = 3;
    return implode( '.', array_slice( $parts, ( 0 - $slice ), $slice ) );
}

IMPORTANT ADDITIONAL NOTES: I don't use this function to validate domains. It is generic code I only use to extract the base domain for the server it is running on from the global $_SERVER['SERVER_NAME'] for use within various internal scripts. Considering I have only ever worked on sites within the US, I have never encountered the Australian variants that pjv asked about. It is handy for internal use, but it is a long way from a complete domain validation process. If you are trying to use it that way, I recommend against it, because there are too many ways it can match invalid domains.

6 Comments

If you change that strlen() == 2 to <=3 you'll catch 99% of the domains, save subdomains on localhost and whatnot. Here's my revision tidied up: gist.github.com/anonymous/fe77c97e632675411c3c
No, the revision does not work correctly. It needs to be == 2 because <= 3 will match when the next to the last part is 3 -- which we don't want to do. We want it to return "google.com" from "www.google.com" or "mail.google.com", and we want it to return "google.co.uk" from "www.google.co.uk" or "mail.google.co.uk"
@Mahn Additionally, there are many extra bits in your revision -- unneeded variable assignments and unneeded condition nesting. More code and undesired result -- did you test your revision thoroughly?
@Mahn also, your revision is triggering an error at: $middlePart = array_slice($parts, -2, 1)[0]; near [0]
My revision runs fine on production with 5.5, perhaps you are using an older PHP version? The extra nesting and variable assignment is there for sanity and readability, I personally dislike code that looks like it's hacked together for a hackathon, but that's just a personal preference. I also find <=3 accurate enough for my needs, since I'm not working with three letter domains; it probably ought to be accurate enough for most people.
5

Try using: http://php.net/manual/en/function.parse-url.php. Something like this should work:

$urlParts = parse_url($yourUrl);
$hostParts = explode('.', $urlParts['host']);
$hostParts = array_reverse($hostParts);
$host = $hostParts[1] . '.' . $hostParts[0];

2 Comments

That would break if you have something like this: google.co.uk - in that case, it'd return "co.uk".
It would indeed, the only way to get that sorted though is by using a TLD list.
2

Building on xil3's answer, this is what I came up with. It also checks for localhost and IP addresses, so it works in a development environment too.
You still have to define which TLDs you want to handle; other than that, everything works fine.

<?php
function getTopLevelDomain($url){
    $urlData = parse_url($url);
    $urlHost = isset($urlData['host']) ? $urlData['host'] : '';
    $isIP = filter_var($urlHost, FILTER_VALIDATE_IP) !== false; // ip2long() would return 0 (falsy) for 0.0.0.0
    if($isIP){ /** If the host is an IP address, return it unchanged */
        return $urlHost;
    }
    /** Add/edit your TLDs here */
    $urlMap = array('com', 'com.pk', 'co.uk');

    $host = "";
    $hostData = explode('.', $urlHost);
    if(isset($hostData[1])){ /** To check "localhost" because it'll be without any TLDs */
        $hostData = array_reverse($hostData);

        if(array_search($hostData[1] . '.' . $hostData[0], $urlMap) !== FALSE) {
            $host = $hostData[2] . '.' . $hostData[1] . '.' . $hostData[0];
        } elseif(array_search($hostData[0], $urlMap) !== FALSE) {
            $host = $hostData[1] . '.' . $hostData[0];
        }
        return $host;
    }
    return ((isset($hostData[0]) && $hostData[0] != '') ? $hostData[0] : 'error no domain'); /* You can change this error in future */
}
?>

You can use it like this:

$string = 'http://googl.com.pk';
echo getTopLevelDomain( $string ) . '<br>';

$string = 'http://googl.com.pk:23';
echo getTopLevelDomain( $string ) . '<br>';

$string = 'http://googl.com';
echo getTopLevelDomain( $string ) . '<br>';

$string = 'http://googl.com:23';
echo getTopLevelDomain( $string ) . '<br>';

$string = 'http://adad.asdasd.googl.com.pk';
echo getTopLevelDomain( $string ) . '<br>';

$string = 'http://adad.asdasd.googl.com.pk:23';
echo getTopLevelDomain( $string ) . '<br>';

$string = 'http://adad.asdasd.googl.com';
echo getTopLevelDomain( $string ) . '<br>';

$string = 'http://adad.asdasd.googl.com:23';
echo getTopLevelDomain( $string ) . '<br>';

$string = 'http://192.168.0.101:23';
echo getTopLevelDomain( $string ) . '<br>';

$string = 'http://192.168.0.101';
echo getTopLevelDomain( $string ) . '<br>';

$string = 'http://localhost';
echo getTopLevelDomain( $string ) . '<br>';

$string = 'https;//';
echo getTopLevelDomain( $string ) . '<br>';

$string = '';
echo getTopLevelDomain( $string ) . '<br>';

You'll get results like these, one per line:

googl.com.pk
googl.com.pk
googl.com
googl.com
googl.com.pk
googl.com.pk
googl.com
googl.com
192.168.0.101
192.168.0.101
localhost
error no domain
error no domain

Comments

1

I'm not a PHP developer and I know this isn't the full solution, but I think the general problem is actually identifying all of the possible public suffixes.

Luckily, there is a list of public suffixes maintained at https://publicsuffix.org/list/. The list is broken into two sections. The first section holds the ICANN suffixes, which include many of those mentioned on this page, such as .com and .com.au. That section is delimited with ===BEGIN ICANN DOMAINS=== and ===END ICANN DOMAINS===.

If you load just the ICANN DOMAINS section then you can identify the public suffix of a host, and the base domain is that suffix plus one more label. But it would take a PHP developer to explain how to do that efficiently :)

If you load the whole list then you can get information about private subdomains as well, such as those under github.io.
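
A rough PHP sketch of that approach, assuming the list has already been downloaded into a string. The function names parseIcannSuffixes and registrableDomain are made up for illustration, and the sketch deliberately skips the list's wildcard (*.ck) and exception (!www.ck) rules, so it is not a complete implementation of the list's matching algorithm.

```php
<?php
// Collect the plain rules from the ICANN section of public_suffix_list.dat.
// Assumption: wildcard and exception rules are ignored to keep the sketch short.
function parseIcannSuffixes( $contents )
{
    $suffixes = array();
    $inIcann  = false;
    foreach ( preg_split( '/\R/', $contents ) as $line ) {
        $line = trim( $line );
        if ( strpos( $line, '===BEGIN ICANN DOMAINS===' ) !== false ) { $inIcann = true; continue; }
        if ( strpos( $line, '===END ICANN DOMAINS===' ) !== false ) break;
        // Skip everything outside the section, blank lines, and // comments
        if ( !$inIcann || $line === '' || strpos( $line, '//' ) === 0 ) continue;
        if ( $line[0] === '*' || $line[0] === '!' ) continue;
        $suffixes[ strtolower( $line ) ] = true;
    }
    return $suffixes;
}

// Return the longest matching suffix plus one more label, or null when the
// host has no known suffix or is itself a suffix.
function registrableDomain( $host, array $suffixes )
{
    $parts = explode( '.', strtolower( $host ) );
    $count = count( $parts );
    for ( $i = 0; $i < $count; $i++ ) {
        // Candidates run from longest to shortest, so the first hit is the longest suffix
        $candidate = implode( '.', array_slice( $parts, $i ) );
        if ( isset( $suffixes[$candidate] ) ) {
            return $i === 0 ? null : implode( '.', array_slice( $parts, $i - 1 ) );
        }
    }
    return null;
}
```

Because the suffix length comes from the list, the same loop handles two-part suffixes like co.uk and longer ones without any special cases.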

Comments

0

You probably want to use the Public Suffix List.

https://publicsuffix.org/

In PHP you can do that using the regdom libs:

https://github.com/usrflo/registered-domain-libs/

Comments

0

None of the answers here support public suffixes with 3 parts, which also exist (for example, .k12.ak.us)

Here's a more complete solution that allows for any length of public suffix:

public function getBaseDomain($domain)
    {
        if (empty($domain) || substr_count($domain, ".") < 2) {
            return $domain;
        }
        $publicSuffixes = [".com",".co.uk",".k12.ak.us", ......];
        $domainParts = explode(".", $domain);
        $checkDomain = array_pop($domainParts);

        do {
            $checkDomain = array_pop($domainParts) . "." . $checkDomain;
            if (empty($domainParts)) {
                break;
            }
        } while (array_search("." . $checkDomain, $publicSuffixes) !== false);


        return $checkDomain;
    }

Note: the code here already assumes that it's a domain, not an IP, and assumes it's a valid domain, without the https://.

For the most complete list of public suffixes available, see https://publicsuffix.org/list/public_suffix_list.dat

Comments

-3

Use this function:

function getHost($url){
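    // Prepend a scheme only when none is present. The original check,
    // strpos($url,"http://"), returned 0 (falsy) for URLs starting with http://.
    if (strpos($url,"://") !== false){
        $httpurl=$url;
    } else {
        $httpurl="http://".$url;
    }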
    $parse = parse_url($httpurl);
    $domain=$parse['host'];

    $portion=explode(".",$domain);
    $count=sizeof($portion)-1;
    if ($count>1){
        $result=$portion[$count-1].".".$portion[$count];
    } else {
        $result=$domain;
    }
    return $result;
}

This handles the example URLs that end in .com. It does not handle the .co.uk examples: google.co.uk comes back as co.uk.

Comments
