I am trying to parse URLs in PHP where the input could be any of the following:

Code:

$info = parse_url('http://www.domainname.com/');
print_r($info);

$info = parse_url('www.domain.com');
print_r($info);

$info = parse_url('/test/');
print_r($info);

$info = parse_url('test.php');
print_r($info);

Returns:

Array
(
    [scheme] => http
    [host] => www.domainname.com
    [path] => /
)
Array
(
    [path] => www.domain.com
)
Array
(
    [path] => /test/
)
Array
(
    [path] => test.php
)

The problem is the second example: the domain is returned as the path rather than as the host.

2 Answers

This gives the right results, but a file name must start with a slash; otherwise it is treated as a domain:

parse('http://www.domainname.com/');
parse('www.domain.com');
parse('/test/');
parse("/file.php");

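function parse($url){
    // No scheme and no leading slash: assume a bare domain and prepend http://
    if(strpos($url,"://")===false && substr($url,0,1)!="/") $url = "http://".$url;
    $info = parse_url($url);
    // parse_url() returns false for seriously malformed URLs
    if($info) print_r($info);
}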

The result is:

Array
(
    [scheme] => http
    [host] => www.domainname.com
    [path] => /
)
Array
(
    [scheme] => http
    [host] => www.domain.com
)
Array
(
    [path] => /test/
)
Array
(
    [path] => /file.php
)

5 Comments

Just a quick one: how can I differentiate between a file name and a domain name so that I know when to prepend the leading slash?
You could check whether it starts with www, but that may not be safe. Checking its extension would be better, if you know all the possible file extensions. Counting the dots won't be safe either.
Well, my code is scanning a page for links, so there's no guarantee the link will have www, a subdomain, or neither. It would be a mammoth task to check for all TLDs!
If you are fetching URLs from anchors in a web page, there are three possibilities. First, remote URLs, which start with a scheme such as "http://". Second, "relative to root" URLs, which start with "/". Third, "relative to current path" URLs, which start directly with the path or file name. You won't run into "www.yourdomain.com" type URLs in anchors.
Two more possibilities: inline page anchors, which start with "#", and "javascript:" action hrefs.
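
A minimal sketch of the five cases listed in these comments, in case it helps when scanning a page. The function name `classify_href()` and the labels it returns are hypothetical, not part of PHP, and the checks only look at how the string starts.

```php
<?php
// Hypothetical classifier for href values taken from anchors.
// It only inspects the start of the string; it does not validate the URL.
function classify_href(string $href): string
{
    $href = trim($href);
    if ($href !== '' && $href[0] === '#') {
        return 'fragment';        // inline page anchor
    }
    if (stripos($href, 'javascript:') === 0) {
        return 'javascript';      // script action, nothing to fetch
    }
    if (preg_match('/^[a-z][a-z0-9+.\-]*:\/\//i', $href)) {
        return 'remote';          // has "scheme://", e.g. http://
    }
    if ($href !== '' && $href[0] === '/') {
        return 'root-relative';   // relative to the site root
    }
    return 'path-relative';       // relative to the current path
}

foreach (['http://www.domainname.com/', '/test/', 'test.php', '#top', 'javascript:void(0)'] as $href) {
    echo $href, ' => ', classify_href($href), "\n";
}
```

Links that begin with "//" are not covered by the comments above; this sketch would report them as root-relative, so they would need their own check if they matter to you.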

To handle a URL in a way that preserves the fact that it was scheme-less, while still allowing the domain to be identified, use the following code:

if (!preg_match('/^(([a-z][a-z0-9\-\.\+]*:)|(\/))/i', $url)) {
    $url = '//' . $url;
}

This prepends "//" to the URL only if it neither has a valid scheme nor begins with "/".
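
As a rough sketch, the check could be wrapped in a small helper and passed to parse_url(). The name `ensure_authority()` is hypothetical, and the pattern is written with a single group so that `^` anchors both alternatives (a scheme, or a leading slash).

```php
<?php
// Hypothetical helper: prefix "//" so parse_url() treats a scheme-less
// string such as "www.domain.com" as a host rather than a path.
function ensure_authority(string $url): string
{
    // Leave the URL alone if it starts with "scheme:" or with "/".
    if (preg_match('/^(?:[a-z][a-z0-9+.\-]*:|\/)/i', $url)) {
        return $url;
    }
    return '//' . $url;
}

print_r(parse_url(ensure_authority('www.domain.com')));         // prefixed, parsed as a host
print_r(parse_url(ensure_authority('http://www.domainname.com/'))); // left unchanged
print_r(parse_url(ensure_authority('/test/')));                 // left unchanged
```

Note that a bare file name such as test.php is also prefixed and will come back as a host, so this does not by itself tell file names and domains apart.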

Some quick background on this:

The parser assumes that the (valid) characters before ":" are the scheme, and that the characters following "//" are the domain. To indicate that the URL has both a scheme and a domain, the two markers must be used consecutively, as "://". For example:

  • [scheme]:[path//path]
  • //[domain][/path]
  • [scheme]://[domain][/path]
  • [/path]
  • [path]
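
The forms above can be tried directly. This is only an illustration; the addresses and host names are placeholders I picked, one per form.

```php
<?php
// Placeholder inputs, one per form in the list above.
$examples = [
    'mailto:someone@example.com',   // [scheme]:[path]
    '//www.domain.com/test/',       // //[domain][/path]
    'http://www.domain.com/test/',  // [scheme]://[domain][/path]
    '/test/',                       // [/path]
    'test.php',                     // [path]
];

foreach ($examples as $url) {
    echo $url, "\n";
    print_r(parse_url($url));       // shows which components were recognised
}
```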

This is how PHP parses URLs with parse_url(), but I couldn't say whether it conforms to the standard.

The rule for a valid scheme name is: alpha *( alpha | digit | "+" | "-" | "." ), that is, a letter followed by any number of letters, digits, "+", "-", or ".".
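
If you want to test a candidate scheme on its own, the rule translates fairly directly into a pattern. A small sketch follows; `is_valid_scheme()` is a hypothetical name, not a PHP built-in.

```php
<?php
// Hypothetical helper: true if $scheme is a letter followed by any
// number of letters, digits, "+", "-" or ".".
function is_valid_scheme(string $scheme): bool
{
    // \z anchors at the very end, so a trailing newline is rejected.
    return preg_match('/^[a-z][a-z0-9+.\-]*\z/i', $scheme) === 1;
}

var_dump(is_valid_scheme('http'));     // starts with a letter
var_dump(is_valid_scheme('svn+ssh'));  // "+" is allowed after the first character
var_dump(is_valid_scheme('1http'));    // starts with a digit
```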

2 Comments

preg_match(): Unknown modifier ')'
@Shardj I'm afraid I can't replicate the error you have reported. Perhaps double check you have copied the expressions correctly. I suspect you have (/) in the expression instead of (\/).
