
Before you tell me to use parse_url, it's not nearly good enough and has too many bugs. There are many questions on the subject of parsing URLs to be found on here, but nearly all deal only with some specific class of URLs or are otherwise incomplete.

I'm looking for a definitive RFC-compliant URL parser in PHP that will reliably process any URL that a browser is likely to encounter. In this I include:

  • Page-internal links #, #title
  • Page-relative URLs blah/thing.php
  • Site-relative URLs /blah/thing.php
  • Anonymous-protocol URLs //ajax.googleapis.com/ajax/libs/jquery/1.8.1/jquery.min.js
  • Callto URLs callto:+442079460123
  • File URLs file:///Users/me/thisfile.txt
  • Mailto URLs mailto:[email protected]?subject=hello, mailto:?subject=hello

and support for all the usual scheme/authentication/domain/path/query/fragment etc, and break all of those elements out into an array, with extra flags for relative/schemaless URLs. Ideally it would come with a URL reconstructor (like http_build_url) supporting the same elements, and I'd also like validation to be applied (i.e. it should be able to make a best-guess interpretation of a URL if it's invalid, but flag it as such, just like browsers do).
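To be concrete about the flags I mean, here's a rough sketch of the kind of detection I'd expect (the function name and array shape are just my own illustration, not an existing API):

```php
<?php
// Hypothetical sketch of the relative/schemeless flags I'd want
// alongside the usual scheme/host/path/query/fragment components.
function url_flags(string $url): array
{
    return [
        'is_fragment_only' => ($url !== '' && $url[0] === '#'),
        'is_schemeless'    => (substr($url, 0, 2) === '//'),   // anonymous protocol
        'is_site_relative' => ($url !== '' && $url[0] === '/'
                               && substr($url, 0, 2) !== '//'),
        // No scheme and not protocol-relative; note that fragment-only and
        // site-relative URLs count as relative here too.
        'is_relative'      => !preg_match('~^[a-z][a-z0-9+.-]*:~i', $url)
                              && substr($url, 0, 2) !== '//',
    ];
}
```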

This answer contained a tantalising Fermat-style reference to such a beast, but it doesn't actually go anywhere.

I've looked in all the major frameworks, but they only seem to provide thin wrappers around parse_url, which is generally a bad place to start since it makes so many mistakes.

So, does such a thing exist?

  • Just curious: What's the matter with parse_url()? Commented Oct 2, 2012 at 9:51
  • A definitive URL parser would originate from a browser's native code. The examples you listed ought to be recognized as valid by a browser. So, see how it's doing, port the code. Commented Oct 2, 2012 at 10:11
  • Essentially parse_url tries to parse everything as if it was HTTP, which means that it often breaks on anything that isn't. It often gets confused between domain and path, especially in schemes that don't have a domain, such as in callto, and it gets completely thrown by anonymous protocols. Commented Oct 15, 2012 at 18:33
  • Browsers don't use regex for parsing URLs, and I don't really fancy porting thousands of lines of C, for example from uriparser.sourceforge.net or code.google.com/p/google-url. I guess it would be possible to wrap those into a PHP extension, but that's a bit beyond me. Commented Oct 15, 2012 at 18:47
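To make those failure modes concrete, here's a quick sketch (exact output varies by PHP version; protocol-relative handling in particular changed in PHP 5.4.7):

```php
<?php
// parse_url() treats the scheme-specific part of non-HTTP URLs as a bare
// path, and offers no flag saying "this URL was relative/schemeless".
var_export(parse_url('callto:+442079460123'));
// e.g. array('scheme' => 'callto', 'path' => '+442079460123') - the number
// is indistinguishable from a genuine filesystem-style path.

// Protocol-relative URLs: PHP < 5.4.7 stuffed the whole thing into 'path';
// newer versions at least recognise the host.
var_export(parse_url('//ajax.googleapis.com/ajax/libs/jquery/1.8.1/jquery.min.js'));

// And parse_url() can give up entirely (return false) on inputs a browser
// would make a best-guess attempt at, such as a triple-slash authority.
var_export(parse_url('http:///example.com'));
```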

1 Answer


Not sure how many bugs parse_url() has, but this might help:

As the "first-match-wins" algorithm is identical to the "greedy" disambiguation method used by POSIX regular expressions, it is natural and commonplace to use a regular expression for parsing the potential five components of a URI reference.

The following line is the regular expression for breaking-down a well-formed URI reference into its components.

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
 12            3  4          5       6  7        8 9

Source: https://www.rfc-editor.org/rfc/rfc3986#page-51

It breaks down the location as:

$2 - scheme
$4 - host (the authority component)
$5 - path
$7 - query string
$9 - fragment

($6 and $8 also capture the query and fragment, but with their leading ? and # included, which is what makes the rebuild below work.)

To rebuild, you could use:

$1 . $3 . $5 . $6 . $8
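For anyone wanting to try this from PHP, a minimal sketch (delimiters added around the RFC's pattern; the example URL is the one RFC 3986 itself uses):

```php
<?php
// Appendix B regex from RFC 3986, wrapped in ~...~ PCRE delimiters.
$re = '~^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?~';

$url = 'http://www.ics.uci.edu/pub/ietf/uri/#Related';
preg_match($re, $url, $m);

// $m[2] = 'http', $m[4] = 'www.ics.uci.edu',
// $m[5] = '/pub/ietf/uri/', $m[9] = 'Related'

// Rebuild from the delimiter-including groups ($1, $3, $6, $8).
// Trailing groups that didn't participate may be absent, hence ?? ''.
$rebuilt = ($m[1] ?? '') . ($m[3] ?? '') . ($m[5] ?? '')
         . ($m[6] ?? '') . ($m[8] ?? '');
// $rebuilt === $url
```

Note this only splits the string; it does no validation at all (an empty string "matches" with every group empty), so it won't flag invalid URLs the way the question asks for.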

3 Comments

Does that parse a URL or merely validate one? That is, are the capture groups around meaningful pieces of data?
@millimoose It doesn't really validate it, just breaks it down into useful pieces; see updated answer.
Thanks - even that is an improvement on parse_url!
