Capture domain and path from URL with regex

Question

I'm trying to write a regex that will capture the domain and path from a URL. I've tried:

https?:\/\/(.+)(\/.*)

That works fine for http://example.com/foo:

Match 1
0.  example.com
1.  /foo

But not what I would expect for http://example.com/foo/bar:

Expected:

Match 1
0.  example.com
1.  /foo/bar

Actual:

Match 1
0.  example.com/foo
1.  /bar

What am I doing wrong?

Is there any reason you want to do this with a regex? The urlparse module from the standard library does this and more.
– Daniel Roseman, Commented Jan 31, 2014 at 21:18
Related question that may help: stackoverflow.com/questions/27745/getting-parts-of-a-url-regex
– dcp, Commented Jan 31, 2014 at 21:21
@DanielRoseman urlparse does a nice job of breaking up the URL, but I want the path including queries, parameters, and fragments. That will be useful for other cases. Thanks!
– Sean W., Commented Feb 3, 2014 at 13:50

GabiMe · Accepted Answer · 2014-01-31 21:26:54Z

6

As noted, this is a non-greedy version: https?:\/\/(.+?)(\/.*)
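A minimal sketch in Python (assumed here because the comments mention urlparse) comparing the greedy pattern from the question with the non-greedy one. The variable names are illustrative only; Python raw strings do not need the slashes escaped.

```python
import re

url = "http://example.com/foo/bar"

# Greedy: (.+) consumes as much as it can, then backtracks only as far
# as the last "/" so that (/.*) can still match.
greedy = re.match(r"https?://(.+)(/.*)", url)

# Non-greedy: (.+?) consumes as little as it can, so it stops at the
# first "/" after the scheme.
lazy = re.match(r"https?://(.+?)(/.*)", url)

print(greedy.groups())  # ('example.com/foo', '/bar')
print(lazy.groups())    # ('example.com', '/foo/bar')
```

Note that both patterns require a slash after the domain, so a URL such as http://example.com with no path does not match at all.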

answered Jan 31, 2014 at 21:26

GabiMe


Palec · Accepted Answer · 2014-02-02 03:16:26Z

6

https?:\/\/(.+)(\/.*)

…

What am I doing wrong?

+ is greedy, so .+ runs through to the last slash it can reach. You should apply it to [^/] instead of a dot.

Also note that your “path” part will also contain the query string and fragment (hash).

This one gets just the domain (+ login, password, port) and path (without query string or fragment).

^https?://([^/]+)(/[^?#]*)?

I leave escaping the slashes accordingly up to you.
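As a rough illustration, the pattern above could be used from Python like this. The helper name split_url and the sample URLs are made up for the example; slashes need no escaping in a Python raw string.

```python
import re

# Pattern from the answer: authority in group 1, optional path in group 2.
AUTHORITY_AND_PATH = re.compile(r"^https?://([^/]+)(/[^?#]*)?")


def split_url(url):
    """Return (authority, path), or None if the URL does not match."""
    m = AUTHORITY_AND_PATH.match(url)
    if m is None:
        return None
    # Group 2 is None when the URL has no path; report it as "".
    return m.group(1), m.group(2) or ""


print(split_url("http://example.com/foo/bar"))
# ('example.com', '/foo/bar')
print(split_url("http://user:pw@example.com:8080/a/b?x=1#frag"))
# ('user:pw@example.com:8080', '/a/b')
print(split_url("http://example.com"))
# ('example.com', '')
```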

Caveat: This expects a valid URI; for valid input it correctly parses the authority and path sections. If you want to parse a URI according to the standard, you need to implement the whole grammar or use the official regex from Appendix B of RFC 2396.

The following line is the regular expression for breaking-down a URI reference into its components.
   ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
    12            3  4          5       6  7        8 9
The numbers in the second line above are only to assist readability; they indicate the reference points for each subexpression (i.e., each paired parenthesis). We refer to the value matched for subexpression <n> as $<n>. For example, matching the above expression to
   http://www.ics.uci.edu/pub/ietf/uri/#Related
results in the following subexpression matches:
   $1 = http:
   $2 = http
   $3 = //www.ics.uci.edu
   $4 = www.ics.uci.edu
   $5 = /pub/ietf/uri/
   $6 = <undefined>
   $7 = <undefined>
   $8 = #Related
   $9 = Related
where <undefined> indicates that the component is not present, as is the case for the query component in the above example. Therefore, we can determine the value of the four components and fragment as
   scheme    = $2
   authority = $4
   path      = $5
   query     = $7
   fragment  = $9
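
The same breakdown can be sketched in Python, where an absent component comes back as None rather than <undefined>. The function name split_uri and the dictionary keys are chosen for this example only; the pattern itself is the one quoted above.

```python
import re

# The RFC's regex, unchanged. Every part is optional, so it matches any string.
URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri(uri):
    """Map the RFC's numbered groups to named components."""
    m = URI_REFERENCE.match(uri)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),      # None when there is no "?"
        "fragment": m.group(9),   # None when there is no "#"
    }


parts = split_uri("http://www.ics.uci.edu/pub/ietf/uri/#Related")
print(parts)
```

For the RFC's sample URI this gives scheme 'http', authority 'www.ics.uci.edu', path '/pub/ietf/uri/', query None and fragment 'Related'.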

edited Feb 2, 2014 at 3:16

answered Jan 31, 2014 at 21:25

Palec


9 Comments

user557597 Over a year ago

Don't need the \/[^?]* part because that class will match the first / past the domain. If you require it, the regex will fail if that is not there in the string.

Palec Over a year ago

@sln Not long ago (several weeks) I looked it up in the RFC. The slash after the domain (and optional port number) is obligatory. If a URL does not have it… well, it is not a URL. If you want to be as forgiving as possible, change it to (\/[^?]*)?.

user557597 Over a year ago

It's a stickler. With ^https?://([^/]+)([^?]*), the first character of capture group 2 has no choice but to be a /; otherwise capture group 1 will have captured to the end of the string, leaving capture group 2 empty. I agree with you, but why fail the regex when you can check the length of group 2 and still get some info on the domain?

Palec Over a year ago

@sln True in this case, as the regex engine never needs to backtrack. But it is still clearer to write the slash there. I think it is not immediately obvious that backtracking cannot happen; if it did, the first group would not need to end immediately before a slash, and a possessive + (++) would be needed.

Palec Over a year ago

@sln I read RFC 1738 on 2014-01-04 and now I realize I forgot what it says. §3.1 and more explicitly §3.3 say that the slash between port and path is not part of path and it is required if and only if something follows (path or query string). Fragment (hash) is not a part of URL, the standard does not speak about it. It is part of URI reference, defined in URI standard, RFC 2396. Reading its §3 now, I realize that a few things changed. The slash is now part of path and may be omitted even when query string is not.


user557597 · Accepted Answer · 2014-01-31 21:30:07Z

0

Something like this 'greedy' version might work. I don't know if Python requires delimiters, so this is just the raw regex.

 #   https?://([^/]+)(.*)

 https?://
 ( [^/]+ )           # (1)
 ( .* )              # (2)
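
On the delimiter question: Python's re module takes the pattern as a plain string, with no delimiters. A sketch of the layout above using re.VERBOSE, which ignores whitespace and allows # comments inside the pattern (the name URL_PARTS and the sample URLs are illustrative):

```python
import re

URL_PARTS = re.compile(
    r"""
    https?://
    ( [^/]+ )   # (1) everything up to the first slash
    ( .* )      # (2) the rest of the URL, possibly empty
    """,
    re.VERBOSE,
)

with_path = URL_PARTS.match("http://example.com/foo/bar?q=1#top")
print(with_path.groups())  # ('example.com', '/foo/bar?q=1#top')

no_path = URL_PARTS.match("http://example.com")
print(no_path.groups())    # ('example.com', '')
```

Because group 2 is just .*, it keeps the query string and fragment, and the match still succeeds (with an empty group 2) when the URL has no path.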

answered Jan 31, 2014 at 21:30

user557597
