I'm trying to write a regex that will capture the domain and path from a URL. I've tried:

https?:\/\/(.+)(\/.*)

That works fine for http://example.com/foo:

Match 1
0.  example.com
1.  /foo

But it does not give what I would expect for http://example.com/foo/bar:

Expected:

Match 1
0.  example.com
1.  /foo/bar

Actual:

Match 1
0.  example.com/foo
1.  /bar

What am I doing wrong?

  • Is there any reason you want to do this with a regex? The urlparse module from the standard library does this and more. Commented Jan 31, 2014 at 21:18
  • Related question that may help: stackoverflow.com/questions/27745/getting-parts-of-a-url-regex Commented Jan 31, 2014 at 21:21
  • @DanielRoseman urlparse does a nice job of breaking up the URL, but I want the path including queries, parameters, and fragments. That will be useful for other cases. Thanks! Commented Feb 3, 2014 at 13:50
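
For reference, a minimal sketch of the standard-library route mentioned in the comments above. It assumes Python 3, where the urlparse module lives in urllib.parse, and the sample URL is made up for illustration. Rejoining the pieces after the authority gives the path together with parameters, query, and fragment:

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical sample URL, chosen to include parameters, a query, and a fragment.
url = "http://example.com/foo/bar;p=1?x=1#top"

parts = urlsplit(url)
domain = parts.netloc

# urlsplit leaves ";p=1" inside the path, so rebuilding without the
# scheme and authority yields everything after the domain.
rest = urlunsplit(("", "", parts.path, parts.query, parts.fragment))

print(domain)  # example.com
print(rest)    # /foo/bar;p=1?x=1#top
```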

3 Answers

As noted, this is the non-greedy version: https?:\/\/(.+?)(\/.*)
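
A minimal Python sketch of the difference, assuming the standard re module and the URL from the question (the slashes are left unescaped, which Python allows):

```python
import re

url = "http://example.com/foo/bar"

# Greedy: (.+) takes as much as it can and gives back only up to the last "/".
greedy = re.match(r"https?://(.+)(/.*)", url)

# Non-greedy: (.+?) stops at the first "/" that lets the rest of the pattern match.
lazy = re.match(r"https?://(.+?)(/.*)", url)

print(greedy.groups())  # ('example.com/foo', '/bar')
print(lazy.groups())    # ('example.com', '/foo/bar')
```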

https?:\/\/(.+)(\/.*)

What am I doing wrong?

+ is greedy, so (.+) matches as far as the last slash. Apply it to [^/] instead of the dot.

Also note that your “path” part will also contain the query string and fragment (hash).

This one gets just the domain (plus login, password, and port) and the path (without query string or fragment):

^https?://([^/]+)(/[^?#]*)?

I leave escaping the slashes accordingly up to you.
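
As a rough illustration in Python, where / is not a delimiter and needs no escaping (the two sample URLs are made up for this sketch):

```python
import re

pattern = re.compile(r"^https?://([^/]+)(/[^?#]*)?")

# With a path, query string, and fragment: group 2 stops before "?".
with_path = pattern.match("http://example.com/foo/bar?x=1#top")
print(with_path.groups())  # ('example.com', '/foo/bar')

# Without a path: the optional group 2 does not participate and is None.
no_path = pattern.match("http://user:pw@example.com:8080")
print(no_path.groups())    # ('user:pw@example.com:8080', None)
```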

Caveat: This expects a valid URI; given one, it correctly parses the authority and path sections. If you want to parse a URI according to the standard, you need to implement the whole grammar or use the official regex from Appendix B of RFC 2396.

The following line is the regular expression for breaking-down a URI reference into its components.

   ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
    12            3  4          5       6  7        8 9

The numbers in the second line above are only to assist readability; they indicate the reference points for each subexpression (i.e., each paired parenthesis). We refer to the value matched for subexpression <n> as $<n>. For example, matching the above expression to

   http://www.ics.uci.edu/pub/ietf/uri/#Related

results in the following subexpression matches:

   $1 = http:
   $2 = http
   $3 = //www.ics.uci.edu
   $4 = www.ics.uci.edu
   $5 = /pub/ietf/uri/
   $6 = <undefined>
   $7 = <undefined>
   $8 = #Related
   $9 = Related

where <undefined> indicates that the component is not present, as is the case for the query component in the above example. Therefore, we can determine the value of the four components and fragment as

   scheme    = $2
   authority = $4
   path      = $5
   query     = $7
   fragment  = $9
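
The same breakdown can be sketched in Python. The names URI_RE and parts are chosen here for illustration; a group that does not participate in the match comes back as None, which plays the role of <undefined>:

```python
import re

# The regular expression quoted above, unchanged.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

m = URI_RE.match("http://www.ics.uci.edu/pub/ietf/uri/#Related")

parts = {
    "scheme": m.group(2),
    "authority": m.group(4),
    "path": m.group(5),
    "query": m.group(7),     # None: the example has no query component
    "fragment": m.group(9),
}
print(parts)
```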

9 Comments

You don't need the \/[^?]* part, because that class will match the first / past the domain. If you require it, the regex will fail when it is not present in the string.
@sln Not long ago (several weeks) I looked it up in the RFC. The slash after the domain (and optional port number) is obligatory. If a URL does not have it… well, it is not a URL. If you want to be as forgiving as possible, change it to (\/[^?]*)?.
It's a stickler. With ^https?://([^/]+)([^?]*), the first character of capture group 2 has no choice but to be a /; otherwise capture group 1 will have captured to the end of the string, leaving capture group 2 empty. I agree with you, but why fail the regex when you can check the length of group 2 and still get some info on the domain?
@sln True in this case, as the regex engine never needs to backtrack. But it is still clearer to write the slash there. I think it is not immediately obvious that backtracking cannot happen, and if it did, the first group would not need to end immediately before a slash. A possessive + (++) would be needed.
@sln I read RFC 1738 on 2014-01-04 and now I realize I forgot what it says. §3.1 and, more explicitly, §3.3 say that the slash between port and path is not part of the path and is required if and only if something follows (path or query string). The fragment (hash) is not part of a URL; that standard does not mention it. It is part of a URI reference, defined in the URI standard, RFC 2396. Reading its §3 now, I realize that a few things changed. The slash is now part of the path and may be omitted even when the query string is not.

Something like this 'greedy' version might work. I don't know if Python requires delimiters, so this is just the raw regex.

 #   https?://([^/]+)(.*)

 https?://
 ( [^/]+ )           # (1)
 ( .* )              # (2)
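
Python takes no delimiters around a regex, and the spaced-out layout above can be kept by compiling with re.VERBOSE. A minimal sketch, with a made-up sample URL:

```python
import re

pattern = re.compile(
    r"""
    https?://
    ( [^/]+ )   # (1) everything up to the first slash
    ( .* )      # (2) the rest, including any query string and fragment
    """,
    re.VERBOSE,
)

m = pattern.match("http://example.com/foo/bar?x=1#top")
print(m.groups())  # ('example.com', '/foo/bar?x=1#top')

# With no path at all, group 2 is an empty string rather than a failed match.
bare = pattern.match("http://example.com")
print(bare.groups())  # ('example.com', '')
```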
