URL regex excluding a specific domain not matching correctly

Question

I'm trying to match some strings with a regex, but it's not working. I want to match URLs that do not start with http://www.domain.com. Here is my regex:

^https?:\/\/(www\.)?(?!domain\.com)

Is there a problem with my regex?

I want to match strings starting with http:// but different from http://site.com. For example:

/page.html => false
http://www.google.fr => true
http://site.com => false
http://site.com/page.html => false

^ outside a character class means "start of line", not "not".
– Wooble, Commented Mar 27, 2013 at 15:47
Can you post an example of what you expect to/not to match but doesn't/does? The regex looks reasonable. Also there's no need to escape /.
– FatalError, Commented Mar 27, 2013 at 15:49

Daedalus · Accepted Answer · 2013-03-27 16:02:58Z

7

Use this to match a URL that does not have the domain you mention:

https?://(?!(www\.domain\.com\/?)).*

Example in action: http://regexr.com?34a7p
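
As a quick check of the pattern above, here is a minimal sketch using Python's `re` module. The sample URLs are hypothetical and chosen only for illustration; the pattern itself is copied unchanged from this answer.

```python
import re

# Pattern copied from the answer above; only the Python wrapper is new.
pattern = re.compile(r'https?://(?!(www\.domain\.com\/?)).*')

# Hypothetical sample URLs, chosen for illustration.
samples = [
    'http://www.domain.com',
    'https://www.domain.com/page.html',
    'http://domain.com',
    'http://www.google.fr',
]

# Map each URL to whether the pattern matches from the start of the string.
results = {url: bool(pattern.match(url)) for url in samples}
for url, matched in results.items():
    print(url, '=>', matched)
```

With this pattern the two www.domain.com URLs are rejected, but the bare http://domain.com is still accepted, because the look-ahead only tests for the www. form.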

answered Mar 27, 2013 at 16:02


JonM · Accepted Answer · 2013-03-27 16:17:11Z

1

The problem here is that when the negative look-ahead fails (because domain\.com really does follow), the regex engine backtracks into the previous group, (www\.), which is quantified as optional, and checks whether the expression succeeds without it. At that position the look-ahead sees www.domain.com, which does not start with domain.com, so the assertion passes and the expression matches. This is what you have overlooked.
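
The backtracking described above can be observed directly. This is a small sketch in Python (assumed here because the thread discusses Python's regex engine); the pattern is the one from the question.

```python
import re

# The pattern from the question (escaped slashes kept as written).
original = re.compile(r'^https?:\/\/(www\.)?(?!domain\.com)')

m = original.search('http://www.domain.com')
# The match succeeds, but only because the engine gave up on "www.":
# the matched text stops right after "http://".
print(repr(m.group(0)))

# Without "www." there is nothing to backtrack into, so this URL is rejected.
bare = original.search('http://domain.com')
print(bare)
```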

This could be fixed with atomic grouping or possessive quantifiers, which make the engine 'forget' the possibility of backtracking. Unfortunately, Python's re module did not support these natively when this was written (both were added in Python 3.11). Instead you'll have to use a potentially less efficient method: a larger look-ahead that contains the optional group.

^https?:\/\/(?!(www\.)?(domain\.com))
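
A minimal sketch comparing the two approaches in Python. The sample URLs are hypothetical, and the possessive variant is only compiled on Python 3.11 or later, where the `re` module accepts `?+`.

```python
import re
import sys

# Wider look-ahead: the optional "www." lives inside the assertion,
# so backtracking cannot skip it to dodge the check.
wide = re.compile(r'^https?://(?!(?:www\.)?domain\.com)')

# Possessive quantifier "?+" (Python 3.11+ only): once "www." is consumed
# it is never given back, so the original, narrower look-ahead suffices.
possessive = None
if sys.version_info >= (3, 11):
    possessive = re.compile(r'^https?://(?:www\.)?+(?!domain\.com)')

# Hypothetical sample URLs for illustration.
urls = ['http://www.domain.com', 'http://domain.com', 'http://www.google.fr']

wide_results = [bool(wide.search(u)) for u in urls]
print(wide_results)
if possessive is not None:
    print([bool(possessive.search(u)) for u in urls])
```

Both variants reject the two domain.com URLs and accept http://www.google.fr.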

edited Mar 27, 2013 at 16:17

answered Mar 27, 2013 at 16:06


2 Comments

Martijn Pieters Over a year ago

The OP still needs to match lines starting with http:// or https://, just not with the domain name.

JonM Over a year ago

Good point, while it shouldn't have an effect on the overall results of the expression, it could potentially make it much less efficient. I have changed the answer to reflect this.

Martijn Pieters · Accepted Answer · 2013-03-27 16:08:00Z

0

You want a negative look-ahead assertion:

^https?://(?!(?:www\.)?site\.com).+

Which gives:

>>> import re
>>> testdata = '''\
... /page.html => false
... http://www.google.fr => true
... http://site.com => false
... http://site.com/page.html => false
... '''.splitlines()
>>> not_site_com = re.compile(r'^https?://(?!(?:www\.)?site\.com).+')
>>> for line in testdata:
...     match = not_site_com.search(line)
...     if match: print(match.group())
... 
http://www.google.fr => true

Note that the pattern excludes both www.site.com and site.com:

>>> not_site_com.search('https://www.site.com')
>>> not_site_com.search('https://site.com')
>>> not_site_com.search('https://site-different.com')
<_sre.SRE_Match object at 0x10a548510>
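
One edge case worth checking: the look-ahead tests a prefix, so a host that merely starts with site.com is excluded as well. The sketch below is not part of the original answer; `whole_host` is a hypothetical variant that requires the host to end right after site.com, and the look-alike URL is made up for illustration.

```python
import re

# Pattern from the answer above, for comparison.
prefix_only = re.compile(r'^https?://(?!(?:www\.)?site\.com).+')

# Hypothetical variant: the excluded host must end right after "site.com",
# i.e. be followed by "/", ":", "?", "#" or the end of the string.
whole_host = re.compile(r'^https?://(?!(?:www\.)?site\.com(?:[/:?#]|$)).+')

# Hypothetical host that merely starts with "site.com".
lookalike = 'http://site.community.org'

print(bool(prefix_only.search(lookalike)))                   # False
print(bool(whole_host.search(lookalike)))                    # True
print(bool(whole_host.search('http://site.com/page.html')))  # False
```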

edited Mar 27, 2013 at 16:08

answered Mar 27, 2013 at 15:55


1 Comment

Martijn Pieters Over a year ago

@guillaume: right, then you still need a negative look-ahead assertion.
