Regex for excluding URL

Question

I am working with an email company that has a feature where they spider your site in order to provide custom content. I can have the spider ignore URLs based on regex patterns I provide.

For this system a pattern starts and ends with a "/".

What I'm trying to do is ignore http://www.website.com/2011/10 BUT allow http://www.website.com/2011/10/title-of-page.html

I would have thought the pattern below would work since it does not have a trailing slash but no luck.

Any ideas?

/http:\/\/www\.website\.com\/[0-9][0-9][0-9][0-9]\/[0-9][0-9]/

Tim Pietzcker · Accepted Answer · 2011-10-24 14:21:17Z

1

Your regex matches a part of the URL, so you need to tell it not to allow a slash to follow it:

/http:\/\/www\.website\.com\/[0-9]{4}\/[0-9][0-9](?!\/)/

If you want to also avoid other partial matches like in http://www.website.com/2011/100, then an additional word boundary might help:

/http:\/\/www\.website\.com\/[0-9]{4}\/[0-9][0-9]\b(?!\/)/
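To check how the two patterns above behave, here is a minimal sketch using Python's `re` module as a stand-in for the spider's engine, which is not specified in the question. It assumes the spider's engine supports lookahead and `\b` the way Python does. The surrounding "/" delimiters are dropped, so the slashes need no escaping, and the helper name `is_ignored` is made up for illustration.

```python
import re

# Patterns from the answer, minus the "/" delimiters the spider expects.
NO_SLASH = re.compile(r"http://www\.website\.com/[0-9]{4}/[0-9][0-9](?!/)")
WITH_BOUNDARY = re.compile(r"http://www\.website\.com/[0-9]{4}/[0-9][0-9]\b(?!/)")


def is_ignored(pattern, url):
    """Return True if the pattern matches anywhere in the URL (hypothetical helper)."""
    return pattern.search(url) is not None


ARCHIVE = "http://www.website.com/2011/10"
ARTICLE = "http://www.website.com/2011/10/title-of-page.html"
THREE_DIGITS = "http://www.website.com/2011/100"

for url in (ARCHIVE, ARTICLE, THREE_DIGITS):
    print(url, is_ignored(NO_SLASH, url), is_ignored(WITH_BOUNDARY, url))
```

Both patterns ignore the archive URL and let the article URL through. Only the version with `\b` also lets `/2011/100` through, because there is no word boundary between the second and third digit.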


Ofir · Accepted Answer · 2011-10-24 14:20:57Z

1

It depends on the regex engine, but you can probably either anchor the pattern with $ (if each URL is tokenised beforehand, so the pattern is tested against one URL at a time) or follow it with a match for whitespace and delimiters.
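As a rough illustration of both options, again using Python's `re` module as a stand-in for the unknown spider engine: the first pattern assumes each URL is tested on its own and anchors with `$`, while the second assumes the URL sits inside surrounding text and uses a lookahead for whitespace or end of input. The delimiter set `\s` is only an assumption, and a real page might need quotes or angle brackets added.

```python
import re

# Option 1: the URL is tested on its own, so "$" can anchor the end.
ANCHORED = re.compile(r"http://www\.website\.com/[0-9]{4}/[0-9]{2}$")

# Option 2: the URL is embedded in text, so require whitespace or the
# end of input right after the month (assumed delimiter set).
DELIMITED = re.compile(r"http://www\.website\.com/[0-9]{4}/[0-9]{2}(?=\s|$)")

single_urls = [
    "http://www.website.com/2011/10",
    "http://www.website.com/2011/10/title-of-page.html",
    "http://www.website.com/2011/100",
]
anchored_results = [ANCHORED.search(u) is not None for u in single_urls]

embedded = [
    "see http://www.website.com/2011/10 for the archive",
    "see http://www.website.com/2011/10/title-of-page.html for the post",
]
delimited_results = [DELIMITED.search(t) is not None for t in embedded]

print(anchored_results)
print(delimited_results)
```

In both cases only the bare month archive URL matches, so the article page would still be spidered.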
