I'm working with an email company whose service spiders your site in order to provide custom content. I can have the spider ignore URLs based on regex patterns I provide.

For this system a pattern starts and ends with a "/".

What I'm trying to do is ignore http://www.website.com/2011/10 BUT allow http://www.website.com/2011/10/title-of-page.html

I would have thought the pattern below would work, since it does not have a trailing slash, but no luck: the article URL is ignored as well.

Any ideas?

/http:\/\/www\.website\.com\/[0-9][0-9][0-9][0-9]\/[0-9][0-9]/

2 Answers

1

Your regex is not anchored, so it matches any URL that contains the pattern, including the article URL. You need to tell it not to allow a slash to follow the month:

/http:\/\/www\.website\.com\/[0-9]{4}\/[0-9][0-9](?!\/)/

If you want to also avoid other partial matches like in http://www.website.com/2011/100, then an additional word boundary might help:

/http:\/\/www\.website\.com\/[0-9]{4}\/[0-9][0-9]\b(?!\/)/
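To check how the three patterns behave, here is a minimal Python sketch. It assumes the spider's regex engine supports lookahead and `\b` the way Python's `re` module does, which the question does not confirm. The helper `is_ignored` and the sample URLs are hypothetical, and the surrounding `/` delimiters are dropped because they belong to the spider system's syntax, not to the pattern itself.

```python
import re

# Patterns without the system's "/" delimiters; "/" needs no escaping in Python.
ORIGINAL = r"http://www\.website\.com/[0-9]{4}/[0-9][0-9]"
NO_SLASH = r"http://www\.website\.com/[0-9]{4}/[0-9][0-9](?!/)"
BOUNDED = r"http://www\.website\.com/[0-9]{4}/[0-9][0-9]\b(?!/)"


def is_ignored(pattern, url):
    """Hypothetical helper: True if the pattern matches anywhere in the URL."""
    return re.search(pattern, url) is not None


ARCHIVE = "http://www.website.com/2011/10"
ARTICLE = "http://www.website.com/2011/10/title-of-page.html"
THREE_DIGITS = "http://www.website.com/2011/100"

for name, pattern in [("original", ORIGINAL), ("no slash", NO_SLASH), ("bounded", BOUNDED)]:
    for url in (ARCHIVE, ARTICLE, THREE_DIGITS):
        print(f"{name:9} {url:52} ignored={is_ignored(pattern, url)}")
```

The original pattern flags all three URLs, the lookahead version stops flagging the article, and the word-boundary version also stops flagging the `/2011/100` case.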

1

It depends on the regex engine, but you can probably either anchor the pattern with $ (if each URL is tokenised beforehand and matched on its own) or follow it with a match for whitespace and delimiters.
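As a rough illustration of the `$` option, the Python sketch below assumes each URL is handed to the regex as a complete string of its own, which may not be how the spider works. The name `is_ignored` and the sample URLs are hypothetical.

```python
import re

# "$" requires the URL to end right after the two-digit month.
ANCHORED = re.compile(r"http://www\.website\.com/[0-9]{4}/[0-9]{2}$")


def is_ignored(url):
    """Hypothetical helper: True if the anchored pattern matches the URL."""
    return ANCHORED.search(url) is not None


for url in (
    "http://www.website.com/2011/10",
    "http://www.website.com/2011/10/title-of-page.html",
    "http://www.website.com/2011/100",
):
    print(url, is_ignored(url))
```

If the archive URL can also appear with a trailing slash, ending the pattern with `/?$` instead would cover both forms.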
