
I have this regex, /href=('|")(\w+|\/dashboard)/, which matches every HTML anchor whose href starts with /dashboard or with something/without/a/slash/at/the/beginning.

So this regex matches:

<a href="/dashboard/security-settings"></a>
<a href='dashboard/security-settings'></a>
<a href='something/security-settings'></a>

But not:

<a href="/home"></a>
<a href="/about"></a>

The issue here is that it also matches:

<a href="http://www.google.com"></a>
<a href="www.facebook.com"></a>

How can I exclude hrefs starting with http or www from the regex? I tried playing with the ^ operator, with no luck:

href=('|")(([^http][^www]|\w+)|\/dashboard)
  • Which language? Why use regex? Use an HTML parser. Commented Jul 4, 2013 at 13:22

2 Answers


A ^ at the start of a character class negates a set of individual characters, not a string. So [^http] actually means "match one character that is neither an h nor a t nor a p".
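To make the character-class behaviour concrete, here is a minimal sketch in Python (an assumption, since the question does not name a language). The names `single` and `pair` are only illustrative.

```python
import re

# [^http] is a single-character class: "any one character except h, t or p".
# The repeated "t" is redundant, so it behaves exactly like [^htp].
single = re.compile(r"[^http]")

print(bool(single.fullmatch("x")))   # one character that is not h, t or p
print(bool(single.fullmatch("h")))   # h is excluded by the class
print(bool(single.fullmatch("xy")))  # two characters, but the class consumes only one

# [^http][^www] therefore consumes exactly two characters:
# one that is not h/t/p, followed by one that is not w.
pair = re.compile(r"[^http][^www]")

print(bool(pair.fullmatch("da")))    # d is allowed first, a is allowed second
print(bool(pair.fullmatch("ht")))    # rejected: first character is h
print(bool(pair.fullmatch("dw")))    # rejected: second character is w
```

This is why the attempt in the question does not exclude whole prefixes: it only constrains the first two characters of the value individually.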

You need a negative lookahead assertion instead:

href=(['"])(?!http|www)(\w+|/dashboard)
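As a quick check, the pattern above can be run against the sample anchors from the question. This is only a sketch in Python (the question does not say which language is in use); the names `pattern`, `samples` and `matched` are illustrative.

```python
import re

# Pattern from the answer: (?!http|www) rejects values that start with http or www.
pattern = re.compile(r"""href=(['"])(?!http|www)(\w+|/dashboard)""")

samples = [
    '<a href="/dashboard/security-settings"></a>',
    "<a href='dashboard/security-settings'></a>",
    "<a href='something/security-settings'></a>",
    '<a href="/home"></a>',
    '<a href="/about"></a>',
    '<a href="http://www.google.com"></a>',
    '<a href="www.facebook.com"></a>',
]

# Keep only the anchors the pattern accepts.
matched = [s for s in samples if pattern.search(s)]

for s in samples:
    print("match   " if s in matched else "no match", s)
```

Only the first three anchors are accepted. Note that the lookahead tests the prefix only, so a relative path that happens to begin with "http" or "www" would be rejected as well.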

The simplest solution is:

/^href=['"](\w+|\/dashboard)/

The ^ anchor (when used at the start of the regexp) only allows a match at the beginning of the input, or of a line in multiline mode, so it only matches strings that begin with href.

As others have mentioned, you can use a negative lookahead to explicitly filter out strings that begin with http or www. However, a string starting with ftp:// (or any prefix other than "http" or "www") would still be matched by a pattern whose lookahead only excludes "http" and "www". In this case it seems better to use a whitelist than a blacklist that has to contain everything you don't want to match.
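The difference can be illustrated with a short Python sketch. The `blacklist` pattern is the lookahead version mentioned above. The `whitelist` pattern is a hypothetical variant, not taken from any answer here: it assumes that a wanted href is either /dashboard or a single word, immediately followed by a slash or the closing quote. The example.com URL is made up for the test.

```python
import re

# Blacklist: the lookahead excludes only the prefixes it lists.
blacklist = re.compile(r"""href=(['"])(?!http|www)(\w+|/dashboard)""")

# Hypothetical whitelist: accept /dashboard or one word, but only when it is
# followed by "/" or by the same quote character that opened the value.
whitelist = re.compile(r"""href=(['"])(?:/dashboard|\w+)(?=/|\1)""")

samples = [
    '<a href="/dashboard/security-settings"></a>',
    "<a href='something/security-settings'></a>",
    '<a href="http://www.google.com"></a>',
    '<a href="www.facebook.com"></a>',
    '<a href="ftp://example.com/file"></a>',  # made-up URL, not from the question
]

for s in samples:
    print(bool(blacklist.search(s)), bool(whitelist.search(s)), s)
```

The ftp:// anchor passes the blacklist but not the whitelist. The trade-off is that this whitelist also rejects relative links whose first segment contains other characters, such as page.html, so the accepted shape would need adjusting to the links actually in use.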
