2

I currently have str.match(/(http[^\s]+)/i), which captures not only links in the content but also URLs inside img tags (src="http...") and anchor tags (href="http...").

How do I modify my regex so that it matches only an http/https URL that has no "src=" or "href=" before it?

4
  • stackoverflow.com/questions/4643142/… Commented Apr 22, 2015 at 20:15
  • May be easiest to just get all text nodes first and search only those but it depends on what you're doing. Commented Apr 22, 2015 at 20:15
  • Can you post some sample data? Commented Apr 22, 2015 at 20:16
  • 1
    Maybe parsing HTML with regular expressions isn't really a good idea, and you should get the proper elements, then the text from those elements, before you use a regex? Commented Apr 22, 2015 at 20:18

3 Answers

3

You can require a leading whitespace character by adding \s. Inside href or src, the URL is not preceded by whitespace, whereas in normal text it usually is.

str.match(/\s(http[^\s]+)/i)
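
One caveat with the pattern above is that a URL at the very start of the string has no whitespace before it and is therefore missed. A minimal sketch of a variant that also accepts the start of the string follows; the helper name findTextUrls and the sample input are made up for illustration and are not part of the original answer.

```javascript
// Hypothetical helper (not from the original answer): collects every URL that
// is preceded by whitespace or sits at the very start of the string.
function findTextUrls(str) {
  // (?:^|\s) accepts either the start of the string or one whitespace character;
  // capture group 1 holds the URL without that leading character.
  const re = /(?:^|\s)(http[^\s]+)/gi;
  return Array.from(str.matchAll(re), (m) => m[1]);
}

// Made-up sample input: two URLs in text, one inside an href attribute.
const sample =
  'http://a.example/start then <a href="http://b.example/x">link</a> and http://c.example/end';

console.log(findTextUrls(sample));
// → [ 'http://a.example/start', 'http://c.example/end' ]
```

This still relies on attribute values being written directly after = or a quote, and it will not find a URL that immediately follows a tag such as <p> with no space in between.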



1

You can match links whose http/https is not immediately preceded by an = or a double quote:

str.match(/[^=\"](http[^\s]+)/i)
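
Because [^="] consumes one character, the pattern above cannot match a URL at the very start of the string, and it does not exclude single-quoted attributes. The sketch below swaps in a negative lookbehind to express the same idea without consuming a character. It assumes an engine with lookbehind support (ES2018 or later); the helper name and the sample input are hypothetical.

```javascript
// Hypothetical helper (not from the original answer): matches URLs that are
// not immediately preceded by =, a double quote, or a single quote.
function findUnquotedUrls(str) {
  // (?<![="']) is a negative lookbehind: it checks the previous character
  // without making it part of the match, so a URL at index 0 still matches.
  const re = /(?<![="'])http[^\s]+/gi;
  return str.match(re) || [];
}

// Made-up sample input with double-quoted and single-quoted attributes.
const html =
  'http://a.example/start <img src="http://b.example/i.png"> ' +
  "<a href='http://c.example'>x</a> see http://d.example/end";

console.log(findUnquotedUrls(html));
// → [ 'http://a.example/start', 'http://d.example/end' ]
```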


0

A simple http[^\s]+ (equivalent to http\S+) will overmatch.

I'd suggest using a regex that matches text outside of tags and whitelists the tags in which the text is allowed to appear. Here is the regex:

/(?![^<]*>|[^<>]*<\/(?!p\b|td|pre))https?:\/\/[a-z0-9&#=.\/\-?_]+/gi

The (?!p\b|td|pre) part is where we add whitelisted tags. Because the character class does not include a comma, the regex won't capture the trailing comma in http://example.com,.
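
As a rough illustration of how this regex behaves, the sketch below runs it on a made-up HTML fragment. The variable names and the sample input are assumptions for demonstration only.

```javascript
// The regex from the answer, unchanged.
const urlOutsideTags =
  /(?![^<]*>|[^<>]*<\/(?!p\b|td|pre))https?:\/\/[a-z0-9&#=.\/\-?_]+/gi;

// Made-up sample: one URL in <p> text, one in an href attribute,
// and one as the link text of an <a> element (a is not whitelisted).
const fragment =
  '<p>See http://example.com, for details.</p>' +
  '<a href="http://example.org">http://example.net</a>';

const found = fragment.match(urlOutsideTags) || [];

console.log(found);
// → [ 'http://example.com' ]
```

The href URL is rejected by the first lookahead branch (it is followed by > before any <), and the link text is rejected by the second branch because the next closing tag is </a>, which is not on the whitelist.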


