I know there are many solutions, articles and libraries for this, but I couldn't find one that matches my case. I'm trying to write a regex to extract a URL (which represents a website) from a text (a person's email signature). It has to handle multiple cases:

  • Could contain http(s):// , or not
  • Could contain www. , or not
  • Could have a multi-part TLD, such as "test.com.cn"

Here are some examples:

www.test.com
https://test.com.cn
http://www.test.com.cn
test.com
test.com.cn

I've come up with the following regex:

(https?://)?(www\.)?\w{2,}\.[a-zA-Z]{2,}(\.[a-zA-Z]{2,})?$

But there are two main problems with this, because the signature can also contain an email address:

  1. It (wrongly) captures the domain of email addresses like this one: [email protected]
  2. It doesn't capture URLs in the middle of a line, and if I remove the $ at the end, it captures the name.surname part of the last example

For (1), I tried a negative lookbehind by adding (?<!@) at the beginning. The problem is that it now captures est2.com instead of not matching at all, because the engine simply starts the match one character later.
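The two behaviours described above can be reproduced with a short Python sketch. The address john.doe@test2.com is a hypothetical stand-in (the addresses in the post are obfuscated), and first_match is a helper name made up for this illustration.

```python
import re

# Hypothetical address; the real ones in the post are obfuscated.
EMAIL = "john.doe@test2.com"

# The regex from the question, and the same regex with (?<!@) prepended.
ORIGINAL = r"(https?://)?(www\.)?\w{2,}\.[a-zA-Z]{2,}(\.[a-zA-Z]{2,})?$"
WITH_LOOKBEHIND = r"(?<!@)" + ORIGINAL


def first_match(pattern, text):
    """Return the text of the first match, or None if nothing matches."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


# Without the lookbehind, the domain of the email address is captured.
print(first_match(ORIGINAL, EMAIL))         # test2.com

# With the lookbehind, the start right after "@" is rejected, so the
# engine retries one character later and still finds a match.
print(first_match(WITH_LOOKBEHIND, EMAIL))  # est2.com
```

The lookbehind only forbids one start position; nothing forces the match to begin at the start of a word, which is why a shorter match is still found.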

1 Answer

I think you could use \b (word boundary) instead of $ (and at the beginning as well), and exclude @ with a negative lookbehind and a negative lookahead:

(?<!@|\.|-)\b(https?://)?(www\.)?\w{2,}\.[a-zA-Z]{2,}(\.[a-zA-Z]{2,})?\b(?!@|\.|-)

Edit: also exclude the dot (and any other non-alphanumeric character likely to occur in a URL or email address) in both lookarounds, to avoid matching name.middlename in [email protected] or com.cn in [email protected]. See this answer for the list of characters.
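As a rough check, here is a minimal Python sketch that applies the regex above to a made-up signature. The sample text, the email addresses and the function name extract_urls are assumptions for illustration only. Python's re module accepts this lookbehind because every alternative in it is exactly one character wide.

```python
import re

# The regex from this answer, compiled once.
URL_RE = re.compile(
    r"(?<!@|\.|-)\b(https?://)?(www\.)?\w{2,}\.[a-zA-Z]{2,}"
    r"(\.[a-zA-Z]{2,})?\b(?!@|\.|-)"
)


def extract_urls(text):
    """Return every full match of URL_RE found in text.

    finditer is used instead of findall because the pattern has
    capturing groups, and findall would return the groups only.
    """
    return [m.group(0) for m in URL_RE.finditer(text)]


# Hypothetical signature; names and addresses are invented.
signature = (
    "John Doe\n"
    "john.doe@test2.com\n"
    "Visit www.test.com or https://test.com.cn today\n"
)

print(extract_urls(signature))
# ['www.test.com', 'https://test.com.cn']

# An address with a multi-part domain yields no match at all.
print(extract_urls("name.middlename.surname@test.com.cn"))
# []
```

One caveat of this approach: because the lookahead rejects a following dot, a URL that ends a sentence (for example "test.com.") would not be matched either.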

2 Comments

Thanks, it almost works, but now it catches com.cn in [email protected]
Haha, right! My edit should apply to the lookbehind as well. Changing it right away.
