4

I'm trying to make a regular expression that will correctly capture URLs, including ones that are wrapped in parenthesis as in (http://example.com) and spoken about on coding horror at https://blog.codinghorror.com/the-problem-with-urls/

I'm currently using the following to create HTML A tags in python for links that start with http and www.

r1 = r"(\b(http|https)://([-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]))"
r2 = r"((^|\b)www\.([-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]))"
return re.sub(r2,r'<a rel="nofollow" target="_blank" href="http://\1">\1</a>',re.sub(r1,r'<a rel="nofollow" target="_blank" href="\1">\1</a>',text))

this works well except for the case where someone wraps the url in parens. Does anyone have a better way?

1 Answer 1

4

Problem is, URLs could have parenthesis as part of them... (http://en.wikipedia.org/wiki/Tropical_Storm_Alberto_(2006)) . You can't treat that with regexp alone, since it doesn't have state. You need a parser. So your best chance would be to use a parser, and try to guess the correct close parenthesis. That is error-prone (the url could open parenthesis and never close it) so I guess you're out of luck anyway.

See also http://en.wikipedia.org/wiki/, or (http://en.wikipedia.org/wiki/)) and other similar valid URLs.

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.