Find Hyperlinks in Text using Python (twitter related)

Question

How can I parse text and find all instances of hyperlinks within a string? The hyperlinks will not be in the HTML format <a href="http://test.com">test</a>, but plain URLs such as http://test.com

Secondly, I would like to convert the original string by replacing every hyperlink with a clickable HTML hyperlink.

I found an example in this thread:

Easiest way to convert a URL to a hyperlink in a C# string?

but was unable to reproduce it in Python :(

You should use example.com for example URLs. See en.wikipedia.org/wiki/Example.com
– John Fouhy, Commented Apr 6, 2009 at 3:29
Thanks John! I did not know that those are official example domains.
– Dan Rosenstark, Commented Dec 24, 2009 at 13:40

Community · Accepted Answer · 2017-05-23 12:30:26Z

23

Here's a Python port of Easiest way to convert a URL to a hyperlink in a C# string?:

import re

myString = "This is my tweet check it out http://tinyurl.com/blah"

# Match http:// followed by any run of non-space characters
r = re.compile(r"(http://[^ ]+)")
# print is a function in Python 3
print(r.sub(r'<a href="\1">\1</a>', myString))

Output:

This is my tweet check it out <a href="http://tinyurl.com/blah">http://tinyurl.com/blah</a>

edited May 23, 2017 at 12:30

CommunityBot


answered Apr 6, 2009 at 2:53

maxyfc


2 Comments

bortzmeyer Over a year ago

It can be improved by adding support for https or ftp URLs... Also, I believe the scheme (http) is case-INsensitive.

tripleee Over a year ago

See stackoverflow.com/questions/1986059/… for hopefully a better regular expression.
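Following bortzmeyer's comment, here is a minimal sketch of the same substitution extended to https and ftp, with the scheme matched case-insensitively. The helper name linkify, the exact pattern, and the use of html.escape are assumptions for illustration and not part of the original answer. Escaping keeps characters such as & in a URL from producing invalid HTML.

```python
import html
import re

# Scheme list and pattern are illustrative assumptions; extend as needed.
URL_RE = re.compile(r'(?:https?|ftp)://[^\s<>"]+', re.IGNORECASE)


def linkify(text):
    """Escape text for HTML and wrap each matched URL in an <a> tag."""
    parts = []
    last = 0
    for match in URL_RE.finditer(text):
        url = html.escape(match.group(0), quote=True)
        # Escape the plain text between the previous URL and this one.
        parts.append(html.escape(text[last:match.start()]))
        parts.append('<a href="{0}">{0}</a>'.format(url))
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)


print(linkify("Docs at HTTPS://example.com/a?x=1&y=2 and ftp://example.org/f"))
```

Note that this simple pattern still swallows trailing punctuation (for example a full stop directly after a URL), which the more elaborate expressions in the other answers try to avoid.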

dfrankow · Accepted Answer · 2021-08-27 18:20:31Z

10

Here is a much more sophisticated regexp from 2002.

@yoniLavi minified this to:

re.compile(r'\b(?:https?|telnet|gopher|file|wais|ftp):[\w/#~:.?+=&%@!\-.:?\\-]+?(?=[.:?\-]*(?:[^\w/#~:.?+=&%@!\-.:?\-]|$))')
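As a rough usage sketch, the compiled pattern can be used both to list the URLs and to wrap them in anchor tags. The pattern is copied verbatim from above; the sample text and variable names are made up for illustration.

```python
import re

# Pattern copied verbatim from the answer above.
URL_RE = re.compile(r'\b(?:https?|telnet|gopher|file|wais|ftp):[\w/#~:.?+=&%@!\-.:?\\-]+?(?=[.:?\-]*(?:[^\w/#~:.?+=&%@!\-.:?\-]|$))')

text = "See http://example.com/page?x=1. Also ftp://example.org/file, ok"

# The pattern has no capturing groups, so findall returns whole matches.
urls = URL_RE.findall(text)
print(urls)

# Replace each match with a clickable link; trailing punctuation stays outside.
linked = URL_RE.sub(lambda m: '<a href="{0}">{0}</a>'.format(m.group(0)), text)
print(linked)
```

The lookahead at the end of the pattern is what keeps the sentence-ending full stop and the comma out of the matched URLs.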

edited Aug 27, 2021 at 18:20

answered Jan 20, 2010 at 15:45

dfrankow


2 Comments

yoniLavi Over a year ago

I found it very useful too, and minified it to:

re.compile(r'\b(?:https?|telnet|gopher|file|wais|ftp):[\w/#~:.?+=&%@!\-.:?\\-]+?(?=[.:?\-]*(?:[^\w/#~:.?+=&%@!\-.:?\-]|$))')

dlink Over a year ago

Great stuff, but what if the URL does not have the http:// prefix. Usually we don't specify that part any more in emails and social media.
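One rough way to cover part of dlink's case is to also accept tokens that start with www. and to add a default scheme to the href. This is only a sketch under that assumption: the pattern and the helper name to_anchor are hypothetical, and bare domains such as example.com are still not detected (that needs knowledge of valid top-level domains, which is what the urlextract library in another answer provides).

```python
import re

# Assumption: treat "www."-prefixed tokens as URLs in addition to http(s) ones.
URL_RE = re.compile(r'\b(?:https?://|www\.)[^\s<>"]+', re.IGNORECASE)


def to_anchor(match):
    """Build an <a> tag, adding a default scheme to the href when missing."""
    url = match.group(0)
    href = url if "://" in url else "http://" + url
    return '<a href="{}">{}</a>'.format(href, url)


print(URL_RE.sub(to_anchor, "Visit www.example.com or https://example.org today"))
```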

Erock · Accepted Answer · 2016-07-28 16:27:17Z

5

Django also has a solution that doesn't rely on a regex alone: django.utils.html.urlize(). I found it very helpful, especially if you are already using Django.

You can also extract the code to use in your own project.

edited Jul 28, 2016 at 16:27

Erock


answered Jan 24, 2012 at 6:16

Kekoa



jmoz · Accepted Answer · 2012-10-25 22:57:03Z

2

Jinja2 (Flask uses this) has a filter urlize which does the same.

Docs
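A minimal sketch of using the filter outside Flask, assuming the jinja2 package is installed. The template string and the variable name tweet are made up for illustration; the exact attributes on the generated tag (for example rel) can differ between Jinja2 versions, so no output is shown.

```python
from jinja2 import Environment

# autoescape=True so the surrounding text is HTML-escaped as well.
env = Environment(autoescape=True)
template = env.from_string("{{ tweet|urlize }}")

html = template.render(tweet="This is my tweet check it out http://example.com/blah")
print(html)
```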

answered Oct 25, 2012 at 22:57

jmoz



dfrankow · Accepted Answer · 2024-01-04 21:53:18Z

2

Have a look at urlextract.

You can install it by running: pip install urlextract

from urlextract import URLExtract

extractor = URLExtract()
urls = extractor.find_urls("Text with URLs. Let's have URL janlipovsky.cz as an example.")
print(urls) # prints: ['janlipovsky.cz']

The main advantage is that urlextract will find URLs even when no scheme (http, ftp, etc.) is specified. It also has a lot of configuration options to tune the extractor to fit your needs. Everything can be found in the documentation.

edited Jan 4, 2024 at 21:53

dfrankow


answered Jan 2, 2023 at 14:04

Jan Lipovský


2 Comments

dfrankow Over a year ago

I like this library. As of this moment, it was last updated Dec 2022.

Jan Lipovský Over a year ago

I know - lack of free time last year - I do not have it as much as years before - I welcome any PR with improvements and bug fixes. I hope that I will find time for maintenance this year :)
