How to detect and remove a link inside a string in python3

Question

I have strings which may (or may not) contain links. If a link exists, it is surrounded by [link] [/link] tokens. I would like to replace those parts with a special token such as URL and return the corresponding link.

Example

Let's assume the function detect_link does this:

>input= 'The statement [link] http://www.washingtontimes.com/news/2017/sep/9/rob-ranco-texas-lawyer-says-he-would-be-ok-if-bets/ [/link] The Washington Times'
>replacement_token = "URL"
>link,new_sentence = detect_link(input,replacement_token)
>link
'http://www.washingtontimes.com/news/2017/sep/9/rob-ranco-texas-lawyer-says-he-would-be-ok-if-bets/'
>new_sentence
'The statement URL The Washington Times'

I searched a little and found that regular expressions can be used to do that. However, I do not have any experience with them. Can someone help me with that?

EDIT: Links don't follow any fixed pattern. They may or may not start with http, and they may or may not end with .com, etc.

@Xorifelse It seems to work. Do you know how I can integrate it into Python code?
– zwlayer, Commented Oct 20, 2018 at 10:37

Patrick Artner · Accepted Answer · 2018-10-20 13:36:43Z

You need a regex pattern for that. I use http://www.regex101.com to play around with regexes.

You can use such a pattern to extract the tagged parts and to replace them, like so:

import re

text = 'The statement [link] http://www.washingtontimes.com/news/2017/sep/9/rob-ranco-texas-lawyer-says-he-would-be-ok-if-bets/ [/link] The Washington Times'

# print what matched (the text between the tags)
for mat in re.findall(r"\[link\](.*?)\[/link\]", text):
    print(mat)

# replace each match with something else
print(re.sub(r"\[link\](.*?)\[/link\]", "[URL]", text))

Output:

http://www.washingtontimes.com/news/2017/sep/9/rob-ranco-texas-lawyer-says-he-would-be-ok-if-bets/ 

The statement [URL] The Washington Times
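To get the detect_link function from the question, the two calls above can be wrapped in one function. This is only a minimal sketch: the name LINK_PATTERN, the choice to return (None, text) when no link is present, and stripping the spaces around the captured link are my assumptions, not something the question specifies.

```python
import re

# Same pattern as above, compiled once so it can be reused.
LINK_PATTERN = re.compile(r"\[link\](.*?)\[/link\]")

def detect_link(text, replacement_token):
    """Return (link, new_text) for the first [link]...[/link] pair.

    Assumption: if no link is found, return (None, text) unchanged.
    """
    match = LINK_PATTERN.search(text)
    if match is None:
        return None, text
    link = match.group(1).strip()  # drop the spaces around the link
    # A function as replacement keeps backslashes in the token literal.
    new_text = LINK_PATTERN.sub(lambda m: replacement_token, text, count=1)
    return link, new_text

sentence = ('The statement [link] http://www.washingtontimes.com/news/2017/sep/9/'
            'rob-ranco-texas-lawyer-says-he-would-be-ok-if-bets/ [/link] '
            'The Washington Times')
link, new_sentence = detect_link(sentence, "URL")
print(link)
print(new_sentence)  # The statement URL The Washington Times
```

Only the tags and the text between them are replaced, so the spaces outside the tags stay in place and the result matches the sentence expected in the question.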

The pattern I use is non-greedy, so if several [link]...[/link] parts occur in one sentence it matches each of them separately (the shortest possible span) instead of everything from the first [link] to the last [/link]:

\[link\](.*?)\[/link\]   - matches a literal [ followed by link followed by a literal ],
                           then as few characters as possible up to the end tag [/link]

With a greedy match you would get only one replacement for the whole of

The statement [link] http://www.washingtontimes.com/news/2017/sep/9/rob-ranco-texas-lawyer-says-he-would-be-ok-if-bets/ [/link] and this also [link] http://www.washingtontimes.com/news/2017/sep/9/rob-ranco-texas-lawyer-says-he-would-be-ok-if-bets/ [/link] The Washington Times

instead of two.
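One way to check this yourself is re.subn, which returns the number of replacements along with the new string. The shortened placeholder text below ("link 1", "link 2") is made up for illustration.

```python
import re

# Placeholder sentence with two tagged links on one line.
text = ("The statement [link] link 1 [/link] and this also "
        "[link] link 2 [/link] The Washington Times")

# Greedy: (.*) runs from the first [link] to the last [/link].
greedy, n_greedy = re.subn(r"\[link\](.*)\[/link\]", "[URL]", text)
# Lazy: (.*?) stops at the nearest [/link].
lazy, n_lazy = re.subn(r"\[link\](.*?)\[/link\]", "[URL]", text)

print(n_greedy, greedy)  # 1 The statement [URL] The Washington Times
print(n_lazy, lazy)      # 2 The statement [URL] and this also [URL] The Washington Times
```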

Find all links:

import re
text = """
The statement [link] link 1 [/link] and [link] link 2 [/link] The Washington Times
The statement [link] link 3 [/link] and [link] link 4 [/link] The Washington Times
"""

# get what matched
links = re.findall(r"\[link\](.*)\[/link\]",text)        # greedy pattern
links_lazy = re.findall(r"\[link\](.*?)\[/link\]",text)  # lazy pattern

Output:

# greedy
[' link 1 [/link] and [link] link 2 ', 
 ' link 3 [/link] and [link] link 4 ']
# lazy
[' link 1 ', ' link 2 ', ' link 3 ', ' link 4 ']

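The difference only shows up within a single line, because by default . does not match newlines, so even the greedy (.*) stops at the end of each line. Within one line it still swallows everything from the first [link] to the last [/link], so if a sentence contains multiple links you need the lazy (.*?) to get each link as its own match instead of the whole part matched at once.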
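If you want all links plus the cleaned sentence in one call, the lazy pattern can be wrapped like this. It is a sketch only: the name detect_links and stripping the spaces around each link are my assumptions.

```python
import re

# Lazy pattern, so each [link]...[/link] pair is matched on its own.
LINK_PATTERN = re.compile(r"\[link\](.*?)\[/link\]")

def detect_links(text, replacement_token):
    """Return (links, new_text): every tagged link and the text with all of them replaced."""
    links = [found.strip() for found in LINK_PATTERN.findall(text)]
    # A function as replacement keeps backslashes in the token literal.
    new_text = LINK_PATTERN.sub(lambda m: replacement_token, text)
    return links, new_text

text = "The statement [link] link 1 [/link] and [link] link 2 [/link] The Washington Times"
links, new_text = detect_links(text, "URL")
print(links)     # ['link 1', 'link 2']
print(new_text)  # The statement URL and URL The Washington Times
```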

edited Oct 20, 2018 at 13:36

answered Oct 20, 2018 at 10:38


2 Comments

zwlayer Over a year ago

Thank you for your response. What should I do if I want to capture all the links when there is more than one in a sentence?

Patrick Artner Over a year ago

@zwlayer see edit - with a lazily evaluated pattern (.*?) instead of (.*) this will work
