I have strings that may or may not contain links. If a link exists, it is surrounded by [link] and [/link] tokens. I would like to replace those parts with a special token such as URL and return the corresponding link.

Example

Let's assume the function detect_link does this:

>input= 'The statement [link] http://www.washingtontimes.com/news/2017/sep/9/rob-ranco-texas-lawyer-says-he-would-be-ok-if-bets/ [/link] The Washington Times'
>replacement_token = "URL"
>link,new_sentence = detect_link(input,replacement_token)
>link
'http://www.washingtontimes.com/news/2017/sep/9/rob-ranco-texas-lawyer-says-he-would-be-ok-if-bets/'
>new_sentence
'The statement URL The Washington Times'

I searched a bit and found that regular expressions can be used for this. However, I do not have any experience with them. Can someone help me with that?

EDIT: Links don't follow any constant pattern. A link may or may not start with http, and it may or may not end with .com, etc.

  • Would this do? regex101.com/r/9zpgGy/1 Commented Oct 20, 2018 at 10:36
  • @Xorifelse It seems to work. Do you know how I can integrate it into Python code? Commented Oct 20, 2018 at 10:37
  • By using re, explained below. Commented Oct 20, 2018 at 10:45

1 Answer

You need a regex pattern for that. I use http://www.regex101.com to play around with regexes.

You can use that pattern to extract things and replace things like so:

import re

text = 'The statement [link] http://www.washingtontimes.com/news/2017/sep/9/rob-ranco-texas-lawyer-says-he-would-be-ok-if-bets/ [/link] The Washington Times'

# print the text captured between the tags
for mat in re.findall(r"\[link\](.*?)\[/link\]",text):
    print(mat)

# replace each match with something else
print( re.sub(r"\[link\](.*?)\[/link\]","[URL]",text))

Output:

http://www.washingtontimes.com/news/2017/sep/9/rob-ranco-texas-lawyer-says-he-would-be-ok-if-bets/ 

The statement [URL] The Washington Times
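
The two calls above can be combined into the detect_link function from the question. The following is only a minimal sketch under a few assumptions: the tags are exactly [link] and [/link], whitespace just inside the tags is padding rather than part of the link, and the function returns a list of links (empty if there are none) instead of a single string. The example URL is a placeholder.

```python
import re

# Assumption: whitespace directly inside the tags is padding, so \s* on both
# sides keeps it out of the captured link.
LINK_PATTERN = re.compile(r"\[link\]\s*(.*?)\s*\[/link\]")

def detect_link(sentence, replacement_token):
    """Return (links, new_sentence) for a sentence with [link]...[/link] parts."""
    links = LINK_PATTERN.findall(sentence)
    # A function as the replacement inserts the token literally, so a
    # backslash in the token is not treated as an escape sequence.
    new_sentence = LINK_PATTERN.sub(lambda match: replacement_token, sentence)
    return links, new_sentence

# Placeholder URL, used only for illustration.
text = "The statement [link] http://example.com/a [/link] The Washington Times"
links, new_sentence = detect_link(text, "URL")
print(links)
print(new_sentence)
```

A string without any [link] part comes back unchanged, together with an empty list.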

The pattern I use is non-greedy, so it won't span multiple [link]...[/link] parts if they occur in one sentence; each match is the shortest possible one:

\[link\](.*?)\[/link\]   - matches a literal [ followed by link followed by a literal ],
                           then captures as few characters as possible before the end tag [/link]

With a greedy pattern you get only one replacement for the whole of

The statement [link] http://www.washingtontimes.com/news/2017/sep/9/rob-ranco-texas-lawyer-says-he-would-be-ok-if-bets/ [/link] and this also [link] http://www.washingtontimes.com/news/2017/sep/9/rob-ranco-texas-lawyer-says-he-would-be-ok-if-bets/ [/link] The Washington Times

instead of two.
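
One way to check the number of replacements is re.subn, which returns the new string together with the replacement count. This is a small sketch; the text is a shortened stand-in for the sentence above, with "link A" and "link B" as placeholder links.

```python
import re

# Shortened stand-in for the two-link sentence; the links are placeholders.
text = ("The statement [link] link A [/link] and this also "
        "[link] link B [/link] The Washington Times")

# Greedy: (.*) runs from the first [link] to the last [/link].
greedy_result, greedy_count = re.subn(r"\[link\](.*)\[/link\]", "[URL]", text)

# Lazy: (.*?) stops at the first [/link] it can reach.
lazy_result, lazy_count = re.subn(r"\[link\](.*?)\[/link\]", "[URL]", text)

print(greedy_count, greedy_result)
print(lazy_count, lazy_result)
```

The greedy pattern reports one replacement and swallows the words between the two links; the lazy pattern reports two and keeps them.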


To find all links:

import re
text = """
The statement [link] link 1 [/link] and [link] link 2 [/link] The Washington Times
The statement [link] link 3 [/link] and [link] link 4 [/link] The Washington Times
"""

# collect what matched
links = re.findall(r"\[link\](.*)\[/link\]",text)        # greedy pattern
links_lazy = re.findall(r"\[link\](.*?)\[/link\]",text)  # lazy pattern
print(links)
print(links_lazy)

Output:

# greedy
[' link 1 [/link] and [link] link 2 ', 
 ' link 3 [/link] and [link] link 4 ']
# lazy
[' link 1 ', ' link 2 ', ' link 3 ', ' link 4 ']

The greedy result is split per line only because the text contains newlines: . does not match a newline, so (.*) cannot run past the end of a line. Within a single line, however, (.*) runs from the first [link] to the last [/link]. If a sentence contains multiple links, you need the lazy (.*?) to get each link as its own match instead of one match covering the whole part.
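
Because . does not match newlines by default, a link that is itself wrapped across two lines would not be found at all. If that can happen in your data (an assumption; the question does not say so), the re.DOTALL flag lets . match newlines as well. A minimal sketch with a made-up input:

```python
import re

# Hypothetical input: the link is wrapped onto a second line.
text = "The statement [link] http://example.com/\nsome-page [/link] The Washington Times"

pattern = r"\[link\](.*?)\[/link\]"

# Default: . stops at the newline, so the end tag is never reached.
without_flag = re.findall(pattern, text)

# re.DOTALL: . also matches the newline, so the wrapped link is captured.
with_flag = re.findall(pattern, text, flags=re.DOTALL)

print(without_flag)
print(with_flag)
```

With re.DOTALL the lazy (.*?) matters even more, since a greedy (.*) could then run across lines up to the last [/link] in the whole text.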

2 Comments

Thank you for your response. What should I do if I want to capture all the links when there is more than one in a sentence?
@zwlayer See the edit: with a lazily evaluated pattern (.*?) instead of (.*) this will work.
