I'm cleaning some text from Reddit. When you include a link in a Reddit self-text, you do so like this: [the text you read](https://website.com/to/go/to). I'd like to use regex to remove the hyperlink (e.g. https://website.com/to/go/to) but keep the text you read.

Here is another example:

[the podcast list](https://www.reddit.com/r/datascience/wiki/podcasts)

I'd like to keep: the podcast list.

How can I do this with Python's re library? What is the appropriate regex?

1 Answer

I have created an initial attempt at your requested regex:

(?<=\[.+\])\(.+\)

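The first part, (?<=...), is a lookbehind: it requires the bracketed text to appear immediately before the match but does not include it in the match. You can use this regex with a substitution function such as re's sub method. Note, however, that Python's built-in re module only accepts fixed-width lookbehinds, so it rejects this pattern as written; engines that allow variable-width lookbehind (for example the third-party regex module) accept it. The documentation for Python's re module lists the meanings of all the regex symbols.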

You can extend the above regex to look for only things that have weblinks in the brackets, like so:

(?<=\[.+\])\(https?:\/\/.+\)

The problem with this is that it will miss any link that does not start with http or https.

After this you will still need to remove the square brackets; simply removing all square brackets may work fine for you.
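A minimal sketch of this two-step approach in Python, under one assumption that is not part of the patterns above: because the standard-library re module rejects variable-width lookbehinds, the lookbehind is shortened to the closing bracket alone, (?<=\]), which is fixed-width. The variable names are only illustrative.

```python
import re

text = "[the podcast list](https://www.reddit.com/r/datascience/wiki/podcasts)"

# The variable-width lookbehind from above is rejected by the standard re module.
try:
    re.compile(r"(?<=\[.+\])\(.+\)")
    lookbehind_supported = True
except re.error:
    lookbehind_supported = False

# Assumed workaround: look behind for the closing bracket only (fixed width),
# then delete the parenthesised URL that follows it.
without_url = re.sub(r"(?<=\])\(.+?\)", "", text)

# Second step: strip the remaining square brackets.
cleaned = without_url.replace("[", "").replace("]", "")

print(lookbehind_supported)
print(cleaned)
```

Stripping every square bracket also removes brackets that are not part of a link, so this is only suitable when the text contains no other brackets you want to keep.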


Edit 1:

Valentino pointed out that re.sub accepts capturing groups, which lets you capture the link text and substitute it back in using the following regex:

\[(.+)\]\(.+\)

You can then substitute the first captured group (in the square brackets) back in using:

re.sub(r"\[(.+)\]\(.+\)", r"\1", original_text)
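As a quick check, here is that call applied to the example from the question. The value of original_text is just the sample link; nothing else is assumed.

```python
import re

original_text = "[the podcast list](https://www.reddit.com/r/datascience/wiki/podcasts)"

# Group 1 captures the text inside the square brackets;
# the whole match (brackets plus parenthesised URL) is replaced by that group.
result = re.sub(r"\[(.+)\]\(.+\)", r"\1", original_text)

print(result)  # the podcast list
```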

If you are new to regex or want to look at this one in more detail, I would recommend an online regex interpreter. These tools explain what each symbol does, which makes a pattern much easier to read, especially when it contains as many escaped symbols as this one.


3 Comments

Also, with re.sub() the lookbehind can be avoided: re.sub(r"\[(.+)\]\(.+\)", r"\1", original_text) replaces the whole match with the contents of the square brackets. Just another way to do it, similar to yours.
Thanks @Valentino, added!
This solution only works for up to 1 link in original_text, because .+ is greedy and matches as much as it can. Matching the smallest possible group inside the brackets and parentheses instead makes it work nicely for more than 1 link as well: \[(.+?)\]\(.+?\)
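To illustrate the difference, here is a small sketch comparing the greedy and non-greedy patterns on a string with two links. The sample sentence is made up from the two links in the question.

```python
import re

text = (
    "See [the podcast list](https://www.reddit.com/r/datascience/wiki/podcasts) "
    "and [the text you read](https://website.com/to/go/to)."
)

# Greedy: .+ runs to the last "]" in the string, so the first URL and the
# second opening bracket end up inside the captured group.
greedy = re.sub(r"\[(.+)\]\(.+\)", r"\1", text)

# Non-greedy: .+? stops at the first "]" and the first ")", so each link
# is matched and replaced separately.
non_greedy = re.sub(r"\[(.+?)\]\(.+?\)", r"\1", text)

print(greedy)
print(non_greedy)  # See the podcast list and the text you read.
```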
