Python split string by urls with and without a href

Question

My script currently works as expected: it splits a string by URL, keeps the surrounding text, and puts everything in a list:

import re
s = 'This is my tweet check it out http://www.example.com/blah and http://blabla.com'
result = re.split(r'(https?://\S+)', s)
print(result)

Output:

['This is my tweet check it out ', 'http://www.example.com/blah', ' and ', 'http://blabla.com', '']

Now I'm stuck on another problem: sometimes I get URLs as HTML, or as mixed text and HTML, and the URLs look like this:

<a href="http://www.example.com/full/path/to/product/">https://shorted.com/FJAKS</a>

The href holds the full URL, and the text between <a>...</a> is the shortened URL.

So I can receive a string like this to manipulate:

s = 'This is an html link: <a href="http://www.example.com/full/path/to/product/">https://shorted.com/FJAKS</a> and this is a text url: http://blabla.com'

I'd like to keep the same logic in my function, but if I use:

result = re.split(r'(https?://\S+)', s)
print(result)

as before, I get this (WRONG):

['This is an html link: <a href="', 'http://www.example.com/full/path/to/product/">https://shorted.com/FJAKS</a>', ' and this is a text url: ', 'http://blabla.com', '']

But I'd like to get a result like this (if it is HTML, keep the whole a tag):

Expected output:

['This is an html link: ', '<a href="http://www.example.com/full/path/to/product/">https://shorted.com/FJAKS</a>', ' and this is a text url: ', 'http://blabla.com', '']

CrazyChucky · Accepted Answer · 2020-05-14 06:09:11Z


Try:

import re
s = 'This is an html link: <a href="http://www.example.com/full/path/to/product/">https://shorted.com/FJAKS</a> and this is a text url: http://blabla.com'
result = re.split(r'((?:<a href=")?https?://\S+[^\s,.:;])', s)
print(result)

The key is the addition of (?:<a href=")?. The (?:...) syntax is a non-capturing group; it's useful here so that the ? applies to the entire group instead of a single character.
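As a rough illustration of why the inner group is non-capturing: re.split adds the text of every capturing group to its output, so a second capturing group around the prefix would leave stray items in the list. The short sample string and the variable names below are made up for the demo and are not from the question.

```python
import re

# Hypothetical sample string, shortened for readability.
s = 'See <a href="http://a.com/x">http://b.co/y</a> or http://c.com'

# Inner group is capturing: re.split also emits what it matched,
# or None when the optional group did not take part in the match.
with_capture = re.split(r'((<a href=")?https?://\S+)', s)

# Inner group is non-capturing: only the outer group's text is emitted.
without_capture = re.split(r'((?:<a href=")?https?://\S+)', s)

print(with_capture)
print(without_capture)
```

In the first list each URL is followed by an extra entry (the '<a href="' prefix or None); the second list contains only the text pieces and the full matches.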

Note: a URL at the very beginning or end of the string produces an empty string in the list. If you'd like to remove those, try:

result = list(filter(None, result))
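A minimal sketch of that cleanup, assuming a made-up input that starts with a URL (the string and the names parts, cleaned and filtered are illustrative only). A list comprehension is shown next to filter, since both drop the empty strings:

```python
import re

# Hypothetical input that begins with a URL, so the split starts with ''.
s = 'http://blabla.com is a text url'
parts = re.split(r'((?:<a href=")?https?://\S+[^\s,.:;])', s)

# filter(None, ...) keeps only truthy items, which drops empty strings.
filtered = list(filter(None, parts))

# Equivalent list comprehension.
cleaned = [part for part in parts if part]

print(parts)
print(filtered)
```

Either form keeps the order of the remaining items, so the URLs and text pieces still alternate as before.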

EDIT: Added [^\s,.:;] to the end of the pattern. The leading ^ negates the character class, so the match must end on a character that is not whitespace or one of , . : ;. This stops links from gobbling up punctuation directly after them, like commas.
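To see the effect of the trailing character class, here is a small comparison on the longer string reported in the comments. The names old_parts and new_parts are only for this sketch.

```python
import re

s = ('This is an html link: '
     '<a href="http://www.example.com/full/path/to/product/">https://shortedlink.com</a>'
     ' and this is a text url: http://blabla.com, another https://www.blabla.com'
     ' and another <a href="http://www.blabla.com/test">TEST</a>')

# Without the trailing class, \S+ swallows the comma after the URL.
old_parts = re.split(r'((?:<a href=")?https?://\S+)', s)

# With it, the match has to end on a character that is not
# whitespace or one of , . : ; so the comma stays in the text piece.
new_parts = re.split(r'((?:<a href=")?https?://\S+[^\s,.:;])', s)

print(old_parts)
print(new_parts)
```

One limitation worth keeping in mind: \S+ stops at the first whitespace, so this pattern only captures a complete a tag when neither the tag's attributes nor its link text contain spaces.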

edited May 14, 2020 at 6:09

answered Apr 28, 2020 at 14:46


5 Comments

ywbaek Over a year ago

re.split(r'((?:<a)?(?: href=")?https?://\S+)', s) to take care of the shortened form as well.

Giuseppe Lodi Rizzini Over a year ago

Thanks to all. I'm not good with regular expressions. Can anyone suggest a good tutorial for complete beginners?

CrazyChucky Over a year ago

It may seem intimidatingly thorough, but honestly I heartily recommend the official Python documentation: docs.python.org/3/library/re.html

itajackass Over a year ago

@CrazyChucky Hi, sorry, I tried your code with another string but I get a small error. The string is:

This is an html link: <a href="http://www.example.com/full/path/to/product/">https://shortedlink.com</a> and this is a text url: http://blabla.com, another https://www.blabla.com and another <a href="http://www.blabla.com/test">TEST</a>

...and I get this as a URL: http://blabla.com, <--- see the comma

CrazyChucky Over a year ago

@itajackass Edited.
