My script currently works as expected: it splits a string by URL, keeps the surrounding text, and puts everything in a list:
import re
s = 'This is my tweet check it out http://www.example.com/blah and http://blabla.com'
result = re.split(r'(https?://\S+)', s)
print(result)
Output:
['This is my tweet check it out ', 'http://www.example.com/blah', ' and ', 'http://blabla.com', '']
Now I'm stuck on another problem: sometimes I get URLs as HTML, or as mixed text and HTML, and the URLs look like this:
<a href="http://www.example.com/full/path/to/product/">https://shorted.com/FJAKS</a>
The href holds the full URL, and the text between <a>...</a> is the shortened URL.
So I can receive a string like this to process:
s = 'This is an html link: <a href="http://www.example.com/full/path/to/product/">https://shorted.com/FJAKS</a> and this is a text url: http://blabla.com'
I'd like my function to apply the same logic, but if I use:
result = re.split(r'(https?://\S+)', s)
print(result)
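like before, \S+ keeps matching until the next whitespace, so the split starts inside the href and swallows the rest of the tag. I get this (wrong):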
['This is an html link: <a href="', 'http://www.example.com/full/path/to/product/">https://shorted.com/FJAKS</a>', ' and this is a text url: ', 'http://blabla.com', '']
What I'd like to get instead is this (if the URL is part of an HTML link, keep the whole <a> tag as one item):
Expected output:
['This is an html link: ', '<a href="http://www.example.com/full/path/to/product/">https://shorted.com/FJAKS</a>', ' and this is a text url: ', 'http://blabla.com', '']
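One direction that might work is a minimal sketch like the one below. It assumes <a> tags are never nested and that attribute values never contain a '>' character. The pattern tries to match a complete <a ...>...</a> element first and only falls back to a bare URL when no anchor starts at that position. The names URL_OR_ANCHOR and split_urls are made up for illustration. For arbitrary or malformed HTML a real parser would probably be more reliable than a regex.

```python
import re

# Alternation order matters: a whole <a ...>...</a> element is tried first,
# and the bare-URL branch is only used where no anchor starts.
# Assumption: anchors are not nested and attribute values contain no '>'.
URL_OR_ANCHOR = re.compile(
    r'(<a\s[^>]*>.*?</a>|https?://\S+)',
    re.IGNORECASE | re.DOTALL,
)

def split_urls(text):
    """Split text on anchors and bare URLs, keeping the matches in the list."""
    # The capturing group makes re.split return the delimiters as well.
    return URL_OR_ANCHOR.split(text)

s = ('This is an html link: '
     '<a href="http://www.example.com/full/path/to/product/">https://shorted.com/FJAKS</a>'
     ' and this is a text url: http://blabla.com')
print(split_urls(s))
```

Because the regex engine scans left to right, it reaches the opening <a before the http inside the href, so the anchor branch takes the whole tag. A plain-text string with no HTML still splits the same way as with the original pattern.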