My script currently works as expected: it splits a string on URLs, keeps the surrounding text, and puts everything in a list:

import re
s = 'This is my tweet check it out http://www.example.com/blah and http://blabla.com'
result = re.split(r'(https?://\S+)', s)
print(result)

Output:

['This is my tweet check it out ', 'http://www.example.com/blah', ' and ', 'http://blabla.com', '']

Now I'm stuck on another problem: sometimes the URLs arrive as HTML, or as mixed text and HTML, and they look like this:

<a href="http://www.example.com/full/path/to/product/">https://shorted.com/FJAKS</a>

The href holds the full URL, and the text between <a> and </a> is the shortened URL.

So I can receive a string like this to manipulate:

s = 'This is an html link: <a href="http://www.example.com/full/path/to/product/">https://shorted.com/FJAKS</a> and this is a text url: http://blabla.com'

I'd like to keep the same logic in my function, but if I use:

result = re.split(r'(https?://\S+)', s)
print(result)

like before, I get this (WRONG):

['This is an html link: <a href="', 'http://www.example.com/full/path/to/product/">https://shorted.com/FJAKS</a>', ' and this is a text url: ', 'http://blabla.com', '']

But I'd like to get this instead (if the URL is inside HTML, keep the whole a tag):

Expected output:

['This is an html link: ', '<a href="http://www.example.com/full/path/to/product/">https://shorted.com/FJAKS</a>', ' and this is a text url: ', 'http://blabla.com', '']

1 Answer

Try:

import re
s = 'This is an html link: <a href="http://www.example.com/full/path/to/product/">https://shorted.com/FJAKS</a> and this is a text url: http://blabla.com'
result = re.split(r'((?:<a href=")?https?://\S+[^\s,.:;])', s)
print(result)

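The key is the addition of (?:<a href=")?. (?:...) is a non-capturing group; it makes the trailing ? apply to that entire unit instead of a single character, without adding an extra item to the split result. Because the rest of the tag contains no whitespace, \S+ then runs on through the closing </a>.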
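
To make the role of the optional non-capturing group a bit more concrete, here is a small sketch comparing three patterns on the string from the question. The variable names (plain, with_prefix, capturing) are just illustrative, and the trailing-punctuation part of the pattern is left out to keep the comparison focused.

```python
import re

s = ('This is an html link: '
     '<a href="http://www.example.com/full/path/to/product/">https://shorted.com/FJAKS</a>'
     ' and this is a text url: http://blabla.com')

# Original pattern: the match can only start at "http", so the tag is cut in two.
plain = re.split(r'(https?://\S+)', s)

# Optional non-capturing prefix: the match may start at '<a href="' when present.
with_prefix = re.split(r'((?:<a href=")?https?://\S+)', s)

# Same prefix as a capturing group: re.split now also emits that inner group,
# which is None wherever the prefix did not take part in the match.
capturing = re.split(r'((<a href=")?https?://\S+)', s)

print(plain[0])        # text before the first match still ends with '<a href="'
print(with_prefix[1])  # the whole <a ...>...</a> element
print(capturing)       # contains the extra '<a href="' and None items
```

The last variant is why (?:...) is preferable here: every capturing group in the pattern contributes an item to the list that re.split returns.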

Note: a URL at the beginning or end of the string produces an empty string in the list. If you'd like to remove those, try:

result = list(filter(None, result))
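
As a quick sketch of what that cleanup does: filter with None as its first argument keeps only truthy items, so the empty strings are dropped. The sample string and the names parts, cleaned, and same below are made up for illustration; a list comprehension gives the same result if you find it more readable.

```python
import re

# A string that starts with a URL, so re.split yields a leading empty string.
s = 'http://blabla.com is a text url'
parts = re.split(r'(https?://\S+)', s)
print(parts)    # ['', 'http://blabla.com', ' is a text url']

# filter(None, ...) keeps only truthy items, which removes '' entries.
cleaned = list(filter(None, parts))
print(cleaned)  # ['http://blabla.com', ' is a text url']

# Equivalent list comprehension.
same = [p for p in parts if p]
```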

EDIT: Added [^\s,.:;] to the end of the pattern. This is a negated character class: the match must end on a character that is neither whitespace nor one of , . : ;. The regex engine therefore backs off any trailing punctuation, which stops links from gobbling up punctuation directly after them, like commas.
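
Here is a small sketch of the trailing-punctuation behaviour, using a shortened, made-up variant of the string from the comments. It should show the comma and the final period staying in the surrounding text rather than in the URLs.

```python
import re

pattern = r'((?:<a href=")?https?://\S+[^\s,.:;])'

# Two plain URLs, each directly followed by punctuation.
s = 'a text url: http://blabla.com, another https://www.blabla.com.'
parts = re.split(pattern, s)
print(parts)
# ['a text url: ', 'http://blabla.com', ', another ', 'https://www.blabla.com', '.']
```

\S+ first grabs the punctuation too, but the final character class cannot match it, so the engine backtracks until the match ends on an allowed character.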

5 Comments

re.split(r'((?:<a)?(?: href=")?https?://\S+)', s) to take care of the shortened form as well.
Thanks to all. I'm not good with regular expressions. Can anyone suggest a good tutorial for complete beginners?
It may seem intimidatingly thorough, but honestly I heartily recommend the official Python documentation: docs.python.org/3/library/re.html
@CrazyChucky Hi, sorry, I tried your code with another string but I get a small error. The string is: This is an html link: <a href="http://www.example.com/full/path/to/product/">https://shortedlink.com</a> and this is a text url: http://blabla.com, another https://www.blabla.com and another <a href="http://www.blabla.com/test">TEST</a> ..... I get this as a URL: http://blabla.com, <--- see the comma
@itajackass Edited.
