Split a string but keep the delimiter in the same resulting substring in Python

Question

I have a string containing URLs:

string = "https://www.link1.net/abc/cik?xai=En8MmT__aF_nQm-F48&sig=Cg0A7_5AE&urlfix=1&;ccurl=https://aax-us.link-two.com/x/c/Qoj_sZnkA%2526adurl%253Dhttp%253A%252F%252Fwww.link-three.mu%252F"

I want to extract all of them to have a result like this:

['https://www.link1.net/abc/cik?xai=En8MmT__aF_nQm-F48&sig=Cg0A7_5AE&urlfix=1&;ccurl=','https://aax-us.link-two.com/x/c/Qoj_sZnkA%2526adurl%253D','http%253A%252F%252Fwww.link-three.mu%252F']

I am trying:

import re

urls = [x for x in re.split('(http[s]?)', string) if x]
print(urls)

And the result is:

['https', '://www.link1.net/abc/cik?xai=En8MmT__aF_nQm-F48&sig=Cg0A7_5AE&urlfix=1&;ccurl=', 'https', '://aax-us.link-two.com/x/c/Qoj_sZnkA%2526adurl%253D', 'http', '%253A%252F%252Fwww.link-three.mu%252F']

How can I keep each complete URL together, given that it can start with 'http' or 'https'?

Any ideas please?

Use a lookahead, (?=http). There is also no need to put s in a character set ([s]): a lone s has no special meaning and is matched literally. Nor do you need to check for the s at all, since every https already begins with http, so looking for http alone satisfies your requirement.
– ctwheels, Commented Feb 7, 2018 at 20:25
Which URL or URLs are you trying to match? Where do they end? Do you consider the one starting with http%253a a valid URL?
– Jongware, Commented Feb 7, 2018 at 20:28
This is a single URL: https://aax-us.link-two.com/x/c/Qoj_sZnkA%2526adurl%253Dhttp%253A%252F%252Fwww.link-three.mu%252F
– user557597, Commented Feb 7, 2018 at 20:47
Yes, you're right, http is all I really need. The whole string comes from a URL redirection scheme from which I need to extract every URL in the chain. I am now decoding the URLs before splitting them, so they all take the valid http:// form.
– Adrian, Commented Feb 16, 2018 at 15:40
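
The lookahead suggested in the comments can be tried directly with re.split. The following is only a minimal sketch, and it assumes Python 3.7 or later, where re.split() is able to split on zero-width matches (older versions do not split on an empty match). The sample string is the one from the question.

```python
import re

string = ("https://www.link1.net/abc/cik?xai=En8MmT__aF_nQm-F48&sig=Cg0A7_5AE&urlfix=1&;ccurl="
          "https://aax-us.link-two.com/x/c/Qoj_sZnkA%2526adurl%253D"
          "http%253A%252F%252Fwww.link-three.mu%252F")

# (?=http) is a zero-width lookahead: it matches the position just before
# each "http" without consuming it, so the delimiter stays in the chunk
# that follows. Assumption: Python 3.7+ (splitting on empty matches).
urls = [part for part in re.split(r"(?=http)", string) if part]

print(urls)
```

The `if part` filter drops the empty string that the split produces when the input itself starts with http.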

Emre Külah · Accepted Answer · 2018-02-07 20:43:19Z

2

Without using re, you can handle this problem as follows:

['http' + x for x in filter(lambda x: x, string.split('http'))]

The result will be:

['https://www.link1.net/abc/cik?xai=En8MmT__aF_nQm-F48&sig=Cg0A7_5AE&urlfix=1&;ccurl=', 'https://aax-us.link-two.com/x/c/Qoj_sZnkA%2526adurl%253D', 'http%253A%252F%252Fwww.link-three.mu%252F']
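
One caveat worth spelling out: the one-liner prepends 'http' to every non-empty chunk, so it implicitly assumes the string starts with 'http'. If there may be text before the first URL, a variant along these lines could be used. This is only a sketch; split_keeping_delimiter is a hypothetical helper name, not something from the answer.

```python
def split_keeping_delimiter(text, delimiter="http"):
    """Split text before each occurrence of delimiter, keeping the delimiter.

    Hypothetical helper: whatever precedes the first delimiter is returned
    unchanged instead of having the delimiter glued onto it.
    """
    head, *rest = text.split(delimiter)
    pieces = [delimiter + chunk for chunk in rest]
    # Only the leading chunk can lack the delimiter; keep it if non-empty.
    return ([head] if head else []) + pieces


sample = "see http://a.example and https://b.example"  # made-up input
print(split_keeping_delimiter(sample))
```

For a string that does start with 'http', and in which no two delimiters are directly adjacent, this returns the same list as the one-liner above.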

answered Feb 7, 2018 at 20:43


3 Comments

colopop Over a year ago

I believe string methods are typically faster than re, so this is a better solution than the others presented so far.

Jean-François Fabre Over a year ago

use filter(None, string.split('http')), it's even cleaner. Otherwise good alternative to regex

Adrian Over a year ago

Yes, faster indeed!
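
The speed claim in these comments is easy to check on your own data. Below is a rough timing sketch using timeit; the function names with_split and with_findall are made up for illustration, the regex is the findall pattern proposed in the other answer, and the numbers will vary with input size and Python version, so no timings are quoted here.

```python
import re
import timeit

string = ("https://www.link1.net/abc/cik?xai=En8MmT__aF_nQm-F48&sig=Cg0A7_5AE&urlfix=1&;ccurl="
          "https://aax-us.link-two.com/x/c/Qoj_sZnkA%2526adurl%253D"
          "http%253A%252F%252Fwww.link-three.mu%252F")

PATTERN = re.compile(r"https?.*?(?=https?|$)")


def with_split(text):
    # String-method approach, using filter(None, ...) as suggested above.
    return ["http" + chunk for chunk in filter(None, text.split("http"))]


def with_findall(text):
    # Regex approach with a lookahead on the next URL or end of string.
    return PATTERN.findall(text)


# Both approaches should agree on the sample before we compare speed.
assert with_split(string) == with_findall(string)

for func in (with_split, with_findall):
    seconds = timeit.timeit(lambda: func(string), number=20_000)
    print(f"{func.__name__}: {seconds:.3f} s for 20,000 calls")
```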

Jean-François Fabre · Answer · 2018-02-07 20:41:01Z

1

You could take your result and join each pair of consecutive items; that would work:

urls = [urls[i]+urls[i+1] for i in range(0,len(urls),2)]
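
For reference, here is the pairing idea as a self-contained sketch. It uses a short made-up input rather than the question's string, and it relies on two assumptions: the text starts with the delimiter, and every delimiter is followed by a non-empty remainder. If either fails, the pairs shift out of step.

```python
import re

string = "https://a.example/x?next=http://b.example/y"  # hypothetical short input

# A capture group makes re.split() return the delimiters as list items too.
parts = [x for x in re.split(r"(https?)", string) if x]
# parts now alternates: scheme, remainder, scheme, remainder, ...

# Glue each scheme back onto the remainder that follows it.
urls = [scheme + rest for scheme, rest in zip(parts[::2], parts[1::2])]

print(urls)
```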

But it is better to use findall with a lookahead on https? or the end of the string:

import re

string = "https://www.link1.net/abc/cik?xai=En8MmT__aF_nQm-F48&sig=Cg0A7_5AE&urlfix=1&;ccurl=https://aax-us.link-two.com/x/c/Qoj_sZnkA%2526adurl%253Dhttp%253A%252F%252Fwww.link-three.mu%252F"

print(re.findall("https?.*?(?=https?|$)",string))

result:

['https://www.link1.net/abc/cik?xai=En8MmT__aF_nQm-F48&sig=Cg0A7_5AE&urlfix=1&;ccurl=',
 'https://aax-us.link-two.com/x/c/Qoj_sZnkA%2526adurl%253D',
 'http%253A%252F%252Fwww.link-three.mu%252F']

As noted in the comments, since you cannot add : to the delimiter, you have no way of being sure where one URL ends and the next begins (if a URL contains http inside its address, you're toast).
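
One way to reduce that ambiguity, in line with the asker's remark about decoding the URLs first, is to percent-decode the string until it stops changing and then use the stricter delimiter https?://. This is only a sketch under that assumption: fully_unquote is a hypothetical helper, and repeated decoding also changes any percent-encoded data that was meant to stay encoded inside a URL.

```python
import re
from urllib.parse import unquote

string = ("https://www.link1.net/abc/cik?xai=En8MmT__aF_nQm-F48&sig=Cg0A7_5AE&urlfix=1&;ccurl="
          "https://aax-us.link-two.com/x/c/Qoj_sZnkA%2526adurl%253D"
          "http%253A%252F%252Fwww.link-three.mu%252F")


def fully_unquote(text, max_rounds=5):
    """Percent-decode repeatedly until nothing changes (hypothetical helper)."""
    for _ in range(max_rounds):
        decoded = unquote(text)
        if decoded == text:
            break
        text = decoded
    return text


decoded = fully_unquote(string)

# With "://" restored by decoding, the delimiter can include it, so a bare
# "http" inside a path or query no longer triggers a split.
urls = re.findall(r"https?://.*?(?=https?://|$)", decoded)

print(urls)
```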

edited Feb 7, 2018 at 20:41

answered Feb 7, 2018 at 20:31


1 Comment

Adrian Over a year ago

Tested and working well but the string methods worked faster in my large scale project. I used this for another issue tho.
