
I have a string containing URLs:

string = "https://www.link1.net/abc/cik?xai=En8MmT__aF_nQm-F48&sig=Cg0A7_5AE&urlfix=1&;ccurl=https://aax-us.link-two.com/x/c/Qoj_sZnkA%2526adurl%253Dhttp%253A%252F%252Fwww.link-three.mu%252F"

I want to extract all of them to have a result like this:

['https://www.link1.net/abc/cik?xai=En8MmT__aF_nQm-F48&sig=Cg0A7_5AE&urlfix=1&;ccurl=','https://aax-us.link-two.com/x/c/Qoj_sZnkA%2526adurl%253D','http%253A%252F%252Fwww.link-three.mu%252F']

I am trying:

import re

urls = [x for x in re.split('(http[s]?)', string) if x]
print(urls)

And the result is:

['https', '://www.link1.net/abc/cik?xai=En8MmT__aF_nQm-F48&sig=Cg0A7_5AE&urlfix=1&;ccurl=', 'https', '://aax-us.link-two.com/x/c/Qoj_sZnkA%2526adurl%253D', 'http', '%253A%252F%252Fwww.link-three.mu%252F']

How can I keep each complete URL together, given that it can start with 'http' or 'https'?

Any ideas please?

  • Use a lookahead (?=http). Also, no need to put s in a set [s] as it's interpreted literally by default (it doesn't have special meaning alone). Also, no need to check for s since http is all you really need to look for (think about it, who cares if there's an s at the end of http if http exists - it already satisfies your first requirement). Commented Feb 7, 2018 at 20:25
  • What is or are the URLs that you try to match? Where do they end? Do you consider the one starting with http%253a a valid URL? Commented Feb 7, 2018 at 20:28
  • This is a single url https://aax-us.link-two.com/x/c/Qoj_sZnkA%2526adurl%253Dhttp%253A%252F%252Fwww.link-three.mu%252F Commented Feb 7, 2018 at 20:47
  • Yes, you're right, http is all I really need. So the whole string comes from a URL redirection scheme in which I need to extract all the URLs in the chain. Now, I am decoding the urls before splitting them so all the urls are valid in the form of http://. Commented Feb 16, 2018 at 15:40
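
The lookahead suggested in the first comment can be sketched as follows. This is only an illustration, not the commenter's exact code: the helper name split_before_http is made up here, and splitting on a zero-width match with re.split assumes Python 3.7 or later.

```python
import re

string = ("https://www.link1.net/abc/cik?xai=En8MmT__aF_nQm-F48&sig=Cg0A7_5AE"
          "&urlfix=1&;ccurl=https://aax-us.link-two.com/x/c/Qoj_sZnkA"
          "%2526adurl%253Dhttp%253A%252F%252Fwww.link-three.mu%252F")


def split_before_http(text):
    """Split text just before every 'http', keeping 'http' in each piece.

    The lookahead (?=http) matches an empty position, so nothing is
    consumed by the split. Empty pieces (for example the one produced
    when the text itself starts with 'http') are dropped.
    """
    return [part for part in re.split(r"(?=http)", text) if part]


urls = split_before_http(string)
for url in urls:
    print(url)
```

Because 'https' also begins with 'http', a single lookahead covers both schemes, which is the point the comment makes.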

2 Answers

2

Without using re, you can handle this problem as follows:

['http' + x for x in filter(lambda x: x, string.split('http'))]

The result will be:

['https://www.link1.net/abc/cik?xai=En8MmT__aF_nQm-F48&sig=Cg0A7_5AE&urlfix=1&;ccurl=', 'https://aax-us.link-two.com/x/c/Qoj_sZnkA%2526adurl%253D', 'http%253A%252F%252Fwww.link-three.mu%252F']

3 Comments

I believe string methods are typically faster than re, so this is a better solution than the others presented so far.
use filter(None, string.split('http')), it's even cleaner. Otherwise good alternative to regex
Yes, faster indeed!
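
One edge case worth keeping in mind with the str.split approach: if the input does not start with 'http', the one-liner also prefixes 'http' to the leading non-URL text. A possible variant that sets that leading piece aside is sketched below; the function name split_on_http and the sample inputs are hypothetical, chosen only for illustration.

```python
def split_on_http(text, marker="http"):
    """Split text on marker and put the marker back on each piece.

    str.split returns whatever precedes the first marker as the first
    element. That piece is not a URL, so it is discarded rather than
    being prefixed with the marker. Empty pieces are skipped too.
    """
    head, *rest = text.split(marker)
    return [marker + piece for piece in rest if piece]


# Hypothetical input with some text before the first URL.
print(split_on_http("see: http://a.example/https://b.example/"))

# The original one-liner on the same input also turns the prefix into an item.
print(["http" + x for x in filter(None, "see: http://a.example/".split("http"))])
```

For input that starts with a URL, as in the question, both versions give the same list.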
1

You could take your result and join each pair of consecutive items; that would work.

urls = [urls[i]+urls[i+1] for i in range(0,len(urls),2)]

A better option is findall with a lookahead on https? or the end of the string:

import re

string = "https://www.link1.net/abc/cik?xai=En8MmT__aF_nQm-F48&sig=Cg0A7_5AE&urlfix=1&;ccurl=https://aax-us.link-two.com/x/c/Qoj_sZnkA%2526adurl%253Dhttp%253A%252F%252Fwww.link-three.mu%252F"

print(re.findall("https?.*?(?=https?|$)",string))

result:

['https://www.link1.net/abc/cik?xai=En8MmT__aF_nQm-F48&sig=Cg0A7_5AE&urlfix=1&;ccurl=',
 'https://aax-us.link-two.com/x/c/Qoj_sZnkA%2526adurl%253D', 
 'http%253A%252F%252Fwww.link-three.mu%252F']

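As noted in the comments, ':' cannot be added to the delimiter here (the last URL is percent-encoded, so it starts with http%253A rather than http:). That means there is no way to be sure where each URL ends: if a URL contains http inside its address, the split will be wrong.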
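
If the string is percent-decoded first, as the asker mentions doing in the comments on the question, the delimiter can include '://' and the ambiguity shrinks. The sketch below assumes that decoding every layer is acceptable for the data at hand; fully_unquote and extract_urls are illustrative names, and the max_rounds cap is an arbitrary safety limit.

```python
import re
from urllib.parse import unquote


def fully_unquote(text, max_rounds=10):
    """Percent-decode repeatedly until the text stops changing."""
    for _ in range(max_rounds):
        decoded = unquote(text)
        if decoded == text:
            break
        text = decoded
    return text


def extract_urls(text):
    """Decode text, then cut it at every 'http://' or 'https://'."""
    decoded = fully_unquote(text)
    # Lazy match up to the next scheme-plus-'://' or the end of the string.
    return re.findall(r"https?://.*?(?=https?://|$)", decoded)


chain = ("https://www.link1.net/abc/cik?xai=En8MmT__aF_nQm-F48&sig=Cg0A7_5AE"
         "&urlfix=1&;ccurl=https://aax-us.link-two.com/x/c/Qoj_sZnkA"
         "%2526adurl%253Dhttp%253A%252F%252Fwww.link-three.mu%252F")

for url in extract_urls(chain):
    print(url)
```

With '://' in the pattern, a path segment such as /httpstatus no longer starts a new item. Note that decoding changes the pieces themselves (the last one becomes http://www.link-three.mu/), so this fits the case where decoded URLs are what is wanted.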

1 Comment

Tested and working well but the string methods worked faster in my large scale project. I used this for another issue tho.
