
I have written code that fetches the HTML of a given site, extracts all the links from it, and saves them in a list. My goal is to replace all the relative links in the HTML file with absolute links.

Here are the links:

src="../styles/scripts/jquery-1.9.1.min.js"
href="/PhoneBook.ico"
href="../css_responsive/fontsss.css"
src="http://www.google.com/adsense/search/ads.js"
L.src = '//www.google.com/adsense/search/async-ads.js'
href="../../"
src='../../images/plus.png'
vrUrl ="search.aspx?searchtype=cat"

These are a few links that I copied from the HTML file to keep the question simple and less error-prone.

The following are the different kinds of URLs used in the HTML file:

http://yourdomain.com/images/example.png
//yourdomain.com/images/example.png
/images/example.png
images/example.png
../images/example.png
../../images/example.png

Python code:

import re

# html holds the fetched page source
linkList = re.findall(re.compile(u'(?<=href=").*?(?=")|(?<=href=\').*?(?=\')|(?<=src=").*?(?=")|(?<=src=\').*?(?=\')|(?<=action=").*?(?=")|(?<=vrUrl =").*?(?=")|(?<=\')//.*?(?=\')'), str(html))

newLinks = []
for link1 in linkList:
    if (link1.startswith("//")):
        newLinks.append(link1)
    elif (link1.startswith("../")):
        newLinks.append(link1)
    elif (link1.startswith("../../")):
        newLinks.append(link1)
    elif (link1.startswith("http")):
        newLinks.append(link1)
    elif (link1.startswith("/")):
        newLinks.append(link1)
    else:
        newLinks.append(link1)

The problem is that the second condition, startswith("../"), matches every URL that starts with "../", including those that start with "../../", so the "../../" branch is never reached. This is not the behavior I want. The same goes for "/": it also matches URLs starting with "//". I also tried the start and end parameters of startswith, but that did not solve the issue.

  • How about swapping the order of the elif statements, such that ../../ is checked before ../? Commented Jul 23, 2016 at 10:39
  • This would work, but some specific URLs, such as all those starting with "//", need to remain the same, and when execution reaches the "/" check it will still match the ones with two slashes. Commented Jul 23, 2016 at 10:44
  • It's dirty, but you could use an empty elif statement (using pass) in that case... Commented Jul 23, 2016 at 10:45
  • Currently the only solution that comes to mind is to replace the single regex with a separate regex for each URL pattern, so that I can restrict the number of "/" in the URL. But I don't like the idea of using so many regexes for this. There has to be a simpler way. Commented Jul 23, 2016 at 10:46
  • @jojonas You don't get my point: if I "pass" in the double-slash branch, those URLs will still qualify in the single-slash statement. Commented Jul 23, 2016 at 10:47
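
As an illustration of the reordering suggested in the comments, here is a minimal sketch that tests the most specific prefix first, so "//" is handled before "/", and that counts the leading "../" segments instead of giving each depth its own branch. The function name classify and the labels it returns are hypothetical and exist only for this example.

```python
def classify(link):
    """Label a URL by its prefix, testing the most specific prefixes first."""
    if link.startswith(("http://", "https://")):
        return "absolute"
    if link.startswith("//"):
        # Must be tested before "/", which would also match.
        return "protocol-relative"
    if link.startswith("/"):
        return "root-relative"
    if link.startswith("../"):
        # Count only the leading "../" segments (each is 3 characters long).
        depth = 0
        while link.startswith("../", depth * 3):
            depth += 1
        return "parent-relative:%d" % depth
    return "document-relative"


for link in ["//www.google.com/adsense/search/async-ads.js",
             "/PhoneBook.ico",
             "../../images/plus.png"]:
    print(link, "->", classify(link))
```

Because the branches form an if/return chain, a link that has already matched "//" never reaches the "/" test.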

1 Answer

How about using the str.count method:

>>> src="../styles/scripts/jquery-1.9.1.min.js"
>>> src2='../../images/plus.png'
>>> src.count('../')
1
>>> src2.count('../')
2

This seems to work because ../ only occurs at the beginning of the URLs.
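
If the end goal is to turn every relative link into an absolute one, a different technique from the counting above may avoid the prefix checks altogether: the standard library's urljoin resolves "../", "/", "//" and plain relative paths against a base URL. The sketch below assumes Python 3 (in Python 2 the function lives in the urlparse module). The base URL and the helper name to_absolute are hypothetical, chosen only for this example.

```python
from urllib.parse import urljoin

# Hypothetical URL of the page the links were extracted from.
BASE_URL = "http://yourdomain.com/a/b/page.html"


def to_absolute(link, base=BASE_URL):
    """Resolve a link against the page URL; absolute links come back unchanged."""
    return urljoin(base, link)


links = [
    "../styles/scripts/jquery-1.9.1.min.js",
    "/PhoneBook.ico",
    "//www.google.com/adsense/search/async-ads.js",
    "../../images/plus.png",
    "search.aspx?searchtype=cat",
]
for link in links:
    print(link, "->", to_absolute(link))
```

Note that urljoin also rewrites "//" links by giving them the scheme of the base URL, so links that must stay untouched would still need to be skipped before the call.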
