I have written code that fetches the HTML of any given site, extracts all the links from it, and saves them in a list. My goal is to replace all the relative links in the HTML file with absolute links.
Here are the links:
src="../styles/scripts/jquery-1.9.1.min.js"
href="/PhoneBook.ico"
href="../css_responsive/fontsss.css"
src="http://www.google.com/adsense/search/ads.js"
L.src = '//www.google.com/adsense/search/async-ads.js'
href="../../"
src='../../images/plus.png'
vrUrl ="search.aspx?searchtype=cat"
These are a few links that I copied from the HTML file to keep the question simple and less error-prone.
These are the different URL forms used in the HTML file:
http://yourdomain.com/images/example.png
//yourdomain.com/images/example.png
/images/example.png
images/example.png
../images/example.png
../../images/example.png
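For reference, here is a minimal sketch of how each of these forms could be resolved against a base URL with the standard library's urllib.parse.urljoin. The BASE_URL value is a hypothetical page address chosen for illustration; it is an assumption, not something taken from the actual site.

```python
from urllib.parse import urljoin

# Hypothetical URL of the page the HTML was fetched from (an assumption).
BASE_URL = "http://yourdomain.com/a/b/page.html"

# The URL forms listed above.
links = [
    "http://yourdomain.com/images/example.png",
    "//yourdomain.com/images/example.png",
    "/images/example.png",
    "images/example.png",
    "../images/example.png",
    "../../images/example.png",
]

# urljoin resolves every form (absolute, protocol-relative, root-relative,
# document-relative, and "../" paths) against the base URL.
absolute = [urljoin(BASE_URL, link) for link in links]

for link, resolved in zip(links, absolute):
    print(f"{link!r:45} -> {resolved}")
```

With this base URL, "images/example.png" resolves to http://yourdomain.com/a/b/images/example.png and "../images/example.png" to http://yourdomain.com/a/images/example.png, while the other four forms all resolve to http://yourdomain.com/images/example.png.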
Python code:
import re
linkList = re.findall(re.compile(u'(?<=href=").*?(?=")|(?<=href=\').*?(?=\')|(?<=src=").*?(?=")|(?<=src=\').*?(?=\')|(?<=action=").*?(?=")|(?<=vrUrl =").*?(?=")|(?<=\')//.*?(?=\')'), str(html))
newLinks = []
for link1 in linkList:
    if link1.startswith("//"):
        newLinks.append(link1)
    elif link1.startswith("../"):
        newLinks.append(link1)
    elif link1.startswith("../../"):
        newLinks.append(link1)
    elif link1.startswith("http"):
        newLinks.append(link1)
    elif link1.startswith("/"):
        newLinks.append(link1)
    else:
        newLinks.append(link1)
At this point, when execution reaches the second condition, "../", it matches every URL that starts with "../", including those that start with "../../". This is not the behavior I want. The same goes for "/": it also matches URLs starting with "//". I also tried using the start and end parameters of the startswith function, but that did not solve the issue.
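The overlap described above comes from the order of the checks: any string that starts with "../../" also starts with "../", and any string that starts with "//" also starts with "/". A minimal sketch of one way to separate the cases is to test the longer prefix before the shorter one. The function name classify and the category labels are hypothetical, introduced only for illustration.

```python
def classify(link):
    """Return a label for the kind of URL, testing longer prefixes first."""
    if link.startswith(("http://", "https://")):
        return "absolute"
    if link.startswith("//"):        # must come before the "/" check
        return "protocol-relative"
    if link.startswith("/"):
        return "root-relative"
    if link.startswith("../../"):    # must come before the "../" check
        return "two-levels-up"       # also matches deeper paths like "../../../"
    if link.startswith("../"):
        return "one-level-up"
    return "document-relative"


# Sample links taken from the question.
samples = [
    "../styles/scripts/jquery-1.9.1.min.js",
    "/PhoneBook.ico",
    "http://www.google.com/adsense/search/ads.js",
    "//www.google.com/adsense/search/async-ads.js",
    "../../images/plus.png",
    "search.aspx?searchtype=cat",
]

for link in samples:
    print(classify(link), link)
```

Because each branch returns as soon as it matches, a link can only ever land in one category, and the more specific prefixes are never shadowed by the shorter ones.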
elif statements, such that ../../ is checked before ../? elif statement (using pass) in that case...