I am working on a web crawler these days. When my crawler gathers the links on a site, some of the URLs look like: about.html , /pages , #form-login , javascript:validate(); , ../help , ../../ , ./ .
I have tried urllib's urlparse and urljoin functions and the os module's path join. Below is the part of my project's code that is related to the question.
from urllib.parse import urlparse, urljoin

base_url = input('Enter base url : ')

def make_links(link):
    u = urlparse(link)
    if link[:3] == 'www':
        # urlparse returns a named tuple, so use attribute access, not u['scheme']
        link = u.scheme + '://' + link
    elif link[:1] == '/':
        link = base_url + link
    elif link[:3] == '../':
        link = urljoin(base_url, link)
    elif link[:2] == './':
        link = urljoin(base_url, link)
    else:
        link = base_url + '/' + link
    print(link)

while True:
    i = input("Enter your url : ")
    if i == 'exit':
        break
    else:
        make_links(i)
I expect the relative URL entered by the user to be resolved against the base URL entered earlier. When the user enters an absolute URL as base_url and then enters a relative URL, the output should be the absolute URL at which the page can be reached in a browser. The program should also support any form of relative URL. If you want to know the ways relative URLs can be represented, refer to this:
http://webreference.com/html/tutorial2/3.html
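As far as I understand, urljoin already implements the standard (RFC 3986) resolution rules, so a single call should cover every relative form listed above without the manual prefix checks. A minimal sketch of what I mean, using a made-up base URL for illustration:

```python
from urllib.parse import urljoin

# hypothetical base URL, just for demonstration
base = 'http://example.com/a/b/page.html'

# urljoin resolves each relative form against the base
print(urljoin(base, 'about.html'))   # http://example.com/a/b/about.html
print(urljoin(base, '/pages'))       # http://example.com/pages
print(urljoin(base, '../help'))      # http://example.com/a/help
print(urljoin(base, '../../'))       # http://example.com/
print(urljoin(base, './'))           # http://example.com/a/b/
print(urljoin(base, '#form-login'))  # http://example.com/a/b/page.html#form-login
```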
It should not execute JavaScript when the program comes across URLs like
javascript:alert('foo-bar')
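One way I could skip such links is to check the scheme with urlparse before resolving anything. A sketch under the assumption that only http/https (and relative) links should be crawled; the helper name is my own:

```python
from urllib.parse import urlparse

def is_crawlable(link):
    """Return False for javascript:, mailto:, and other non-HTTP schemes."""
    scheme = urlparse(link).scheme
    # an empty scheme means the link is relative, which is fine
    return scheme in ('', 'http', 'https')

print(is_crawlable("javascript:alert('foo-bar')"))  # False
print(is_crawlable('../help'))                      # True
```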