
I am working on a web crawler these days. When my crawler gathers the links on a site, some of the URLs look like these: about.html , /pages , #form-login , javascript:validate(); , ../help , ../../ , ./

I have tried urllib's urlparse and urljoin, and the os module's join functions. Below is the part of my project's code that is related to the question.


from urllib.parse import urlparse, urljoin

base_url = input('Enter base url : ')


def make_links(link):
    u = urlparse(link)
    if link[:3] == 'www':
        link = u['scheme'] + link
    elif link[:1] == '/':
        link = base_url + link
    elif link[:3] == '../':
        link = urljoin(base_url, link)
    elif link[:2] == './':
        link = urljoin(base_url, link)
        link = base_url + '/' + link
    print(link)


while True:
    i = input("Enter your url : ")
    if i == 'exit':
        break
    else:
        make_links(i)

I expect the output for a relative URL entered by the user to be resolved against the base URL entered by the user. When the user enters an absolute URL as base_url and then enters a relative URL, the output should be the absolute URL at which the page can be opened in a browser. The program should also support every form of relative URL. For the ways relative URLs can be written, refer to this:

http://webreference.com/html/tutorial2/3.html

It should not execute javascript when the program comes across URLs like javascript:alert('foo-bar') 😜 😜 😜

  • Do you have sample user inputs for testing? Commented Aug 8, 2019 at 9:10

1 Answer

urljoin does most of the heavy lifting for you. Hence, something as simple as this would do the trick:

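from urllib.parse import urljoin, urlparse


def make_links(link):
    # Resolve the link against the module-level base_url from the question.
    url = urljoin(base_url, link)
    parsed = urlparse(url)
    if parsed.scheme not in ('http', 'https'):
        # not crawlable, e.g. javascript:, mailto:, etc.
        return None
    return url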

Example:

Enter base url : http://example.com/dir1/file.php
Enter your url : ../dir2
http://example.com/dir2
Enter your url : #hello
http://example.com/dir1/file.php#hello
Enter your url : javascript: return false
None
Enter your url : /world
http://example.com/world
Enter your url : www.test.com
http://example.com/dir1/www.test.com
Enter your url : http://www.test.com
http://www.test.com
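
The same resolution can also be checked without the interactive prompts. The following is only a sketch: resolve_link is a hypothetical name, it takes the base URL as a parameter instead of reading a global, and the input list reuses the link forms from the question plus www.test.com.

```python
from urllib.parse import urljoin, urlparse


def resolve_link(base_url, link):
    """Return an absolute http(s) URL for link, or None if it is not crawlable."""
    url = urljoin(base_url, link.strip())
    # Schemes such as javascript: or mailto: are returned by urljoin unchanged,
    # so they are filtered out here and never executed or fetched.
    if urlparse(url).scheme not in ("http", "https"):
        return None
    return url


base = "http://example.com/dir1/file.php"
links = ["about.html", "/pages", "#form-login", "javascript:validate();",
         "../help", "./", "www.test.com"]
for link in links:
    print(repr(link), "->", resolve_link(base, link))
```

Passing the base URL explicitly makes the function easier to reuse in a crawler, where the base changes with every page that is fetched.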

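As you can see, the only downside is that absolute links must start with a scheme such as http://; without one they are treated as relative paths, as happens with www.test.com. This actually makes sense, as there are no strict rules: a website could use www as the name of a sub-resource.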
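
If a crawler should nevertheless treat scheme-less links that begin with www. as hosts, one option is to add a scheme before joining. This is a heuristic and an assumption on my part, not something urljoin does; resolve_link_guessing_host and default_scheme are hypothetical names. Protocol-relative links such as //www.test.com/a need no special handling, since urljoin takes the scheme from the base URL.

```python
from urllib.parse import urljoin, urlparse


def resolve_link_guessing_host(base_url, link, default_scheme="http"):
    """Like a plain urljoin-based resolver, but guesses that 'www.' means a host."""
    link = link.strip()
    # Heuristic (assumption): a scheme-less link starting with "www." is a host name.
    # A site that really has a relative path named "www.something" would be misread.
    if link.lower().startswith("www."):
        link = f"{default_scheme}://{link}"
    url = urljoin(base_url, link)
    # Keep only links a crawler can fetch over HTTP(S).
    if urlparse(url).scheme not in ("http", "https"):
        return None
    return url


base_url = "http://example.com/dir1/file.php"
print(resolve_link_guessing_host(base_url, "www.test.com"))
print(resolve_link_guessing_host(base_url, "//www.test.com/a"))
```

Whether this guess is acceptable depends on the sites being crawled, so it is probably best kept optional.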
