
I have a list of URLs. I'd like to see the server response code for each and find out whether any are broken. I can handle server errors (500) and broken links (404) okay, but the code breaks as soon as a non-website is read (e.g. "notawebsite_broken.com"). I've searched around and haven't found the answer... I hope you can help.

Here's the code:

import urllib2

#List of URLs. The third URL is not a website
urls = ["http://www.google.com","http://www.ebay.com/broken-link",
"http://notawebsite_broken"]

#Empty list to store the output
response_codes = []

# Run "for" loop: get server response code and save results to response_codes
for url in urls:
    try:
        connection = urllib2.urlopen(url)
        response_codes.append(connection.getcode())
        connection.close()
        print url, ' - ', connection.getcode()
    except urllib2.HTTPError, e:
        response_codes.append(e.getcode())
        print url, ' - ', e.getcode()

print response_codes

This gives the output of...

http://www.google.com  -  200
http://www.ebay.com/broken-link  -  404
Traceback (most recent call last):
  File "test.py", line 12, in <module>
    connection = urllib2.urlopen(url)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 404, in open
    response = self._open(req, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 422, in _open
    '_open', req)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1214, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1184, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error [Errno 8] nodename nor servname provided, or not known>

Does anyone know a fix for this or can anyone point me in the right direction?

3 Answers


You could use requests:

import requests

urls = ["http://www.google.com","http://www.ebay.com/broken-link",
"http://notawebsite_broken"]

for u in urls:
    try:
        r = requests.get(u)
        print "{} {}".format(u,r.status_code)
    except Exception,e:
        print "{} {}".format(u,e)

http://www.google.com 200
http://www.ebay.com/broken-link 404
http://notawebsite_broken HTTPConnectionPool(host='notawebsite_broken', port=80): Max retries exceeded with url: /

1 Comment

+1 Very nice. Changing the code to use except Exception as e will make it work in Python 3.x.x too.

When urllib2.urlopen() fails to connect to the server, or fails to resolve the IP of the host, it raises a URLError instead of HTTPError. You'll need to catch urllib2.URLError in addition to urllib2.HTTPError to deal with those cases.
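A minimal sketch of that approach (written here with Python 3's urllib.request/urllib.error, the successors to urllib2; under Python 2 the same names live on urllib2 itself, i.e. urllib2.urlopen, urllib2.HTTPError, urllib2.URLError):

```python
import urllib.request
import urllib.error

def check(url):
    """Return the HTTP status code for url, or None if the request
    failed before any HTTP response arrived (DNS failure, connection
    refused, and so on)."""
    try:
        return urllib.request.urlopen(url).getcode()
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status (404, 500, ...)
        return e.code
    except urllib.error.URLError as e:
        # No HTTP response at all: bad hostname, refused connection, ...
        print(url, '-', e.reason)
        return None
```

Note that HTTPError is itself a subclass of URLError, so the HTTPError clause has to come first; alternatively, you can catch URLError alone and inspect the exception to see which case you got.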

Comments


The API for the urllib2 library is a nightmare.

Many people, myself included, strongly recommend using the requests package instead.

One of the nicer things about requests is that all request failures inherit from a single base exception class. When you use urllib2 "raw", a number of exceptions can be raised: from urllib2 itself, from the socket module, and possibly others (I can't remember exactly, but it's messy).

tl;dr -- just use the requests library.
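To illustrate the point about the base class (a sketch assuming the requests package is installed): catching requests.exceptions.RequestException is enough to cover connection errors, timeouts, and similar transport failures, because they all inherit from it.

```python
import requests

def status_of(url):
    """Return the HTTP status code, or an error description string if
    the request failed before any response arrived."""
    try:
        return requests.get(url, timeout=10).status_code
    except requests.exceptions.RequestException as e:
        # ConnectionError, Timeout, and friends all derive from
        # RequestException, so one clause handles every failure mode.
        return "failed: {}".format(e)
```

This is tighter than a bare `except Exception`, which would also swallow unrelated bugs in your own code.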

1 Comment

Nice. Requests is so much easier! Thanks.
