I'm trying to use requests to download a list of URLs and catch the exception when a URL is bad. Here's my test code:

import requests
from requests.exceptions import ConnectionError

#goodurl
url = "http://www.google.com"

#badurl with good host
#url = "http://www.google.com/thereisnothing.jpg"

#url with bad host
#url = "http://somethingpotato.com"    

print(url)
try:
    r = requests.get(url, allow_redirects=True)
    print("the url is good")
except ConnectionError as e:
    print(e)
    print("the url is bad")

If I pass in url = "http://www.google.com", everything works as expected, since it is a good URL:

http://www.google.com
the url is good

But if I pass in url = "http://www.google.com/thereisnothing.jpg", I still get:

http://www.google.com/thereisnothing.jpg
the url is good

So it's almost as if it isn't even looking at anything after the "/".

Just to see whether the error checking works at all, I passed a bad hostname: url = "http://somethingpotato.com"

Which kicked back the error message I expected:

http://somethingpotato.com
HTTPConnectionPool(host='somethingpotato.com', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f1b6cd15b90>: Failed to establish a new connection: [Errno -2] Name or service not known',))
the url is bad

What am I missing to make requests catch a bad URL, not just a bad hostname?

Thanks

2 Answers

requests does not raise an exception for a 404 response. Instead, you need to filter such responses out by checking whether the status is 'ok' (HTTP 200):

import requests
from requests.exceptions import ConnectionError

#bad url with good host (returns 404)
url = "http://www.google.com/nothing"

#another bad url with good host
#url = "http://www.google.com/thereisnothing.jpg"

#url with bad host
#url = "http://somethingpotato.com"    

print(url)
try:
    r = requests.get(url, allow_redirects=True)
    if r.status_code == requests.codes.ok:
        print("the url is good")
    else:
        print("the url is bad")
except ConnectionError as e:
    print(e)
    print("the url is bad")
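An alternative, sketched here in Python 3 syntax: Response.raise_for_status() raises requests.exceptions.HTTPError for 4xx and 5xx responses, so a 404 can land in the same except clause as a bad hostname. The check_url helper, its get parameter (there only so the function can be tried without network access) and the 10-second timeout are assumptions made for illustration, not part of the code above.

```python
import requests
from requests.exceptions import RequestException


def check_url(url, get=requests.get):
    """Return (True, None) if the URL responds without an HTTP error,
    otherwise (False, the exception that was raised).

    `get` is a hypothetical injection point; it defaults to requests.get.
    """
    try:
        response = get(url, allow_redirects=True, timeout=10)
        # Turns 4xx/5xx status codes into requests.exceptions.HTTPError.
        response.raise_for_status()
    except RequestException as exc:
        # RequestException is the base class of ConnectionError,
        # HTTPError and Timeout, so all of them are caught here.
        return False, exc
    return True, None
```

With this, check_url("http://www.google.com/thereisnothing.jpg") should come back as False together with an HTTPError, and the exception object itself can be printed or written to a file.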

EDIT:

import requests
from requests.exceptions import ConnectionError

def printFailedUrl(url, response):
    if isinstance(response, ConnectionError):
        print("The url " + url + " failed to connect with the exception " + str(response))
    else:
        print("The url " + url + " produced the failed response code " + str(response.status_code))

def testUrl(url):
    try:
        r = requests.get(url, allow_redirects=True)
        if r.status_code == requests.codes.ok:
            print("the url is good")
        else:
            printFailedUrl(url, r)
    except ConnectionError as e:
        printFailedUrl(url, e)

def main():
    testUrl("http://www.google.com") #'Good' url
    testUrl("http://www.google.com/doesnotexist.jpg") #'Bad' url with 404 response
    testUrl("http://sdjgb") #'Bad' url with inaccessible host

main()

In this case, one function handles both cases: it can be passed either an exception or a response. This lets you report a non-'good' (non-200) response differently from an unusable URL that throws an exception. Hope this has the information you need.
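If the goal is to write failures to a file rather than print them, the same two-way handling can build a message and hand it to the logging module. This is only a sketch in Python 3 syntax: describe_failure, check_and_log, the "url_check" logger name and the get parameter are hypothetical names introduced here. The isinstance check works because the caller passes in either the exception it caught or the Response it received, and these are different types.

```python
import logging

import requests

log = logging.getLogger("url_check")  # hypothetical logger name


def describe_failure(url, outcome):
    """Build a message from either a caught exception or a Response."""
    if isinstance(outcome, requests.exceptions.ConnectionError):
        # No response was received at all (bad host, refused connection, ...).
        return "{} failed to connect: {}".format(url, outcome)
    # Otherwise outcome is a Response whose status code was not 200.
    return "{} returned status code {}".format(url, outcome.status_code)


def check_and_log(url, get=requests.get):
    """Return True for a 200 response; log a description of anything else."""
    try:
        response = get(url, allow_redirects=True, timeout=10)
    except requests.exceptions.ConnectionError as exc:
        log.error(describe_failure(url, exc))
        return False
    if response.status_code != requests.codes.ok:
        log.error(describe_failure(url, response))
        return False
    return True
```

Calling logging.basicConfig(filename="failed_urls.log", level=logging.ERROR) once at startup (the file name is made up) would send these messages to a file instead of the console.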

6 Comments

I was hoping to pass in the actual error (e), since I want to write the error to a file. Perhaps instead of printing "url is bad" in the else statement I can print r.status_code (which should be bad if it made it to the else statement). Would you suggest that?
@chowpay The issue is that it is not actually an error, so you won't be able to catch an error to print. I would suggest making a custom function to print out an "error" url, which could include the url and the status code. If you would like an example, I could add one to my answer.
@chowpay see if the code I added at the bottom does what you need
@chowpay No problem!
Question: for "isinstance(response, ConnectionError)", since you're feeding the function a url and a response, how does "isinstance" know whether your "response" is a status code error or a ConnectionError?

What you want is to check r.status_code. Getting r.status_code for "http://www.google.com/thereisnothing.jpg" will give you 404. You can add a condition so that only URLs returning a 200 code count as "good".
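Since the question mentions a list of URLs, here is a rough sketch (Python 3 syntax) of applying that 200-only condition across a list. status_of, good_urls, the get parameter and the timeout value are hypothetical choices made for this example.

```python
import requests


def status_of(url, get=requests.get):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        return get(url, allow_redirects=True, timeout=10).status_code
    except requests.exceptions.RequestException:
        # Bad host, timeout, etc.: there is no status code to report.
        return None


def good_urls(urls, get=requests.get):
    """Keep only the URLs that answer with a 200 status code."""
    return [u for u in urls if status_of(u, get=get) == requests.codes.ok]
```

Anything that returns 404, or that fails before a response is received, is simply left out of the result.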
