How can I verify that a page URL exists and does not redirect to a "not found" page?
Example:

import socket
try:
    socket.gethostbyname('www.google.com/imghp')
except socket.gaierror as ex:
    print "Not existe"

It always prints "Not existe".

  • What are you trying to accomplish? socket.gethostbyname doesn't take a URL. Probably you want to make an HTTP request, which is a totally different API. Commented Mar 8, 2015 at 12:32
  • gethostbyname() can be used with a host name, not with an (incomplete) URL. Try gethostbyname('www.google.com'). Commented Mar 8, 2015 at 12:33

2 Answers


You're using the wrong tool for the task!

From the manual:

socket.gethostbyname(hostname)

Translate a host name to IPv4 address format. The IPv4 address is returned as a string, such as '100.50.200.5'. If the host name is an IPv4 address itself it is returned unchanged. See gethostbyname_ex() for a more complete interface. gethostbyname() does not support IPv6 name resolution, and getaddrinfo() should be used instead for IPv4/v6 dual stack support.

That function resolves a host name to its IP address, which also tells you whether the domain exists:

>>> import socket
>>> try:
...     print(socket.gethostbyname('www.google.com'))
... except socket.gaierror as ex:
...     print("Does not exist")
...
216.58.211.132
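
The lookup in the question fails because 'www.google.com/imghp' is a host name plus a path, and no host has that name. Here is a minimal sketch of how the host part could be pulled out of a URL before resolving it, using urllib.parse from the Python 3 standard library. The helper names host_from_url and domain_resolves are made up for this illustration.

```python
import socket
from urllib.parse import urlsplit


def host_from_url(url):
    """Return the host name part of a URL, or None if there is none."""
    # urlsplit() only recognises the host when a scheme or '//' is present,
    # so a bare 'www.google.com/imghp' gets a '//' prefix first.
    if "//" not in url:
        url = "//" + url
    return urlsplit(url).hostname


def domain_resolves(url):
    """Return True if the URL's host name resolves to an IPv4 address."""
    host = host_from_url(url)
    if not host:
        return False
    try:
        socket.gethostbyname(host)
    except socket.gaierror:
        return False
    return True
```

Note that this only tells you the domain exists, not that the page behind the path does.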

What you probably want is to actually connect to the site and check whether the page is there:

>>> import requests
>>> response = requests.head('http://www.google.com/imghp')
>>> if response.status_code == 404:
...     print("Does not exist")
... else:
...     print("Exists")
...
Exists

The .head() method of requests only fetches the response headers from the web server, not the page itself, so it's very lightweight in terms of network usage.

Spoiler alert: if you try to read the contents of the page using response.content, you'll get nothing; for that you need to use the .get() method.
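
The check above only treats 404 as "missing", while servers can also answer with redirects (3xx) or with other errors such as 403, 500 or 503. Here is a small sketch of a coarser classification. The function name classify_status and its labels are assumptions made for this example, not part of requests.

```python
def classify_status(status_code):
    """Map an HTTP status code to a rough label (illustrative only)."""
    if 200 <= status_code < 300:
        return "exists"
    if 300 <= status_code < 400:
        # The server points elsewhere; the target still has to be checked.
        return "redirect"
    if status_code in (404, 410):
        return "missing"
    if 400 <= status_code < 600:
        # 403, 500, 503 and friends: the page may exist but is not usable now.
        return "error"
    return "unknown"
```

With requests this could be called as classify_status(response.status_code).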


update #1

The site you're checking against is broken, i.e. it does not follow Internet standards. Instead of returning a 404, it returns a 302 that redirects to a "page does not exist" page, which itself answers with a status code of 200:

>>> response = requests.head('http://qamarsoft.com/does_not_exists', allow_redirects=True)
>>> response.status_code
200

To sort that out, request the page without following redirects and check whether the redirect target (the Location header) contains 404:

>>> response = requests.head('http://qamarsoft.com/does_not_exists')
>>> response.headers['location']
'http://qamarsoft.com/404'

So the test would become:

>>> response = requests.head('http://qamarsoft.com/does_not_exists')
>>> if '404' in response.headers.get('location', ''):
...     print('Does not exist')
... else:
...     print('Exists')
...
Does not exist
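
A slightly more defensive sketch of the same idea: treat a real 404 as missing, and treat a redirect as missing only when the path of its Location target contains a marker such as 404. The names looks_missing and page_is_missing and the default marker are assumptions for illustration; other sites may name their error page differently.

```python
from urllib.parse import urlsplit


def looks_missing(status_code, location=None, markers=("404",)):
    """Decide from a status code and Location header whether a page is missing."""
    if status_code == 404:
        return True
    if 300 <= status_code < 400 and location:
        # Only inspect the path, so a '404' in the host or port does not match.
        path = urlsplit(location).path
        return any(marker in path for marker in markers)
    return False


def page_is_missing(url):
    """Network wrapper; needs the third-party requests package."""
    import requests

    # Redirects are not followed, so the 302 and its Location header
    # stay visible to the check.
    response = requests.head(url, allow_redirects=False, timeout=10)
    return looks_missing(response.status_code, response.headers.get("location"))
```

Keeping the decision in a separate function from the network call makes it possible to try the rule without touching the site.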

update #2

For the second URL, you can try it out yourself in the Python console:

>>> import requests
>>> response = requests.head('http://www.***********.ma/does_not_Exists')
>>> if response.status_code == 404:
...     print("Does not exist")
... else:
...     print("Exists")
...
Does not exist
>>> response = requests.head('http://www.***********.ma/annonceur/a/3550/n.php')
>>> if response.status_code == 404:
...     print("Does not exist")
... else:
...     print("Exists")
...
Exists

Nota Bene

You might need to install the requests package first:

pip install requests

or, if you're modern and use Python 3:

pip3 install requests

8 Comments

Try using your code with the page qamarsoft.com/HttpUrlDontExist: it will return Exists!
Because it outputs a 302 code to get to a 404 page, which in turn gives a 200 status code. The given site is broken, not my code! :-)
Thanks, brother. I'm using your first snippet with a test on the 200 response.
Well, don't miss all the other error statuses, like 500, 503, 403, etc. As for the second site you gave, you wanted to check the existence of a page on a site that does not respect Internet standards. All in all, you might have a valid page that does a 302 redirect followed by a 200 OK, while the other one you gave does the same but truly is a 404.
Can you please just delete the second test URL?

It's true that with gethostbyname() you will not get what you want done. Consider using urllib2. In your case the following could do what you want:

import urllib2

# The data variable can be used to send POST data
data = None
# Add as many header fields here as you wish
headers = {"User-agent": "Blahblah", "Cookie": "yourcookievalues"}
url = "http://www.google.com/imghp"
request = urllib2.Request(url, data, headers)
try:
    response = urllib2.urlopen(request)
    # Check redirection here
    if response.geturl() != url:
        print "The page at: " + url + " redirected to: " + response.geturl()
except urllib2.HTTPError as err:
    # Catch 404s etc.
    print "Failed with code: " + str(err)
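
urllib2 only exists in Python 2; in Python 3 its functionality lives in urllib.request and urllib.error. A rough Python 3 sketch of the same check follows. The function name check_url, its return shape, the User-Agent value and the opener parameter (there so the function can be tried without a network) are assumptions for this example.

```python
import urllib.error
import urllib.request


def check_url(url, opener=urllib.request.urlopen):
    """Return (True, final_url) on success, or (False, reason) on failure."""
    request = urllib.request.Request(url, headers={"User-Agent": "url-check-sketch"})
    try:
        with opener(request, timeout=10) as response:
            # geturl() differs from the requested URL if a redirect was followed.
            return True, response.geturl()
    except urllib.error.HTTPError as err:
        # Raised for 4xx and 5xx responses, e.g. a 404.
        return False, "HTTP %d" % err.code
    except urllib.error.URLError as err:
        # Raised when the host cannot be reached or resolved.
        return False, str(err.reason)
```

Comparing the returned URL with the one passed in shows whether the server redirected, as in the Python 2 version above.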

Hope this helps you out!

4 Comments

File "<stdin>", line 2, in <module> NameError: name 'request' is not defined
Anyway, python-requests does the same thing, but in fewer lines!
@zmo Well yeah, but if you take out my comments, blank lines and the variable declaration and just pass them directly to the function, like you did, then it's pretty much the same.
Your code is equivalent to: print("Exists" if requests.head(URL, allow_redirects=True).status_code != 404 else "Not Exists"), no need for the try block and the redirect check. There's a good reason why python-requests rocks :-)
