I am trying to get the Google PageRank for a list of domains, but I eventually get this error:

Python3:
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

I have tried several of the existing solutions for this error, but none of them solved my problem. Here is my code:

#  Script for getting Google Page Rank of page
#  Google Toolbar 3.0.x/4.0.x Pagerank Checksum Algorithm
#
#  original from http://pagerank.gamesaga.net/
#  this version was adapted from http://www.djangosnippets.org/snippets/221/
#  by Corey Goldberg - 2010
#
#  Licensed under the MIT license: http://www.opensource.org/licenses/mit-license.php


from __future__ import print_function, division
import sys
import urllib.request as _urlib1  # py3 
import urllib.parse as _urlib2  # py 3




def get_pagerank(url):
    hsh = check_hash(hash_url(url))
    user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
    gurl = 'http://toolbarqueries.google.com/tbr?client=navclient-auto&features=Rank&ch=%s&q=info:%s' % (hsh, _urlib2.quote(url))
    headers={'User-Agent':user_agent,}
    request=_urlib1.Request(gurl,None,headers) #The assembled request
    u = _urlib1.urlopen(request)
    s = u.read().decode('utf-8')  # for py2, comment .decode() part
    #print(s)  # debug - response of server
    rank = s.strip()[9:]
    if rank == '':
        rank = 'None'
    return rank


def int_str(string, integer, factor):
    for char in string:
        integer *= factor
        integer &= 0xFFFFFFFF
        integer += ord(char)
    return integer


def hash_url(string):
    c1 = int_str(string, 0x1505, 0x21)
    c2 = int_str(string, 0, 0x1003F)

    c1 >>= 2
    c1 = ((c1 >> 4) & 0x3FFFFC0) | (c1 & 0x3F)
    c1 = ((c1 >> 4) & 0x3FFC00) | (c1 & 0x3FF)
    c1 = ((c1 >> 4) & 0x3C000) | (c1 & 0x3FFF)

    t1 = (c1 & 0x3C0) << 4
    t1 |= c1 & 0x3C
    t1 = (t1 << 2) | (c2 & 0xF0F)

    t2 = (c1 & 0xFFFFC000) << 4
    t2 |= c1 & 0x3C00
    t2 = (t2 << 0xA) | (c2 & 0xF0F0000)

    return (t1 | t2)


def check_hash(hash_int):
    hash_str = '%u' % (hash_int)
    flag = 0
    check_byte = 0

    i = len(hash_str) - 1
    while i >= 0:
        byte = int(hash_str[i])
        if 1 == (flag % 2):
            byte *= 2
            byte = byte // 10 + byte % 10
        check_byte += byte
        flag += 1
        i -= 1

    check_byte %= 10
    if 0 != check_byte:
        check_byte = 10 - check_byte
        if 1 == flag % 2:
            if 1 == check_byte % 2:
                check_byte += 9
            check_byte >>= 1

    return '7' + str(check_byte) + hash_str

Can anybody help?

  • As a first step, I would catch the exception to see for which URL this happens Commented Jan 19, 2015 at 22:36
  • @Jasper the error occurs on all of the URLs. It seems that Google blocks queries after a certain number from a given IP. Do you have any suggestions on how I can get around that? Commented Jan 20, 2015 at 8:36
  • Since the blocking is under Google's control, there's not much you can do except use different IPs. Are you perhaps breaking some "fair use" conditions? Commented Jan 20, 2015 at 8:51
  • Yeah I also think that I have to find a mechanism to change my ip @Jasper. Do you have any suggestions? Commented Jan 20, 2015 at 11:54
  • I think you can find many hints on how to do that if you are behind a home router (many file sharing sites restrict downloads per IP). Commented Jan 20, 2015 at 12:16
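
The first suggestion in the comments, catching the exception to see which URL fails, could be sketched roughly as follows. This is only a sketch: `collect_failures` and its `fetch` parameter are hypothetical names, and `fetch` stands for any function that may raise `urllib.error.HTTPError`, such as `get_pagerank` from the question.

```python
import urllib.error


def collect_failures(urls, fetch):
    """Call fetch(url) for each URL and record HTTP errors instead of crashing.

    `fetch` is a hypothetical stand-in for any function that may raise
    urllib.error.HTTPError, for example get_pagerank from the question.
    """
    results = {}
    failures = {}
    for url in urls:
        try:
            results[url] = fetch(url)
        except urllib.error.HTTPError as err:
            # err.code is the HTTP status (403 in the question),
            # err.reason is the short message that came with it.
            failures[url] = (err.code, err.reason)
    return results, failures
```

Calling `collect_failures(domains, get_pagerank)` would then show whether the 403 affects every URL or only some of them.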

1 Answer

The problem is not IP blocking. I am using Python 3 and had the same issue. I found that Google blocks urllib requests that do not override the default User-Agent and Accept-Encoding headers.

The headers urllib sent by default for a test search:

GET /search?q=f1+2015 HTTP/1.1
Accept-Encoding: identity
Connection: close
User-Agent: Python-urllib/3.4
Host: 127.0.0.1:8076

I set 'Accept-Encoding' to '' and 'User-Agent' to 'testing', and the 403 error stopped.
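
A minimal sketch of that change, assuming the server only rejects the default `Python-urllib` identification. The helper name `build_request` and the example URL are hypothetical; `'testing'` and the empty `Accept-Encoding` are the values reported above.

```python
import urllib.request


def build_request(url, user_agent="testing", accept_encoding=""):
    """Return a Request whose headers replace urllib's defaults.

    Hypothetical helper; the default values are the ones reported above.
    """
    headers = {
        "User-Agent": user_agent,
        "Accept-Encoding": accept_encoding,
    }
    return urllib.request.Request(url, headers=headers)


request = build_request("http://example.com/")
# Headers set on the Request take precedence over the opener's defaults,
# so "Python-urllib/3.x" is no longer sent as the User-Agent.
print(request.header_items())

# Sending it requires network access:
# with urllib.request.urlopen(request) as response:
#     body = response.read().decode("utf-8")
```

Note that `urllib` stores header names capitalized, so the values are read back with `request.get_header("User-agent")` and `request.get_header("Accept-encoding")`.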

2 Comments

You can also use a User-Agent header from any browser, such as Firefox, Chrome, etc.
@ForceBru well duh!! More useful to say "You cannot use the User-Agent from urllib".
