urllib.error.HTTPError: HTTP Error 403: Forbidden in my web scraping

Question

I am trying to write a web scraping script that tells me whether a website runs on WordPress or not, but I get this error:

urllib.error.HTTPError: HTTP Error 403: Forbidden

I don't understand why, because I am sending these headers, which are supposed to get past it (according to other Stack Overflow answers):

   headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Accept-Language": "fr-fr,en;q=0.5", "Accept-Encoding": "gzip, deflate", "DNT": "1", "Connection": "close", "Upgrade-Insecure-Requests": "1"}

Here is my function:


def check_web_wp(url):
    is_wordpress = False
    print(repr(url))
    headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Accept-Language": "fr-fr,en;q=0.5", "Accept-Encoding": "gzip, deflate", "DNT": "1", "Connection": "close", "Upgrade-Insecure-Requests": "1"}

    response = requests.get(url, headers=headers)

    with urllib.request.urlopen(url) as response:
        texte = response.read()
        poste_string = str(texte)
        splitted = poste_string.split()
    
        for word in splitted:
            if ("wordpress" in word):
                is_wordpress = True
                break
            
    return is_wordpress


def main():
    url = "https://icalendrier.fr/"
    is_wp = check_web_wp(url)

Did I miss something? Or is the website simply too well secured?

Thanks for your answers.

Your line with urllib.request.urlopen(url) as response: (without the headers) is overwriting your previous response object from response = requests.get(url, headers=headers) (with headers).
– iScripters, Commented Sep 13, 2021 at 15:31
Too late to edit my previous comment, but take a look at trinket.io/python3/f89a92b52d
– iScripters, Commented Sep 13, 2021 at 15:37
Thanks for your answer and for your link, I prefer your method! You can add your comment as an answer and I will accept it :)
– pgmendormi, Commented Sep 13, 2021 at 15:41

iScripters · Accepted Answer · 2021-09-13 16:08:53Z

(As requested, my comment as answer)

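Your line with urllib.request.urlopen(url) as response: (without the headers) is overwriting your previous response object from response = requests.get(url, headers=headers) (with headers). The urllib.error.HTTPError is therefore raised by the urlopen call, which sends urllib's default "Python-urllib/x.y" User-Agent instead of your browser-like one.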
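If you would rather stay with urllib, the headers have to travel with the urlopen call itself, which means wrapping the URL in a urllib.request.Request. The following is only a minimal sketch: build_request and fetch_html are hypothetical helper names, only the User-Agent from the question is kept, and Accept-Encoding is left out on purpose because urllib does not decompress gzip responses for you. Whether this particular site then answers with 200 is an assumption I have not verified.

```python
import urllib.request

# Only the User-Agent from the question; "Accept-Encoding: gzip" is omitted
# because urllib would hand back the compressed bytes unchanged.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) "
                  "Gecko/20100101 Firefox/66.0",
}


def build_request(url, headers=HEADERS):
    """Return a urllib Request object that carries the given headers."""
    return urllib.request.Request(url, headers=headers)


def fetch_html(url):
    """Fetch the page with the headers attached (performs a network call)."""
    request = build_request(url)
    with urllib.request.urlopen(request, timeout=10) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Passing the Request object (not the bare URL string) to urlopen is what makes the headers reach the server.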

Use requests only instead of urllib, like so:

import requests


def check_web_wp_fixed(url):
    is_wordpress = False
    # Browser-like headers, sent with the single request below.
    headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Accept-Language": "fr-fr,en;q=0.5", "Accept-Encoding": "gzip, deflate", "DNT": "1", "Connection": "close", "Upgrade-Insecure-Requests": "1"}

    response = requests.get(url, headers=headers)
    # response.text is the decoded body; split it on whitespace.
    splitted = response.text.split()

    for word in splitted:
        if ("wordpress" in word):
            is_wordpress = True
            break

    return is_wordpress

(I only made it work; I didn't check whether the code could be optimized in any way.)
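On the optimization point, one possible simplification: since "wordpress" contains no whitespace, looping over the split words gives the same result as a single substring test on the whole text. The sketch below uses a hypothetical helper name, looks_like_wordpress, and also lowercases the text first so that "WordPress" matches; that case-insensitivity is my assumption and a change from the code above, not something the question asked for.

```python
def looks_like_wordpress(html):
    """Return True if the page markup mentions WordPress.

    `html` is the decoded page body, e.g. `response.text` from requests.
    The comparison is case-insensitive (an assumption, see the lead-in).
    """
    return "wordpress" in html.lower()


# Intended use with the function above (not executed here):
#     response = requests.get(url, headers=headers)
#     is_wordpress = looks_like_wordpress(response.text)
```

Keeping the text check separate from the HTTP call also makes it easy to try the detection on a saved page without hitting the site again.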
