I'm trying to write a web scraping script that tells me whether a website runs WordPress or not, but I get this error:

urllib.error.HTTPError: HTTP Error 403: Forbidden

I don't understand why, because I use these headers, which are supposed to get past it (according to other Stack Overflow answers):

   headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Accept-Language": "fr-fr,en;q=0.5", "Accept-Encoding": "gzip, deflate", "DNT": "1", "Connection": "close", "Upgrade-Insecure-Requests": "1"}

Here is my function:


def check_web_wp(url):
    is_wordpress = False
    print(repr(url))
    headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Accept-Language": "fr-fr,en;q=0.5", "Accept-Encoding": "gzip, deflate", "DNT": "1", "Connection": "close", "Upgrade-Insecure-Requests": "1"}

    response = requests.get(url, headers=headers)

    with urllib.request.urlopen(url) as response:
        texte = response.read()
        poste_string = str(texte)
        splitted = poste_string.split()
    
        for word in splitted:
            if ("wordpress" in word):
                is_wordpress = True
                break
            
    return is_wordpress


def main():
    url = "https://icalendrier.fr/"
    is_wp = check_web_wp(url)

Did I miss something? Or is the website just too "secured"?

Thanks for your answers.

  • Your line with urllib.request.urlopen(url) as response: (without the headers) is overwriting your previous response object from response = requests.get(url, headers=headers) (with headers). Commented Sep 13, 2021 at 15:31
  • Too late to edit my previous comment, but take a look at trinket.io/python3/f89a92b52d Commented Sep 13, 2021 at 15:37
  • Thanks for your answer and for your link, I prefer your method! You can post your comment as an answer and I will accept it :) Commented Sep 13, 2021 at 15:41
  • Posted as answer! :) Commented Sep 13, 2021 at 16:09

1 Answer

(As requested, my comment as answer)

Your line with urllib.request.urlopen(url) as response: (without the headers) is overwriting your previous response object from response = requests.get(url, headers=headers) (with headers).
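To illustrate why the second call is the one that gets rejected, here is a small sketch that makes no network request. It assumes the 403 is triggered by the default urllib User-Agent, which is a plausible cause but not something confirmed for this site. It also shows that urllib only sends custom headers when the URL is wrapped in a Request object; the shortened headers dict is just for the example.

```python
import urllib.request

url = "https://icalendrier.fr/"
# Shortened version of the headers from the question, for illustration only
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0"}

# urlopen(url) with a bare string knows nothing about the headers dict;
# it falls back to the default opener's headers.
default_headers = dict(urllib.request.build_opener().addheaders)
print(default_headers)  # the User-agent value starts with "Python-urllib/"

# To send custom headers with urllib, wrap the URL in a Request object.
request = urllib.request.Request(url, headers=headers)
print(request.get_header("User-agent"))  # urllib stores the key as "User-agent"

# The request would then be sent with: urllib.request.urlopen(request)
```

Some servers refuse the "Python-urllib/..." User-Agent, which would explain the 403 even though the earlier requests.get call was built with browser-like headers.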

Use requests only instead of urllib, like so:

import requests

def check_web_wp_fixed(url):
    is_wordpress = False
    headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Accept-Language": "fr-fr,en;q=0.5", "Accept-Encoding": "gzip, deflate", "DNT": "1", "Connection": "close", "Upgrade-Insecure-Requests": "1"}

    # Fetch the page once, sending the browser-like headers
    response = requests.get(url, headers=headers)
    splitted = response.text.split()

    # Look for "wordpress" in any whitespace-separated token
    for word in splitted:
        if "wordpress" in word:
            is_wordpress = True
            break

    return is_wordpress

(I only made it work; I didn't check whether the code could be optimized in any way.)
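As one possible tightening, not a definitive version, the split-and-loop can be replaced by a single substring test, since "wordpress" contains no whitespace. The sketch below also separates the check from the download so it can be tried without network access. The names looks_like_wordpress and check_web_wp_sketch, the timeout value, and the case-insensitive match are my own assumptions and are not part of the original code.

```python
def looks_like_wordpress(html):
    """Return True if the page source mentions "wordpress" (case-insensitive)."""
    # Assumption: ignoring case is acceptable; the original check was case-sensitive.
    return "wordpress" in html.lower()


def check_web_wp_sketch(url, headers=None, timeout=10):
    """Fetch url with requests and apply the substring heuristic."""
    import requests  # third-party; imported here so the helper above works without it

    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx, such as 403
    return looks_like_wordpress(response.text)


# The heuristic can be tried on plain strings, without any request:
print(looks_like_wordpress('<meta name="generator" content="WordPress 5.8">'))  # True
print(looks_like_wordpress("<html><body>Hello</body></html>"))  # False
```

Calling raise_for_status() makes a 403 surface as an exception instead of being silently scanned as if it were the real page.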
