
I am trying to scrape a certain website, let's call it "https://some-website.com". For the past few months I was able to do this without problems, but a few days ago I noticed that the scraper no longer works: all requests return a 403 Forbidden status.

For the last 3 months I have been using the code below to scrape the data.

import requests
from fake_useragent import UserAgent

res = requests.get(<url>, headers={'User-Agent': UserAgent().random})

This always returned a 200 OK with the page I needed, until a few days ago, when I started getting a 403 Forbidden error. Somewhere in the response text I can spot the sentence "Enable JavaScript and cookies to continue".

User-Agent issue

As you can see in the code, I already randomize the User-Agent header, which is the usual recommendation for fixing this kind of problem.

IP issue

Naturally, I suspected they had blacklisted my IP (maybe in combination with certain user agents) and were no longer letting me scrape. However, I switched to using a proxy and I still get a 403.

import requests
from fake_useragent import UserAgent

proxies = {
    "https": 'http://some_legit_proxy',
    "http": 'http://some_legit_proxy',
}

res = requests.get(<url>, headers={'User-Agent': UserAgent().random}, proxies=proxies)

The proxy is a residential proxy.

Basic attempt actually works

What baffles me the most is that if I remove the random User-Agent header and use the default requests User-Agent, the scrape suddenly works.

import requests

res = requests.get(<url>) # 'User-Agent': 'python-requests/2.28.1'
# 200 OK

This tells me the website hasn't suddenly started requiring JavaScript, since the scrape does work; it seems they are somehow blocking me specifically when I spoof the User-Agent.
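To illustrate, here is a minimal side-by-side check of the two behaviors described above (assuming the placeholder URL is replaced with the real target):

import requests
from fake_useragent import UserAgent

url = "https://some-website.com"  # placeholder for the real target

# Default python-requests User-Agent -> currently returns 200 OK
print(requests.get(url).status_code)

# Randomized browser User-Agent -> currently returns 403 Forbidden
print(requests.get(url, headers={'User-Agent': UserAgent().random}).status_code)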

I have a few ideas for working around this, but since I don't understand how this is happening, I can't be sure any of them will keep working in the future.

Please help me understand what is happening here.

  • There's a lot of possible reasons. They might be using a WAF provider like Cloudflare to block any requests coming from bots, or they might have a JavaScript challenge that has to be solved before you get access to the webpage. But given that the default user agent works, it's probably TLS fingerprinting. Commented Oct 12, 2022 at 23:56
  • Looks like they caught on to your actions. Sites generally don't appreciate you scraping their content. Commented Oct 13, 2022 at 0:05
  • @SalmanFarsi. Thanks for the quick response. I haven't heard of TLS fingerprinting. Is there any go-to action which can be taken to bypass it? Commented Oct 13, 2022 at 0:05
  • I'd recommend taking a look at github.com/VeNoMouS/cloudscraper Commented Oct 13, 2022 at 3:54
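For reference, the cloudscraper package suggested in the last comment is typically used as a drop-in replacement for a requests session; whether it gets past this particular block is not guaranteed, so treat the following as a sketch:

import cloudscraper

# cloudscraper wraps a requests session and tries to solve
# Cloudflare's anti-bot / JavaScript challenges transparently
scraper = cloudscraper.create_scraper()
res = scraper.get("https://some-website.com")  # placeholder for the real target
print(res.status_code)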

2 Answers


The site in question sits behind Cloudflare. Cloudflare does things like TLS fingerprinting at the edge, which can determine that the User-Agent you've provided doesn't match the TLS fingerprint produced by Python's requests module. This is a common technique used by such providers as a means of bot deterrence. I'd recommend first trying to scrape without spoofing the user agent, and if you still have trouble, consider a modern browser-automation platform such as Puppeteer.
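If you do want to keep sending a browser User-Agent, one approach (not covered in this thread, so treat the exact API as an assumption) is a client that also impersonates a browser's TLS handshake, for example the curl_cffi package:

from curl_cffi import requests as curl_requests

# Sends the request with a Chrome-like TLS/JA3 fingerprint so the handshake
# matches the browser identity being claimed; depending on the installed
# version, a pinned target such as "chrome110" may be required instead
res = curl_requests.get("https://some-website.com", impersonate="chrome")  # placeholder URL
print(res.status_code)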

Good luck friend. :)


2 Comments

Do you know if the Python wrapper of Puppeteer is any good? pypi.org/project/pyppeteer
@DominikSajovic I haven't used it. It seems very similar to the NodeJS version though. If you're more comfortable with Python, and you are familiar with async programming, go for it!
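For reference, a minimal pyppeteer sketch of that approach (assuming the package is installed and can download its bundled Chromium) might look like this:

import asyncio
from pyppeteer import launch

async def fetch(url):
    # Launch headless Chromium, load the page so JavaScript and cookies
    # run as in a real browser, then return the rendered HTML
    browser = await launch(headless=True)
    page = await browser.newPage()
    await page.goto(url)
    html = await page.content()
    await browser.close()
    return html

html = asyncio.run(fetch("https://some-website.com"))  # placeholder for the real target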

As @h0r53 mentioned, I think Cloudflare detects whether the request comes from a real browser that runs JavaScript.

You could try using this answer

