
I am trying to scrape a certain website, let's call it "https://some-website.com". For the past few months I was able to do this without problems, but a few days ago I noticed that the scraper no longer works: all requests return a 403 Forbidden status.

For the last 3 months I have been using the code below to scrape the data.

import requests
from fake_useragent import UserAgent

res = requests.get(<url>, headers={'User-Agent': UserAgent().random})

This always returned a 200 OK with the page I needed, until a few days ago, when I started getting a 403 Forbidden error. Somewhere in the response text I can spot the sentence "Enable JavaScript and cookies to continue".

User-Agent issue

As you can see in the code, I already randomize the User-Agent header, which is the usual recommendation for fixing this kind of problem.

IP issue

Naturally, I suspected they had blacklisted my IP (maybe in combination with certain user agents) and were no longer letting me scrape. However, I switched to using a proxy and I still get a 403.

import requests
from fake_useragent import UserAgent

proxies = {
    "https": 'http://some_legit_proxy',
    "http": 'http://some_legit_proxy',
}

res = requests.get(<url>, headers={'User-Agent': UserAgent().random}, proxies=proxies)

The proxy is a residential proxy.

Basic attempt actually works

What baffles me the most is that if I remove the random User-Agent header and use the default requests User-Agent, the scrape suddenly works.

import requests

res = requests.get(<url>) # 'User-Agent': 'python-requests/2.28.1'
# 200 OK

This tells me the website hasn't suddenly started requiring JavaScript, since the scrape does work; it seems they are somehow blocking me specifically when I spoof the User-Agent.
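To illustrate, here is a minimal side-by-side check of the two behaviors described above (assuming the placeholder URL is replaced with the real target):

import requests
from fake_useragent import UserAgent

url = "https://some-website.com"  # placeholder for the real target

# Default python-requests User-Agent -> currently returns 200 OK
print(requests.get(url).status_code)

# Randomized browser User-Agent -> currently returns 403 Forbidden
print(requests.get(url, headers={'User-Agent': UserAgent().random}).status_code)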

I have a few ideas for working around this, but since I don't understand how this is happening, I can't be sure any of them will keep working in the future.

Please help me understand what is happening here.

  • There's a lot of possible reasons. They might be using a WAF provider like Cloudflare to block any requests coming from bots, or they might have a JavaScript challenge that has to be solved before you get access to the webpage. But given that the default user agent works, it's probably TLS fingerprinting. Commented Oct 12, 2022 at 23:56
  • Looks like they caught on to your actions. Sites generally don't appreciate you scraping their content. Commented Oct 13, 2022 at 0:05
  • @SalmanFarsi. Thanks for the quick response. I haven't heard of TLS fingerprinting. Is there any go-to action which can be taken to bypass it? Commented Oct 13, 2022 at 0:05
  • I'd recommend taking a look at github.com/VeNoMouS/cloudscraper Commented Oct 13, 2022 at 3:54
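For reference, the cloudscraper package suggested in the last comment is typically used as a drop-in replacement for a requests session; whether it gets past this particular block is not guaranteed, so treat the following as a sketch:

import cloudscraper

# cloudscraper wraps a requests session and tries to solve
# Cloudflare's anti-bot / JavaScript challenges transparently
scraper = cloudscraper.create_scraper()
res = scraper.get("https://some-website.com")  # placeholder for the real target
print(res.status_code)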

2 Answers


The site in question sits behind Cloudflare. Cloudflare does things like TLS fingerprinting at the edge, which can determine that the User-Agent you've provided doesn't match the TLS fingerprint produced by Python's requests module. This is a common technique used by such providers as a means of bot deterrence. I'd recommend first trying to scrape without spoofing the user agent, and if you still have trouble, consider a modern browser-automation platform such as Puppeteer.
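If you do want to keep sending a browser User-Agent, one approach (not covered in this thread, so treat the exact API as an assumption) is a client that also impersonates a browser's TLS handshake, for example the curl_cffi package:

from curl_cffi import requests as curl_requests

# Sends the request with a Chrome-like TLS/JA3 fingerprint so the handshake
# matches the browser identity being claimed; depending on the installed
# version, a pinned target such as "chrome110" may be required instead
res = curl_requests.get("https://some-website.com", impersonate="chrome")  # placeholder URL
print(res.status_code)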

Good luck friend. :)


2 Comments

Do you know if the Python wrapper of Puppeteer is any good? pypi.org/project/pyppeteer
@DominikSajovic I haven't used it. It seems very similar to the NodeJS version though. If you're more comfortable with Python, and you are familiar with async programming, go for it!
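For reference, a minimal pyppeteer sketch of that approach (assuming the package is installed and can download its bundled Chromium) might look like this:

import asyncio
from pyppeteer import launch

async def fetch(url):
    # Launch headless Chromium, load the page so JavaScript and cookies
    # run as in a real browser, then return the rendered HTML
    browser = await launch(headless=True)
    page = await browser.newPage()
    await page.goto(url)
    html = await page.content()
    await browser.close()
    return html

html = asyncio.run(fetch("https://some-website.com"))  # placeholder for the real target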

As @h0r53 mentioned, I think Cloudflare detects whether the request comes from a real browser that runs JavaScript.

You could try using this answer

