
I'm currently running a Scrapy spider with a SeleniumBase downloader middleware, and for some reason it is scraping chrome-extension URLs. I'm scraping the https://www.atptour.com website, and at no point does my scraper request anything other than pages from that site.

Here's a log of what's happening:

2024-10-21 17:43:47: [INFO] Spider opened
2024-10-21 17:43:47: [INFO] Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-10-21 17:43:47: [INFO] Telnet console listening on 127.0.0.1:6027
2024-10-21 17:43:50: [DEBUG] Started executable: `/Users/philipjoss/miniconda3/envs/capra_production/lib/python3.11/site-packages/seleniumbase/drivers/uc_driver` in a child process with pid: 22177 using 0 to output -1
2024-10-21 17:43:51: [DEBUG] Crawled (200) <GET https://www.atptour.com/en/-/tournaments/calendar/tour> (referer: None)
2024-10-21 17:43:54: [DEBUG] Started executable: `/Users/philipjoss/miniconda3/envs/capra_production/lib/python3.11/site-packages/seleniumbase/drivers/uc_driver` in a child process with pid: 22180 using 0 to output -1
2024-10-21 17:43:55: [DEBUG] Crawled (200) <GET https://www.atptour.com/en/-/tournaments/calendar/challenger> (referer: None)
2024-10-21 17:43:55: [DEBUG] Crawled (200) <GET chrome-extension://neajdppkdcdipfabeoofebfddakdcjhd/audio.html> (referer: chrome-extension://neajdppkdcdipfabeoofebfddakdcjhd/audio.html)

There are two successful responses from pages I actually requested, and then suddenly a chrome-extension URL appears. What's also weird is that the referer is listed as that same chrome-extension address, which was never requested at any earlier point.

To make things more interesting, I've run the code on another machine with the same package versions (Scrapy 2.11.2 and SeleniumBase 4.28.5) and it ran fine there.

This is the spider:

from scrapy import Request, Spider
from scrapy.http.response.html import HtmlResponse


class Production(Spider):

    name = "atp_production"

    start_urls = [
        "https://www.atptour.com/en/-/tournaments/calendar/tour",
        "https://www.atptour.com/en/-/tournaments/calendar/challenger",
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield Request(
                url=url,
                callback=self._parse_calendar,
                meta=dict(dont_redirect=True),
            )

    def _parse_calendar(self, response: HtmlResponse):
        json_str = response.xpath("//body//text()").get()
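
For context, these calendar endpoints appear to serve raw JSON as the page body (hence json_str), so the callback presumably continues roughly like this. This is a sketch only: the real downstream parsing is omitted from the MRE, and the yielded item shape is hypothetical:

import json

def _parse_calendar(self, response: HtmlResponse):
    json_str = response.xpath("//body//text()").get()
    data = json.loads(json_str)  # the body text is a single JSON document
    yield {"url": response.url, "calendar": data}  # hypothetical item shape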

And this is the middleware:

import seleniumbase as sb
from scrapy import Request, Spider, signals
from scrapy.crawler import Crawler
from scrapy.http.response.html import HtmlResponse
from scrapy.settings import Settings


class SeleniumBase:
    @classmethod
    def from_crawler(cls, crawler: Crawler):
        middleware = cls(crawler.settings)
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)

        return middleware

    def __init__(self, settings: Settings) -> None:
        # Launch a single SeleniumBase driver for the lifetime of the crawl
        self.driver = sb.Driver(
            uc=settings.get("UNDETECTABLE", None),
            headless=settings.get("HEADLESS", None),
            user_data_dir=settings.get("USER_DATA_DIR", None),
        )

    def spider_closed(self, *_) -> None:
        self.driver.quit()

    def process_request(self, request: Request, spider: Spider) -> HtmlResponse:
        # Fetch with the driver instead of Scrapy's downloader and wrap the
        # rendered page in an HtmlResponse
        self.driver.get(request.url)

        return HtmlResponse(
            self.driver.current_url,
            body=self.driver.page_source,
            encoding="utf-8",
            request=request,
        )

Any ideas on what might be happening?


Update:

Scrapy seems to have gone completely haywire now. It's not sending responses to the correct callbacks for the downstream parsing methods (which aren't in the MRE above) about 95% of the time. I can't really add that logic to the MRE as it's very complex and SO will complain that I have too much code in my question. Suffice it to say I've triple-checked everything, and besides, it all runs fine on my other machine, so the references are definitely all correct.

I've gone nuclear and reinstalled scrapy and seleniumbase, but that hasn't solved the issue :(

Update 2:

So I've been poking around some more, and a partial diagnosis is that the website is redirecting to a different address while the response status reported by Scrapy is still 200. I'm assuming this bypasses the dont_redirect instruction. Interestingly, simply re-yielding the same request always returns a non-redirected response. This allowed me to come up with a solution (which I'll post in a bit) that works for this case, but I'd still be interested if anyone has a better explanation for what might be going on!
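
To show what I mean: the missed redirect can be detected by comparing response.url against response.request.url, and the original request re-yielded. A minimal sketch of that check at the callback level (illustrative only, not my production code):

def _parse_calendar(self, response: HtmlResponse):
    if response.url != response.request.url:
        # 200 status, but the response came from a different address than
        # requested: re-yield the original request (dont_filter gets it past
        # the dupe filter)
        yield response.request.replace(dont_filter=True)
        return
    json_str = response.xpath("//body//text()").get()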

Comments:

  • First use Firebug to examine the calls and look for anything that might indicate a redirection; also look for bot detection. If you don't see any calls to redirect or some sort of cookie/session token, the next step is to try another user agent, and then adjust your TLS ClientHello to accept SSL 3.0 and TLS 1.* (note this is insecure). Typically when you get error codes or redirects, either you're not loading the page properly or you're missing some headers. Note that not all sites are easy to spider and you may have to pull the data out of the JS. Let me know how this works; if not, I'll try myself. Commented Oct 23, 2024 at 15:31
  • @claykom - thanks man. I'm going to add an update and a solution in a sec. The oddest thing is that simply re-yielding the request returns a non-redirected response. So far this is happening 100% of the time. Add to this that the exact same code and environment on another machine never redirects, and I'm a little stumped as to what's going on... Commented Oct 23, 2024 at 16:37
  • I just had another idea: I've just encountered a site that redirects when you don't have the correct cookies. Also check the site with a browser tracer that shows HTTP and look for any other discrepancy. Since it's working on another machine it may not be the case, but something is going on in the HTTP layer that isn't giving you a good response, so I'd start there. Commented Oct 23, 2024 at 19:47

1 Answer

Update:

This issue was solved when I upgraded to seleniumbase 4.32.5.


Original answer:

As per update 2 in my question, I was able to come up with a solution for my specific case. I wrote a downloader middleware that checks the responses coming out of the SeleniumBase middleware and retries whenever the response URL doesn't match the request URL:

from scrapy import Request, Spider
from scrapy.http.response.html import HtmlResponse


class TooManyRedirectsException(Exception):
    """Raised when a request has been redirected past the retry limit"""


class RetryMissedRedirectMiddleware:
    """Retry responses from redirected URLs

    Redirected URLs can still have response status codes of 200, which means
    they bypass the regular 'dont_redirect' filter

    """

    def process_response(
        self,
        request: Request,
        response: HtmlResponse,
        spider: Spider,
    ) -> HtmlResponse | Request:
        if (
            response.status == 200
            and response.url != request.url
            and request.meta.get("dont_redirect", False)
        ):
            max_retries = request.meta.get("max_redirect_retries", None)

            if max_retries is None:
                raise ValueError(
                    "max_redirect_retries must be set in request.meta when "
                    "dont_redirect is True"
                )

            # Record every address we were redirected to; meta persists
            # across retries because the same request object is re-queued
            redirects = request.meta.setdefault("redirect_urls_seen", [])
            redirects.append(response.url)

            redirect_retries_attempted = request.meta.get(
                "redirect_retries_attempted", 0
            )

            if redirect_retries_attempted >= max_retries:
                raise TooManyRedirectsException(
                    f"Redirected from {request.url} {max_retries} times which is "
                    f"the max allowed:\n" + "\n".join(redirects)
                )

            redirect_retries_attempted += 1

            request.meta["redirect_retries_attempted"] = redirect_retries_attempted

            # Bypass the dupe filter so the identical request can be re-queued
            request.dont_filter = True

            spider.logger.warning(
                f"Redirected from {request.url} to {response.url} - this is "
                f"retry attempt {redirect_retries_attempted} of {max_retries}"
            )

            return request

        spider.logger.debug(
            f"Response from {response.url} not the result of redirection"
        )

        return response
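
If you want to sanity-check the middleware outside a full crawl, something like this works (a throwaway sketch with a dummy spider, not part of my project):

from scrapy import Request, Spider
from scrapy.http import HtmlResponse


class DummySpider(Spider):
    name = "dummy"


middleware = RetryMissedRedirectMiddleware()
request = Request(
    "https://www.atptour.com/en/-/tournaments/calendar/tour",
    meta=dict(dont_redirect=True, max_redirect_retries=3),
)
# Simulate a 200 response whose URL differs from the one requested
response = HtmlResponse(
    "https://www.atptour.com/somewhere-else",
    request=request,
    body=b"",
)
result = middleware.process_response(request, response, DummySpider())
assert result is request  # handed back the request for a retry
assert request.meta["redirect_retries_attempted"] == 1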

You'll need to add both middlewares to your settings (the import paths here match my project layout; adjust for yours). Here are mine:

DOWNLOADER_MIDDLEWARES = {
    "utilities.scraping.scrapy.middlewares.downloader.SeleniumBase": 400,
    "utilities.scraping.scrapy.middlewares.downloader.RetryMissedRedirectMiddleware": 500
}

And then make sure you yield your Requests like this:

yield Request(
    url="https://www.your-url.com",
    callback=self.your_parse_method,
    meta=dict(dont_redirect=True, max_redirect_retries=3),
)