I'm running a Scrapy spider with a SeleniumBase downloader middleware, and for some reason it has started crawling chrome-extension URLs. I'm scraping https://www.atptour.com, and at no point does my spider request anything other than pages from that site.
Here is the relevant part of the log:
2024-10-21 17:43:47: [INFO] Spider opened
2024-10-21 17:43:47: [INFO] Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-10-21 17:43:47: [INFO] Telnet console listening on 127.0.0.1:6027
2024-10-21 17:43:50: [DEBUG] Started executable: `/Users/philipjoss/miniconda3/envs/capra_production/lib/python3.11/site-packages/seleniumbase/drivers/uc_driver` in a child process with pid: 22177 using 0 to output -1
2024-10-21 17:43:51: [DEBUG] Crawled (200) <GET https://www.atptour.com/en/-/tournaments/calendar/tour> (referer: None)
2024-10-21 17:43:54: [DEBUG] Started executable: `/Users/philipjoss/miniconda3/envs/capra_production/lib/python3.11/site-packages/seleniumbase/drivers/uc_driver` in a child process with pid: 22180 using 0 to output -1
2024-10-21 17:43:55: [DEBUG] Crawled (200) <GET https://www.atptour.com/en/-/tournaments/calendar/challenger> (referer: None)
2024-10-21 17:43:55: [DEBUG] Crawled (200) <GET chrome-extension://neajdppkdcdipfabeoofebfddakdcjhd/audio.html> (referer: chrome-extension://neajdppkdcdipfabeoofebfddakdcjhd/audio.html)
There are two successful responses for pages I actually requested, and then a chrome-extension URL suddenly appears. What's also weird is that its referer is listed as that same chrome-extension address, which has never been requested anywhere.
To make things more interesting, I've run the code on another machine with the same package versions (scrapy 2.11.2 and seleniumbase 4.28.5) and it ran fine there.
This is the spider:
from scrapy import Request, Spider
from scrapy.http.response.html import HtmlResponse


class Production(Spider):
    name = "atp_production"
    start_urls = [
        "https://www.atptour.com/en/-/tournaments/calendar/tour",
        "https://www.atptour.com/en/-/tournaments/calendar/challenger",
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield Request(
                url=url,
                callback=self._parse_calendar,
                meta=dict(dont_redirect=True),
            )

    def _parse_calendar(self, response: HtmlResponse):
        json_str = response.xpath("//body//text()").get()
And this is the middleware:
from typing import Any

import seleniumbase as sb
from scrapy import Request, Spider, signals
from scrapy.crawler import Crawler
from scrapy.http.response.html import HtmlResponse


class SeleniumBase:
    @classmethod
    def from_crawler(cls, crawler: Crawler):
        middleware = cls(crawler.settings)
        crawler.signals.connect(middleware.spider_closed, signals.spider_closed)
        return middleware

    def __init__(self, settings: dict[str, Any]) -> None:
        # One driver instance is shared by every request the spider makes
        self.driver = sb.Driver(
            uc=settings.get("UNDETECTABLE", None),
            headless=settings.get("HEADLESS", None),
            user_data_dir=settings.get("USER_DATA_DIR", None),
        )

    def spider_closed(self, *_) -> None:
        self.driver.quit()

    def process_request(self, request: Request, spider: Spider) -> HtmlResponse:
        # Render the page in the browser and hand the source back to Scrapy
        self.driver.get(request.url)
        return HtmlResponse(
            self.driver.current_url,
            body=self.driver.page_source,
            encoding="utf-8",
            request=request,
        )
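For reference, the middleware is enabled through DOWNLOADER_MIDDLEWARES and reads a few custom settings; the module path, priority, and values below are illustrative rather than my exact config:

# settings.py (sketch: module path, priority and values are illustrative)
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.SeleniumBase": 543,  # hypothetical module path
}

# Custom settings consumed by SeleniumBase.__init__
UNDETECTABLE = True   # forwarded to sb.Driver(uc=...)
HEADLESS = True       # forwarded to sb.Driver(headless=...)
USER_DATA_DIR = None  # forwarded to sb.Driver(user_data_dir=...)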
Any ideas on what might be happening?
Update:
Scrapy seems to have gone completely haywire now: for the downstream parsing methods (which aren't in the MRE above), it's failing to deliver responses to the correct callbacks about 95% of the time. I can't really add that logic to the MRE as it's very complex and SO would complain that I have too much code in the question. Suffice it to say I've triple-checked everything, and besides, it all runs fine on my other machine, so the callback references are definitely correct.
I've gone nuclear and reinstalled scrapy and seleniumbase, but that hasn't solved the issue :(
Update 2:
So I've been poking around some more, and a partial diagnosis is that the website is redirecting to a different address while Scrapy still reports the response as a 200. Since the redirect happens inside the browser and the middleware always returns a 200 HtmlResponse, the dont_redirect instruction (which only acts on 3xx responses from Scrapy's downloader) presumably never comes into play. Interestingly, simply re-issuing the request always returns a non-redirected response. This let me put together a solution (which I'll post in a bit) that works for this case, but I'd still be interested if anyone has a better explanation for what's going on!
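For anyone curious, the workaround boils down to comparing the URL the driver ends up on with the URL that was requested, and re-issuing the request when they differ. Below is a rough sketch of that idea inside process_request, not the exact code I'll post; the retry cap and the selenium_redirect_retries meta key are just illustrative names:

def process_request(self, request: Request, spider: Spider):
    self.driver.get(request.url)
    # If the browser was redirected somewhere else, hand a fresh copy of the
    # request back to Scrapy's scheduler instead of returning the wrong page.
    if self.driver.current_url != request.url:
        retries = request.meta.get("selenium_redirect_retries", 0)
        if retries < 3:
            retry = request.replace(dont_filter=True)  # bypass the dupe filter
            retry.meta["selenium_redirect_retries"] = retries + 1
            return retry  # returning a Request makes Scrapy reschedule it
        spider.logger.warning("Gave up retrying %s", request.url)
    return HtmlResponse(
        self.driver.current_url,
        body=self.driver.page_source,
        encoding="utf-8",
        request=request,
    )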