I built a web crawler with Scrapy and Django and put the CrawlerRunner code into a task queue. Locally everything works fine, but the tasks fail when they run on the server. I suspect multiple threads are causing the problem.
This is the task code. I'm using huey for the tasks:
from huey import crontab
from huey.contrib.djhuey import db_periodic_task, on_startup
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor
from apps.core.tasks import CRONTAB_PERIODS
from apps.scrapers.crawler1 import Crawler1
from apps.scrapers.crawler2 import Crawler2
from apps.scrapers.crawler3 import Crawler3
@on_startup(name="scrape_all__on_startup")
@db_periodic_task(crontab(**CRONTAB_PERIODS["every_10_minutes"]))
def scrape_all():
    configure_logging()
    settings = get_project_settings()
    runner = CrawlerRunner(settings=settings)
    runner.crawl(Crawler1)
    runner.crawl(Crawler2)
    runner.crawl(Crawler3)
    defer = runner.join()
    defer.addBoth(lambda _: reactor.stop())
    reactor.run()
This is the first error I get from sentry.io (it's truncated):
Unhandled Error
Traceback (most recent call last):
File "/home/deployer/env/lib/python3.10/site-packages/twisted/internet/base.py", line 501, in fireEvent
DeferredList(beforeResults).addCallback(self._continueFiring)
File "/home/deployer/env/lib/python3.10/site-packages/twisted/internet/defer.py", line 532, in addCallback
return self.addCallbacks(callback, callbackArgs=args, callbackKeywords=kwargs)
File "/home/deployer/env/lib/python3.10/site-packages/twisted/internet/defer.py", line 512, in addCallbacks
self._runCallbacks()
File "/home/deployer/env/lib/python3.10/site-packages/twisted/internet/defer.py", line 892, in _runCallbacks
current.result = callback( # type: ignore[misc]
--- <exception caught here> ---
File "/home/deployer/env/lib/python3.10/site-packages/twisted/internet/base.py", line 513, in _continueFiring
callable(*args, **kwargs)
File "/home/deployer/env/lib/python3.10/site-packages/twisted/internet/base.py", line 1314, in _reallyStartRunning
self._handle...
The task is set to run every 10 minutes. On the second run I get this error from sentry.io:
ReactorNotRestartable: null
File "huey/api.py", line 379, in _execute
task_value = task.execute()
File "huey/api.py", line 772, in execute
return func(*args, **kwargs)
File "huey/contrib/djhuey/__init__.py", line 135, in inner
return fn(*args, **kwargs)
File "apps/series/tasks.py", line 31, in scrape_all
reactor.run()
File "twisted/internet/base.py", line 1317, in run
self.startRunning(installSignalHandlers=installSignalHandlers)
File "twisted/internet/base.py", line 1299, in startRunning
ReactorBase.startRunning(cast(ReactorBase, self))
File "twisted/internet/base.py", line 843, in startRunning
raise error.ReactorNotRestartable()
My assumption is that on the first run the Twisted reactor did not shut down cleanly, so when huey tries to start a reactor again 10 minutes later, it fails.
I'm not experienced with multithreading, but I assume the task runner and Twisted are running on different threads and can't communicate with each other.
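One workaround I could imagine, assuming the cause really is that a reactor cannot be started twice in the same process, is to run each scrape in a fresh child interpreter so every run gets its own reactor. This is only a sketch: `run_in_fresh_interpreter` and `CRAWL_SNIPPET` are hypothetical names I made up for illustration, not part of huey, Scrapy, or Twisted, and the snippet is a placeholder for the real crawl code.

```python
import subprocess
import sys


def run_in_fresh_interpreter(code: str, timeout: float = 600.0) -> int:
    """Run a Python snippet in a new interpreter and return its exit code.

    Each call starts a separate process, so module-level state (such as a
    Twisted reactor) is created from scratch every time.
    """
    completed = subprocess.run(
        [sys.executable, "-c", code],
        timeout=timeout,  # assumption: 10 minutes is an acceptable upper bound
        check=False,      # report failures via the return code, do not raise
    )
    return completed.returncode


# Hypothetical stand-in for the real crawl. In the actual task this would be a
# snippet (or a script) that builds the CrawlerRunner and calls reactor.run().
CRAWL_SNIPPET = "print('crawl would run here')"

if __name__ == "__main__":
    # Two consecutive runs, each in its own process.
    for _ in range(2):
        exit_code = run_in_fresh_interpreter(CRAWL_SNIPPET)
        print("child exited with", exit_code)
```

The idea would be for the periodic task to call something like this instead of `reactor.run()` directly, so the huey worker process itself never starts a reactor. I have not verified that this fixes the first error.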
Any advice?