
I built a web crawler with Scrapy and Django and put the CrawlerRunner code into a task queue. Locally everything works fine, but the tasks fail when run on the server. I suspect multiple threads are causing the problem.

This is the task code; I'm using huey for the tasks:

from huey import crontab
from huey.contrib.djhuey import db_periodic_task, on_startup
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor

from apps.core.tasks import CRONTAB_PERIODS
from apps.scrapers.crawler1 import Crawler1
from apps.scrapers.crawler2 import Crawler2
from apps.scrapers.crawler3 import Crawler3



@on_startup(name="scrape_all__on_startup")
@db_periodic_task(crontab(**CRONTAB_PERIODS["every_10_minutes"]))
def scrape_all():
    configure_logging()
    settings = get_project_settings()

    runner = CrawlerRunner(settings=settings)

    runner.crawl(Crawler1)
    runner.crawl(Crawler2)
    runner.crawl(Crawler3)

    defer = runner.join()
    defer.addBoth(lambda _: reactor.stop())

    reactor.run()

and this is the first error I get from sentry.io (it's truncated):

Unhandled Error
Traceback (most recent call last):
  File "/home/deployer/env/lib/python3.10/site-packages/twisted/internet/base.py", line 501, in fireEvent
    DeferredList(beforeResults).addCallback(self._continueFiring)
  File "/home/deployer/env/lib/python3.10/site-packages/twisted/internet/defer.py", line 532, in addCallback
    return self.addCallbacks(callback, callbackArgs=args, callbackKeywords=kwargs)
  File "/home/deployer/env/lib/python3.10/site-packages/twisted/internet/defer.py", line 512, in addCallbacks
    self._runCallbacks()
  File "/home/deployer/env/lib/python3.10/site-packages/twisted/internet/defer.py", line 892, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
--- <exception caught here> ---
  File "/home/deployer/env/lib/python3.10/site-packages/twisted/internet/base.py", line 513, in _continueFiring
    callable(*args, **kwargs)
  File "/home/deployer/env/lib/python3.10/site-packages/twisted/internet/base.py", line 1314, in _reallyStartRunning
    self._handle...

The task is set to run every 10 minutes; on the second run I get this error from sentry.io:

ReactorNotRestartable: null
  File "huey/api.py", line 379, in _execute
    task_value = task.execute()
  File "huey/api.py", line 772, in execute
    return func(*args, **kwargs)
  File "huey/contrib/djhuey/__init__.py", line 135, in inner
    return fn(*args, **kwargs)
  File "apps/series/tasks.py", line 31, in scrape_all
    reactor.run()
  File "twisted/internet/base.py", line 1317, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "twisted/internet/base.py", line 1299, in startRunning
    ReactorBase.startRunning(cast(ReactorBase, self))
  File "twisted/internet/base.py", line 843, in startRunning
    raise error.ReactorNotRestartable()

My assumption: on the first run the Twisted reactor never shut itself down, so 10 minutes later, when huey tries to start a reactor again, it fails.

I'm not proficient with multi-threading, but I assume the task runner and Twisted run on different threads and can't communicate with each other.

Any advice?

  • It actually has to do with processes, not threads. The Twisted reactor cannot be restarted within the same process. A solution would be to run your spiders in a subprocess. – Commented Nov 14, 2022 at 22:05

