I have a Scrapy project and I want to run my spider every day, so I use Celery to schedule it. This is my tasks.py file:

from celery import Celery, shared_task
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy_project.scrapy_project.spiders import myspider

app = Celery('tasks', broker='redis://localhost:6379/0')

@shared_task
def scrape_news_website():
    print('SCRAPING RIGHT NOW!')
    settings = get_project_settings()
    process = CrawlerProcess(settings)
    process.crawl(myspider)
    process.start(stop_after_crawl=False)

I set stop_after_crawl=False because when it is True, I get this error after the first scrape:

raise error.ReactorNotRestartable() 
twisted.internet.error.ReactorNotRestartable

With stop_after_crawl set to False, a different problem shows up: after four scrapes (four because the worker concurrency is four), the Celery worker stops executing tasks. The previous crawler processes are still running, so there is no free worker child process left. I don't know how to fix this and would appreciate your help.

1 Answer

Both problems come from how Scrapy uses Twisted: a reactor can be started only once per process. CrawlerProcess.start() starts the reactor and, by default, stops it when the crawl finishes, so the second task that runs in the same Celery worker child process raises ReactorNotRestartable. With stop_after_crawl=False the reactor is never stopped, which avoids the error but also means process.start() never returns and the task never finishes.

Try this variant, which goes back to the default stop_after_crawl=True so that the task returns when the crawl is done:

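from celery import shared_task
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy_project.scrapy_project.spiders import myspider

def run_spider():
    # Load the project's settings and build a crawler process for this run.
    settings = get_project_settings()
    process = CrawlerProcess(settings)
    process.crawl(myspider)
    # Blocks until the crawl finishes, then stops the reactor
    # (stop_after_crawl defaults to True).
    process.start()

@shared_task
def scrape_news_website():
    print('SCRAPING RIGHT NOW!')
    run_spider()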

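Regarding the worker that stops picking up tasks after four scrapes: with stop_after_crawl=False each task blocks its child process forever, so with a concurrency of four the pool is exhausted after four runs. Once the task returns normally, you still need a fresh process for every crawl, because the reactor cannot be restarted. With the default prefork pool you can get that by starting the worker with --max-tasks-per-child=1 (or setting worker_max_tasks_per_child = 1), which replaces each child process after it has run one task.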
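Another way to give every crawl its own reactor is to launch it in a separate OS process through the scrapy crawl command line instead of calling CrawlerProcess inside the worker. The following is only a minimal sketch under some assumptions: build_crawl_command and run_crawl are hypothetical helper names, the spider is assumed to be registered under the name "myspider", and project_dir is assumed to be the directory that contains scrapy.cfg.

```python
import subprocess


def build_crawl_command(spider_name):
    """Return the argument list for the `scrapy crawl` command."""
    return ["scrapy", "crawl", spider_name]


def run_crawl(command, project_dir, timeout=3600):
    """Run one crawl in a fresh OS process and return its exit code.

    Every call starts a new process, so the Twisted reactor is started
    exactly once per crawl. The call blocks until the child exits and
    raises subprocess.TimeoutExpired if `timeout` seconds pass first.
    """
    completed = subprocess.run(command, cwd=project_dir, timeout=timeout)
    return completed.returncode
```

Inside the Celery task you would then call something like run_crawl(build_crawl_command("myspider"), "/path/to/project") in place of run_spider() (the path is a placeholder). The worker child process only waits for the subprocess, so it is free again as soon as the crawl exits.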
