My Scrapy script seems to work just fine when I run it in 'one-off' scenarios from the command line, but if I try to run the code twice in the same Python session I get this error:

"ReactorNotRestartable"

Why?

The offending code (last line throws the error):

crawler = CrawlerProcess(settings)
crawler.install()
crawler.configure()

# schedule spider
#crawler.crawl(MySpider())
spider = MySpider()
crawler.queue.append_spider(spider)

# start engine scrapy/twisted
crawler.start()

4 Answers

Close to Joël's answer, but I want to elaborate a bit more than is possible in the comments. If you look at the Crawler source code, you see that the CrawlerProcess class has a start function, but also a stop function. This stop function takes care of cleaning up the internals of the crawl so that the system ends up in a state from which it can start again.

So, if you want to restart the crawling without leaving your process, call crawler.stop() at the appropriate time. Later on, simply call crawler.start() again to resume operations.

Edit: in retrospect, this is not possible (because of the Twisted reactor, as mentioned in another answer); stop() only takes care of a clean termination. Looking back at my code, I happened to have a wrapper around the crawler processes. Below is some (redacted) code that makes it work using Python's multiprocessing module; this way you can restart crawlers more easily. (Note: I found the code online last month, but I didn't keep the source... so if someone knows where it came from, I'll update the credits.)

from scrapy import project, signals
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
from scrapy.xlib.pydispatch import dispatcher
from multiprocessing import Process, Queue

class CrawlerWorker(Process):
    def __init__(self, spider, results):
        Process.__init__(self)
        self.results = results

        self.crawler = CrawlerProcess(settings)
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()

        self.items = []
        self.spider = spider
        # Collect every scraped item so it can be sent back to the parent process
        dispatcher.connect(self._item_passed, signals.item_passed)

    def _item_passed(self, item):
        self.items.append(item)

    def run(self):
        # Runs in the child process, so the Twisted reactor starts (and dies) there
        self.crawler.crawl(self.spider)
        self.crawler.start()
        self.crawler.stop()
        self.results.put(self.items)

# The part below can be called as often as you want
results = Queue()
crawler = CrawlerWorker(MySpider(myArgs), results)
crawler.start()
for item in results.get():
    pass # Do something with item

6 Comments

adding crawler.stop() immediately after crawler.start() didn't help - how do I discover the "appropriate time"?
@Trindaz: I wasn't correct on that call, please see the updated answer.
Thanks for the update @jro. I've seen this snippet before too and, if I've interpreted it correctly, the concept is that you can scrape as much as you want by adding spiders to a crawler that never dies, rather than trying to restart a crawler for every attempt at "executing" a spider. I've marked this as the solution because it technically solves my problem, but it's unusable for me because I don't want to rely on persistent crawler objects in the Django application I'm using this in. I ended up writing a solution based purely on BeautifulSoup and urllib2.
Will this still run items through the pipelines defined in the settings?

crawler.start() starts the Twisted reactor, and there can be only one reactor per process.

If you want to run more spiders, use:

another_spider = MyAnotherSpider()
crawler.queue.append_spider(another_spider)

5 Comments

Scrapy 0.14 does not support multiple spiders in a CrawlerProcess anymore.
Haven't tested it, but this might work (judging from the source code): crawler.engine.open_spider(another_spider)
Why would you want to stop the reactor?
Sending a Ctrl-C interrupt signal doesn't close the spiders.
Yeah, it did... but I also ran into some problems handling the spider_opened and spider_closed signals in my pipeline. tinyurl.com/cpg55xp says the reactor might need to be configured?

I've used threads to start the reactor several times in one app and avoid the ReactorNotRestartable error.

Thread(target=process.start).start()

Here is the detailed explanation: Run a Scrapy spider in a Celery Task
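
For reference, here is a minimal sketch of that threading pattern, assuming a newer Scrapy where CrawlerProcess.crawl() accepts a spider class; MySpider and the project settings lookup are placeholders, and the exact behaviour of start() off the main thread depends on your Scrapy version:

from threading import Thread

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())  # create in the main thread
process.crawl(MySpider)  # MySpider: placeholder for your own spider class

# start() blocks until the reactor stops, so run it in a worker thread.
# Recent Scrapy versions may need process.start(install_signal_handlers=False)
# when the reactor is started outside the main thread.
worker = Thread(target=process.start)
worker.start()
worker.join()  # optional: wait for the crawl to finish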


It seems to me that you cannot call crawler.start() twice: you may have to re-create the crawler if you want it to run a second time.
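
If you take "re-create it" literally, one way (a sketch, not part of the original answer) is to give each run its own child process, so a fresh crawler and a fresh reactor are created every time; MySpider and get_project_settings() are placeholders here:

from multiprocessing import Process

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def crawl_once():
    # A brand-new CrawlerProcess (and Twisted reactor) lives and dies with this child process
    process = CrawlerProcess(get_project_settings())
    process.crawl(MySpider)  # MySpider: placeholder for your own spider class
    process.start()

if __name__ == "__main__":
    # Re-creating everything per run avoids ReactorNotRestartable in the parent
    for _ in range(2):
        worker = Process(target=crawl_once)
        worker.start()
        worker.join()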

