My Scrapy script seems to work just fine when I run it in 'one-off' scenarios from the command line, but if I try to run the code twice in the same Python session I get this error:

"ReactorNotRestartable"

Why?

The offending code (last line throws the error):

crawler = CrawlerProcess(settings)
crawler.install()
crawler.configure()

# schedule spider
#crawler.crawl(MySpider())
spider = MySpider()
crawler.queue.append_spider(spider)

# start engine scrapy/twisted
crawler.start()

4 Answers

Close to Joël's answer, but I want to elaborate a bit more than is possible in the comments. If you look at the Crawler source code, you see that the CrawlerProcess class has a start function, but also a stop function. This stop function takes care of cleaning up the internals of the crawl so that the system ends up in a state from which it can start again.

So, if you want to restart the crawling without leaving your process, call crawler.stop() at the appropriate time. Later on, simply call crawler.start() again to resume operations.

Edit: in retrospect, this is not possible (because of the Twisted reactor, as mentioned in another answer); stop() only takes care of a clean termination. Looking back at my code, I happened to have a wrapper around the crawler processes. Below is some (redacted) code that makes it work using Python's multiprocessing module; this way you can restart crawlers more easily. (Note: I found the code online last month, but I didn't keep the source... so if someone knows where it came from, I'll update the credits.)

from scrapy import project, signals
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
from scrapy.xlib.pydispatch import dispatcher
from multiprocessing import Process, Queue

class CrawlerWorker(Process):
    def __init__(self, spider, results):
        Process.__init__(self)
        self.results = results

        self.crawler = CrawlerProcess(settings)
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()

        self.items = []
        self.spider = spider
        # Collect every scraped item so it can be sent back to the parent process
        dispatcher.connect(self._item_passed, signals.item_passed)

    def _item_passed(self, item):
        self.items.append(item)

    def run(self):
        # Runs in the child process, so the Twisted reactor starts (and dies) there
        self.crawler.crawl(self.spider)
        self.crawler.start()
        self.crawler.stop()
        self.results.put(self.items)

# The part below can be called as often as you want
results = Queue()
crawler = CrawlerWorker(MySpider(myArgs), results)
crawler.start()
for item in results.get():
    pass # Do something with item

6 Comments

adding crawler.stop() immediately after crawler.start() didn't help - how do I discover the "appropriate time"?
@Trindaz: I wasn't correct on that call, please see the updated answer.
Thanks for the update @jro. I've seen this snippet before too and, if I've interpreted it correctly, the concept is that you can scrape as much as you want by adding spiders to a crawler that never dies, rather than trying to restart a crawler for every attempt at "executing" a spider. I've marked this as the solution because it technically solves my problem, but it's unusable for me because I don't want to rely on persistent crawler objects in the Django application I'm using this in. I ended up writing a solution based purely on BeautifulSoup and urllib2.
Will this still run items through the pipelines defined in the settings?

crawler.start() starts the Twisted reactor, and there can be only one reactor per process.

If you want to run more spiders, use:

another_spider = MyAnotherSpider()
crawler.queue.append_spider(another_spider)

5 Comments

Scrapy 0.14 does not support multiple spiders in a CrawlerProcess anymore.
Haven't tested it, but this might work (judging from the source code): crawler.engine.open_spider(another_spider)
Why would you want to stop the reactor?
Sending a Ctrl-C interrupt signal doesn't close the spiders.
Yeah, it did... but I also ran into some problems handling the spider_opened and spider_closed signals in my pipeline. tinyurl.com/cpg55xp says the reactor might need to be configured?

I've used threads to start the reactor several times in one app and avoid the ReactorNotRestartable error.

Thread(target=process.start).start()

Here is the detailed explanation: Run a Scrapy spider in a Celery Task
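
For reference, here is a minimal sketch of that threading pattern, assuming a newer Scrapy where CrawlerProcess.crawl() accepts a spider class; MySpider and the project settings lookup are placeholders, and the exact behaviour of start() off the main thread depends on your Scrapy version:

from threading import Thread

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())  # create in the main thread
process.crawl(MySpider)  # MySpider: placeholder for your own spider class

# start() blocks until the reactor stops, so run it in a worker thread.
# Recent Scrapy versions may need process.start(install_signal_handlers=False)
# when the reactor is started outside the main thread.
worker = Thread(target=process.start)
worker.start()
worker.join()  # optional: wait for the crawl to finish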


It seems to me that you cannot call crawler.start() twice: you may have to re-create the crawler if you want it to run a second time.
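
If you take "re-create it" literally, one way (a sketch, not part of the original answer) is to give each run its own child process, so a fresh crawler and a fresh reactor are created every time; MySpider and get_project_settings() are placeholders here:

from multiprocessing import Process

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def crawl_once():
    # A brand-new CrawlerProcess (and Twisted reactor) lives and dies with this child process
    process = CrawlerProcess(get_project_settings())
    process.crawl(MySpider)  # MySpider: placeholder for your own spider class
    process.start()

if __name__ == "__main__":
    # Re-creating everything per run avoids ReactorNotRestartable in the parent
    for _ in range(2):
        worker = Process(target=crawl_once)
        worker.start()
        worker.join()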

