I'm making a tutorial on how to scrape with Scrapy. For that, I use Quarto/RStudio and the website https://quotes.toscrape.com/. For pedagogic purposes, I need to run a first crawl on the first page only, then a second crawl that follows the same-level pages, and then a third crawl that descends into the sub-level biography pages. Each time, I rewrite my Scrapy class with additional code to extend its functionality, so each time I have to rerun my spider.
The first crawl runs perfectly, but when I launch the second one, I get a ReactorNotRestartable error.
Here is my code (I've simplified the pedagogic prose and removed Quarto's elements):
```python
import scrapy
from scrapy.crawler import CrawlerProcess
import nest_asyncio
import json
```
Scrapy class to scrape the first page:
```python
class ScraperQuotesToScrapeSpider(scrapy.Spider):
    name = 'scraper_quotes_to_scrape'
    # allowed_domains takes bare domain names, not URLs
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):
        quotes_elements = response.css('div.quote')
        for quote_element in quotes_elements:
            author = quote_element.css('small.author::text').get()
            quote = quote_element.css('span.text::text').get()
            tags = quote_element.css('div.tags a.tag::text').getall()
            quotes = {
                'author': author,
                'quote': quote,
                'tags': tags
            }
            yield quotes
```
Execution:
```python
nest_asyncio.apply()

process = CrawlerProcess(
    settings={
        "FEEDS": {
            "quotes.json": {"format": "json", "overwrite": True},
        },
    }
)
process.crawl(ScraperQuotesToScrapeSpider)
process.start()
```
Reading the data:

```python
with open('quotes.json', 'r') as f:
    for line in f:
        print(line)
```
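As an aside on reading the feed: since the `json` feed format writes a single JSON array, the file can also be parsed in one call with `json.load` instead of printing raw lines. A sketch on a stand-in file (so the snippet is self-contained; the sample quote is made up for illustration):

```python
import json

# stand-in for the feed Scrapy writes: one JSON array of item dicts
sample = [{"author": "Albert Einstein",
           "quote": "The world as we have created it is a process of our thinking.",
           "tags": ["change", "thoughts"]}]
with open("quotes_sample.json", "w") as f:
    json.dump(sample, f)

# the whole feed parses in one call
with open("quotes_sample.json") as f:
    quotes = json.load(f)

print(quotes[0]["author"])  # → Albert Einstein
```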
Now we modify the Scrapy class to also scrape the same-level pages:
```python
class ScraperQuotesToScrapeSpider(scrapy.Spider):
    name = 'scraper_quotes_to_scrape'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):
        print("Processing ", response.url)
        quotes_elements = response.css('div.quote')
        for quote_element in quotes_elements:
            author = quote_element.css('small.author::text').get()
            quote = quote_element.css('span.text::text').get()
            tags = quote_element.css('div.tags a.tag::text').getall()
            quotes = {
                'author': author,
                'quote': quote,
                'tags': tags
            }
            yield quotes

        # response.follow() resolves the relative URL, so there is no need
        # to prepend the domain by hand
        next_page = response.css('.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
```
We launch the script and read the output data:
```python
nest_asyncio.apply()

process = CrawlerProcess(
    settings={
        "FEEDS": {
            "quotes.json": {"format": "json", "overwrite": True},
        },
    }
)
process.crawl(ScraperQuotesToScrapeSpider)
process.start()

with open('quotes.json', 'r') as f:
    for line in f:
        print(line)
```
Here I get the ReactorNotRestartable error.
So my question is: how can I stop the first spider instance after it has finished?
I imagine I can't reuse it, because I've changed the Scrapy class.
I also can't comment out the first spider and its execution, or my Quarto page won't render correctly.
Finally, I thought of killing the first spider's process before launching the second one, but I didn't find a proper way to do so (I imagined a simple process.stop(), but no).