
I'm making a tutorial on how to scrape with Scrapy. For that, I use Quarto/RStudio and the website https://quotes.toscrape.com/. For pedagogical purposes, I need to run a first crawl on the first page, then a second crawl on the same-level pages, then a third crawl on the sub-level biography pages. Each time, I rewrite my Scrapy class with additional logic to extend its functionality, so each time I have to rerun my spider.

The first crawl finishes perfectly, but when I launch the second one, I get a ReactorNotRestartable error.

Here is my code (I've simplified the pedagogical text and removed Quarto's elements):

import scrapy
from scrapy.crawler import CrawlerProcess
import nest_asyncio
import json

Scrapy class to scrape the first page:

class ScraperQuotesToScrapeSpider(scrapy.Spider):
    name = 'scraper_quotes_to_scrape'
    allowed_domains = ['quotes.toscrape.com']  # domain only, no scheme
    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):
        quotes_elements = response.css('div.quote')
        for quote_element in quotes_elements:
            author = quote_element.css('small.author::text').get()
            quote = quote_element.css('span.text::text').get()
            tags = quote_element.css('div.tags a.tag::text').getall()

            quotes = {
                'author': author,
                'quote': quote,
                'tags': tags
            }
            yield quotes

Execution:

nest_asyncio.apply()
process = CrawlerProcess(
    settings={
        "FEEDS": {
            "quotes.json": {"format": "json", "overwrite": True},
        },
    }
)

process.crawl(ScraperQuotesToScrapeSpider)
process.start()


Reading the data:

with open('quotes.json', 'r') as f:
  for line in f:
    print(line)
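Side note: since the FEEDS export writes a single JSON array (and `json` is already imported above), the file can also be parsed instead of printed line by line. A minimal sketch, with a made-up sample string standing in for the file contents (on the real file you would call `json.load(f)` on the open handle):

```python
import json

# Made-up sample standing in for the contents of quotes.json.
sample = '[{"author": "Jane Doe", "quote": "Hello", "tags": ["life", "test"]}]'
quotes = json.loads(sample)

for q in quotes:
    print(q["author"], "-", q["quote"], q["tags"])
```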

Now, we modify the Scrapy class to scrape the same-level pages:

class ScraperQuotesToScrapeSpider(scrapy.Spider):
    name = 'scraper_quotes_to_scrape'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):
        print("Processing ", response.url)
        quotes_elements = response.css('div.quote')
        for quote_element in quotes_elements:
            author = quote_element.css('small.author::text').get()
            quote = quote_element.css('span.text::text').get()
            tags = quote_element.css('div.tags a.tag::text').getall()

            quotes = {
                'author': author,
                'quote': quote,
                'tags': tags
            }
            yield quotes

        next_page = response.css('.next a::attr(href)').get()
        # domain = self.start_urls[0][0:len(self.start_urls[0])-1]
        # next_page = domain + next_page
        # print("next page = ", next_page)

        if next_page is not None:
            yield response.follow(next_page, self.parse)
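The commented-out lines above build the absolute URL by hand; `response.follow` already resolves relative hrefs against the response URL, the same way `urllib.parse.urljoin` does. A quick illustration with hypothetical values mirroring what the spider sees on page 1:

```python
from urllib.parse import urljoin

# Hypothetical values: the page's base URL and the href from '.next a'.
base = "https://quotes.toscrape.com/"
next_href = "/page/2/"

# response.follow(next_href, ...) would request this same absolute URL.
print(urljoin(base, next_href))  # https://quotes.toscrape.com/page/2/
```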

We launch the script and read the output data:

nest_asyncio.apply()
process = CrawlerProcess(
    settings={
        "FEEDS": {
            "quotes.json": {"format": "json", "overwrite": True},
        },
    }
)
process.crawl(ScraperQuotesToScrapeSpider)
process.start()
with open('quotes.json', 'r') as f:
  for line in f:
    print(line)

Here I get the ReactorNotRestartable error.

So my question is: how do I stop the first spider instance after it has finished?

I imagine that I can't reuse it because I've changed the Scrapy class.

Also, I can't simply comment out my first spider and its execution, because then my Quarto page won't render correctly.

Finally, I thought of killing the first spider's process before launching my second spider, but I didn't find a proper way to do so (I imagined a simple process.stop(), but no such method exists).
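For reference, one workaround I've seen suggested is to run each crawl in its own OS process, so that each gets a fresh Twisted reactor. A rough sketch (the helper names are my own, and I'm not sure it's the intended pattern for a Quarto document):

```python
import multiprocessing

def run_in_fresh_process(target, *args):
    # Run target(*args) in a child process and wait for it. A fresh
    # process gets a fresh Twisted reactor, so each crawl can call
    # CrawlerProcess.start() without hitting ReactorNotRestartable.
    p = multiprocessing.Process(target=target, args=args)
    p.start()
    p.join()
    return p.exitcode  # 0 on success

def crawl(spider_cls):
    # Scrapy imports stay inside the child so the reactor is only
    # ever created (and exhausted) in the short-lived process.
    from scrapy.crawler import CrawlerProcess
    process = CrawlerProcess(settings={
        "FEEDS": {"quotes.json": {"format": "json", "overwrite": True}},
    })
    process.crawl(spider_cls)
    process.start()

# Each Quarto chunk could then call:
# run_in_fresh_process(crawl, ScraperQuotesToScrapeSpider)
```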

  • I have no idea what RStudio is (and it would be nice if you could rephrase your question to be more generic), but if you want to use CrawlerProcess more than once, you need to do it in different processes; there is no official way around that. Commented Nov 26 at 9:18
