I'm making a tutorial on how to scrape with Scrapy. For that, I use Quarto/RStudio and the website https://quotes.toscrape.com/. For pedagogic purposes, I need to run a first crawl on the first page only, then a second crawl that follows the same-level pages, and then a third crawl that descends into the sub-level biography pages. Each time, I rewrite my Scrapy class with additional code to extend its functionality, so each time I have to rerun my spider.
The first crawl runs perfectly, but when I launch the second one, I get a ReactorNotRestartable error.
Here is my code (I've simplified the pedagogic prose and removed Quarto's elements):
```python
import scrapy
from scrapy.crawler import CrawlerProcess
import nest_asyncio
import json
```
Scrapy class to scrape the first page:
```python
class ScraperQuotesToScrapeSpider(scrapy.Spider):
    name = 'scraper_quotes_to_scrape'
    # allowed_domains takes bare domain names, not URLs
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):
        quotes_elements = response.css('div.quote')
        for quote_element in quotes_elements:
            author = quote_element.css('small.author::text').get()
            quote = quote_element.css('span.text::text').get()
            tags = quote_element.css('div.tags a.tag::text').getall()
            quotes = {
                'author': author,
                'quote': quote,
                'tags': tags
            }
            yield quotes
```
Execution:
```python
nest_asyncio.apply()

process = CrawlerProcess(
    settings={
        "FEEDS": {
            "quotes.json": {"format": "json", "overwrite": True},
        },
    }
)
process.crawl(ScraperQuotesToScrapeSpider)
process.start()
```
Reading the data:

```python
with open('quotes.json', 'r') as f:
    for line in f:
        print(line)
```
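As an aside on reading the feed: since the `json` feed format writes a single JSON array, the file can also be parsed in one call with `json.load` instead of printing raw lines. A sketch on a stand-in file (so the snippet is self-contained; the sample quote is made up for illustration):

```python
import json

# stand-in for the feed Scrapy writes: one JSON array of item dicts
sample = [{"author": "Albert Einstein",
           "quote": "The world as we have created it is a process of our thinking.",
           "tags": ["change", "thoughts"]}]
with open("quotes_sample.json", "w") as f:
    json.dump(sample, f)

# the whole feed parses in one call
with open("quotes_sample.json") as f:
    quotes = json.load(f)

print(quotes[0]["author"])  # → Albert Einstein
```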
Now we modify the Scrapy class to also scrape the same-level pages:
```python
class ScraperQuotesToScrapeSpider(scrapy.Spider):
    name = 'scraper_quotes_to_scrape'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):
        print("Processing ", response.url)
        quotes_elements = response.css('div.quote')
        for quote_element in quotes_elements:
            author = quote_element.css('small.author::text').get()
            quote = quote_element.css('span.text::text').get()
            tags = quote_element.css('div.tags a.tag::text').getall()
            quotes = {
                'author': author,
                'quote': quote,
                'tags': tags
            }
            yield quotes

        # response.follow() resolves the relative URL, so there is no need
        # to prepend the domain by hand
        next_page = response.css('.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
```
We launch the script and read the output data:
```python
nest_asyncio.apply()

process = CrawlerProcess(
    settings={
        "FEEDS": {
            "quotes.json": {"format": "json", "overwrite": True},
        },
    }
)
process.crawl(ScraperQuotesToScrapeSpider)
process.start()

with open('quotes.json', 'r') as f:
    for line in f:
        print(line)
```
Here I get the ReactorNotRestartable error.
So my question is: how can I stop the first spider instance after it has finished?
I imagine I can't reuse it, because I've changed the Scrapy class.
I also can't comment out the first spider and its execution, or my Quarto page won't render correctly.
Finally, I thought of killing the first spider's process before launching the second one, but I didn't find a proper way to do so (I imagined a simple process.stop(), but no).