
I have written a Scrapy CrawlSpider.

import tldextract
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

# the item class lives in this project (module path assumed)
from scrapy_fastcrawler.items import FastcrawlerItem


class SiteCrawlerSpider(CrawlSpider):
    name = 'site_crawler'

    def __init__(self, start_url, **kw):
        super(SiteCrawlerSpider, self).__init__(**kw)

        self.rules = (
            Rule(LinkExtractor(allow=()), callback='parse_start_url', follow=True),
        )
        self.start_urls = [start_url]
        # allowed_domains should be a list, not a bare string
        self.allowed_domains = [tldextract.extract(start_url).registered_domain]

    def parse_start_url(self, response):
        external_links = LinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response)
        for link in external_links:
            i = FastcrawlerItem()
            i['pageurl'] = response.url
            i['ext_link'] = link.url
            i['ext_domain'] = tldextract.extract(link.url).registered_domain
            yield i

Now I am trying to run this spider from another Python script as follows:

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy_fastcrawler.spiders.site_crawler import SiteCrawlerSpider
from scrapy.utils.project import get_project_settings

spider = SiteCrawlerSpider(start_url='http://www.health.com/')
settings = get_project_settings()
crawler = Crawler(settings)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()

Problem: Everything runs fine, but the script processes only the start_url and then stops: it does not follow the other links found on the start page, so no further pages are processed. I have also set up pipelines, and the items from the start_url are correctly saved through them.

Any help is greatly appreciated.

1 Answer


When you override the default parse_start_url of a crawl spider, the method has to yield Requests for the spider to follow; otherwise it can't go anywhere.

You are not required to implement this method when subclassing CrawlSpider, and from the rest of your code it looks like you really don't want to: try renaming the method you have defined to something like parse_page (just don't call it parse, which CrawlSpider uses internally for its link-following logic).


5 Comments

Thanks for the help. I just tried this, but it doesn't work, and since the spider is now not parsing the start URL, I am not even getting the items from the start URL page.
Have you tried keeping parse_start_url, but having that method also return Requests for the links the spider should follow?
That is the problem: I don't know how to do that. Could you post a trivial example?
The Scrapy docs have examples showing how to return Requests from parse methods.
Thanks. This all was a great help. Got it working finally.
