
I have written a Scrapy CrawlSpider.

import tldextract
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

# the item class lives in this project (module path assumed)
from scrapy_fastcrawler.items import FastcrawlerItem


class SiteCrawlerSpider(CrawlSpider):
    name = 'site_crawler'

    def __init__(self, start_url, **kw):
        super(SiteCrawlerSpider, self).__init__(**kw)

        self.rules = (
            Rule(LinkExtractor(allow=()), callback='parse_start_url', follow=True),
        )
        self.start_urls = [start_url]
        # allowed_domains should be a list, not a bare string
        self.allowed_domains = [tldextract.extract(start_url).registered_domain]

    def parse_start_url(self, response):
        external_links = LinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response)
        for link in external_links:
            i = FastcrawlerItem()
            i['pageurl'] = response.url
            i['ext_link'] = link.url
            i['ext_domain'] = tldextract.extract(link.url).registered_domain
            yield i

Now I am trying to run this spider from another Python script as follows:

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy_fastcrawler.spiders.site_crawler import SiteCrawlerSpider
from scrapy.utils.project import get_project_settings

spider = SiteCrawlerSpider(start_url='http://www.health.com/')
settings = get_project_settings()
crawler = Crawler(settings)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()

Problem: Everything runs fine, but the script processes only the start_url and then stops: it does not follow the other links found on the start page, so no further pages are processed. I have also set up pipelines, and the items from the start_url are correctly saved through them.

Any help is greatly appreciated.

1 Answer


When you override the default parse_start_url of a crawl spider, the method has to yield Requests for the spider to follow; otherwise it can't go anywhere.

You are not required to implement this method when subclassing CrawlSpider, and from the rest of your code it looks like you really don't want to: try renaming the method you have defined to something like parse_page (just don't call it parse, which CrawlSpider uses internally for its link-following logic).


5 Comments

Thanks for the help. I just tried this, but it doesn't work, and since the spider is now not parsing the start URL, I am not even getting the items from the start URL page.
Have you tried keeping parse_start_url, but having that method also return Requests for the links the spider should follow?
That is the problem: I don't know how to do that. Could you post a trivial example?
The Scrapy docs have examples showing how to return Requests from parse methods.
Thanks. This all was a great help. Got it working finally.
