
I'm having a problem getting my Scrapy spider to run its callback method.

I don't think it's an indentation error, which seems to have been the cause in other similar posts, but perhaps it is and I just don't see it? Any ideas?

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy import log
import tldextract

class CrawlerSpider(CrawlSpider):
  name = "crawler"

  def __init__(self, initial_url):
    log.msg('initing...', level=log.WARNING)
    CrawlSpider.__init__(self)

    if not initial_url.startswith('http'):
      initial_url = 'http://' + initial_url

    ext = tldextract.extract(initial_url)
    initial_domain = ext.domain + '.' + ext.tld
    initial_subdomain = ext.subdomain + '.' + ext.domain + '.' + ext.tld
    self.allowed_domains = [initial_domain, 'www.' + initial_domain, initial_subdomain]
    self.start_urls = [initial_url]
    self.rules = [
        Rule(SgmlLinkExtractor(), callback='parse_item'),
        Rule(SgmlLinkExtractor(allow_domains=self.allowed_domains), follow=True),
    ]
    self._compile_rules()

  def parse_item(self, response):
    log.msg('parse_item...', level=log.WARNING)
    hxs = HtmlXPathSelector(response)
    links = hxs.select("//a/@href").extract()
    for link in links:
      log.msg('link', level=log.WARNING)

Sample output is below; it should show a warning message with "parse_item..." printed but it doesn't.

$ scrapy crawl crawler -a initial_url=http://www.szuhanchang.com/test.html
2013-02-19 18:03:24+0000 [scrapy] INFO: Scrapy 0.16.4 started (bot: crawler)
2013-02-19 18:03:24+0000 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-02-19 18:03:24+0000 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-02-19 18:03:24+0000 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-02-19 18:03:24+0000 [scrapy] DEBUG: Enabled item pipelines: 
2013-02-19 18:03:24+0000 [scrapy] WARNING: initing...
2013-02-19 18:03:24+0000 [crawler] INFO: Spider opened
2013-02-19 18:03:24+0000 [crawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-02-19 18:03:24+0000 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-02-19 18:03:24+0000 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-02-19 18:03:25+0000 [crawler] DEBUG: Crawled (200) <GET http://www.szuhanchang.com/test.html> (referer: None)
2013-02-19 18:03:25+0000 [crawler] DEBUG: Filtered offsite request to 'www.20130219-0606.com': <GET http://www.20130219-0606.com/>
2013-02-19 18:03:25+0000 [crawler] INFO: Closing spider (finished)
2013-02-19 18:03:25+0000 [crawler] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 234,
         'downloader/request_count': 1,
         'downloader/request_method_count/GET': 1,
         'downloader/response_bytes': 363,
         'downloader/response_count': 1,
         'downloader/response_status_count/200': 1,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2013, 2, 19, 18, 3, 25, 84855),
         'log_count/DEBUG': 8,
         'log_count/INFO': 4,
         'log_count/WARNING': 1,
         'request_depth_max': 1,
         'response_received_count': 1,
         'scheduler/dequeued': 1,
         'scheduler/dequeued/memory': 1,
         'scheduler/enqueued': 1,
         'scheduler/enqueued/memory': 1,
         'start_time': datetime.datetime(2013, 2, 19, 18, 3, 24, 805064)}
2013-02-19 18:03:25+0000 [crawler] INFO: Spider closed (finished)

Thanks in advance!

3 Comments
  • How are you running this spider? From the command line with scrapy crawl crawler? Commented Feb 19, 2013 at 17:18
  • Through a Sidekiq (queuing) worker, but I've tried it on the command line as well with no luck. I've updated the question to include the command-line output for clarity. Commented Feb 19, 2013 at 18:02
  • Please provide a short, self-contained example (sscce.org). If I pasted this code into a new spider, it wouldn't work plus I'd have to install the tldextract module, which makes testing a little tricky. Commented Feb 19, 2013 at 18:46

2 Answers


The start URL http://www.szuhanchang.com/test.html contains only one anchor link, namely:

<a href="http://www.20130219-0606.com">Test</a>

which points to the domain 20130219-0606.com. Given your allowed_domains of:

['szuhanchang.com', 'www.szuhanchang.com', 'www.szuhanchang.com']

this Request gets filtered by the OffsiteMiddleware:

2013-02-19 18:03:25+0000 [crawler] DEBUG: Filtered offsite request to 'www.20130219-0606.com': <GET http://www.20130219-0606.com/>

so parse_item is never called for this URL.
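
If you just want to confirm that the offsite filter is what's stopping the callback, here is a minimal sketch of a tweak to your __init__ (the extra domain is purely illustrative, for testing only):

# Purely illustrative: with the external domain whitelisted, OffsiteMiddleware
# no longer filters the request and the first rule's callback ('parse_item')
# is invoked for it.
self.allowed_domains = [
    initial_domain,
    'www.' + initial_domain,
    initial_subdomain,
    '20130219-0606.com',  # the external domain linked from the test page
]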


2 Comments

  • I have two rules, though: one includes the allowed domains and should not follow offsite URLs, and the other says to run the callback on every URL found.
  • The second rule will never be processed, because a link is only handled by the first rule it satisfies, and your first rule matches every link (by default, the href of any anchor tag). Also, a link extractor only pulls out links according to its own parameters (in your case the allow_domains argument, if that rule came first), but that list does not override the OffsiteMiddleware, so the offsite request is still filtered out. See the sketch below for a single rule that combines the two.
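
A minimal sketch of that single-rule version, replacing the two rules in __init__ above (untested against your exact setup):

# One rule instead of two: extract only links within the allowed domains,
# follow them, and run the callback on each crawled page. Because a link is
# only processed by the first rule it matches, combining callback and
# follow=True avoids the second rule being shadowed.
self.rules = [
    Rule(SgmlLinkExtractor(allow_domains=self.allowed_domains),
         callback='parse_item',
         follow=True),
]
self._compile_rules()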

Changing the name of your callback to parse_start_url seems to work, although since the test URL provided is quite small, I can't be sure this will hold up more generally. Give it a go and let me know. :)
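
For reference, a sketch of that change; parse_start_url is the CrawlSpider hook invoked for responses to the start_urls, so it runs even when every extracted link is filtered as offsite:

  def parse_start_url(self, response):
    # Called by CrawlSpider for each response from start_urls, so it fires
    # even when all extracted links are filtered by OffsiteMiddleware.
    log.msg('parse_start_url...', level=log.WARNING)
    hxs = HtmlXPathSelector(response)
    for link in hxs.select("//a/@href").extract():
      log.msg(link, level=log.WARNING)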

2 Comments

  • Unfortunately this didn't work; it would be weird if it did, since 'parse_item' isn't an implemented method in any of the Crawler parent classes and many examples online use that exact callback method name.
  • It works on mine, but I had to hardcode initial_domain and initial_subdomain to remove the tldextract references, so it's not quite the same as the code above. If you could post a non-working example that doesn't use that module, that would be better.
