
I'm having a problem getting my Scrapy spider to run its callback method.

I don't think it's an indentation error, which seems to have been the cause in other similar posts, but perhaps it is and I just don't see it? Any ideas?

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy import log
import tldextract

class CrawlerSpider(CrawlSpider):
  name = "crawler"

  def __init__(self, initial_url):
    log.msg('initing...', level=log.WARNING)
    CrawlSpider.__init__(self)

    if not initial_url.startswith('http'):
      initial_url = 'http://' + initial_url

    ext = tldextract.extract(initial_url)
    initial_domain = ext.domain + '.' + ext.tld
    initial_subdomain = ext.subdomain + '.' + ext.domain + '.' + ext.tld
    self.allowed_domains = [initial_domain, 'www.' + initial_domain, initial_subdomain]
    self.start_urls = [initial_url]
    self.rules = [
        Rule(SgmlLinkExtractor(), callback='parse_item'),
        Rule(SgmlLinkExtractor(allow_domains=self.allowed_domains), follow=True),
    ]
    self._compile_rules()

  def parse_item(self, response):
    log.msg('parse_item...', level=log.WARNING)
    hxs = HtmlXPathSelector(response)
    links = hxs.select("//a/@href").extract()
    for link in links:
      log.msg('link', level=log.WARNING)

Sample output is below; it should show a warning message with "parse_item..." printed but it doesn't.

$ scrapy crawl crawler -a initial_url=http://www.szuhanchang.com/test.html
2013-02-19 18:03:24+0000 [scrapy] INFO: Scrapy 0.16.4 started (bot: crawler)
2013-02-19 18:03:24+0000 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-02-19 18:03:24+0000 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-02-19 18:03:24+0000 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-02-19 18:03:24+0000 [scrapy] DEBUG: Enabled item pipelines: 
2013-02-19 18:03:24+0000 [scrapy] WARNING: initing...
2013-02-19 18:03:24+0000 [crawler] INFO: Spider opened
2013-02-19 18:03:24+0000 [crawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-02-19 18:03:24+0000 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-02-19 18:03:24+0000 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-02-19 18:03:25+0000 [crawler] DEBUG: Crawled (200) <GET http://www.szuhanchang.com/test.html> (referer: None)
2013-02-19 18:03:25+0000 [crawler] DEBUG: Filtered offsite request to 'www.20130219-0606.com': <GET http://www.20130219-0606.com/>
2013-02-19 18:03:25+0000 [crawler] INFO: Closing spider (finished)
2013-02-19 18:03:25+0000 [crawler] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 234,
         'downloader/request_count': 1,
         'downloader/request_method_count/GET': 1,
         'downloader/response_bytes': 363,
         'downloader/response_count': 1,
         'downloader/response_status_count/200': 1,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2013, 2, 19, 18, 3, 25, 84855),
         'log_count/DEBUG': 8,
         'log_count/INFO': 4,
         'log_count/WARNING': 1,
         'request_depth_max': 1,
         'response_received_count': 1,
         'scheduler/dequeued': 1,
         'scheduler/dequeued/memory': 1,
         'scheduler/enqueued': 1,
         'scheduler/enqueued/memory': 1,
         'start_time': datetime.datetime(2013, 2, 19, 18, 3, 24, 805064)}
2013-02-19 18:03:25+0000 [crawler] INFO: Spider closed (finished)

Thanks in advance!

3 Comments
  • How are you running this spider? From the command line with scrapy crawl crawler? Commented Feb 19, 2013 at 17:18
  • Through a Sidekiq (queuing) worker, but I've tried it on the command line as well with no luck. I've updated the question to include the command-line output for clarity. Commented Feb 19, 2013 at 18:02
  • Please provide a short, self-contained example (sscce.org). If I pasted this code into a new spider, it wouldn't work plus I'd have to install the tldextract module, which makes testing a little tricky. Commented Feb 19, 2013 at 18:46

2 Answers


The start URL http://www.szuhanchang.com/test.html contains only one anchor link, namely:

<a href="http://www.20130219-0606.com">Test</a>

which points to the domain 20130219-0606.com. Given your allowed_domains of:

['szuhanchang.com', 'www.szuhanchang.com', 'www.szuhanchang.com']

this Request gets filtered by the OffsiteMiddleware:

2013-02-19 18:03:25+0000 [crawler] DEBUG: Filtered offsite request to 'www.20130219-0606.com': <GET http://www.20130219-0606.com/>

so parse_item is never called for this URL.
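
If you just want to confirm that the offsite filter is what's stopping the callback, here is a minimal sketch of a tweak to your __init__ (the extra domain is purely illustrative, for testing only):

# Purely illustrative: with the external domain whitelisted, OffsiteMiddleware
# no longer filters the request and the first rule's callback ('parse_item')
# is invoked for it.
self.allowed_domains = [
    initial_domain,
    'www.' + initial_domain,
    initial_subdomain,
    '20130219-0606.com',  # the external domain linked from the test page
]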


2 Comments

  • I have two rules, though: one includes the allowed domains and should not follow offsite URLs, and the other says to run the callback on every URL found.
  • The second rule will never be processed, because a link is only handled by the first rule it satisfies, and your first rule matches every link (by default, the href of any anchor tag). Also, a link extractor only pulls out links according to its own parameters (in your case the allow_domains argument, if that rule came first), but that list does not override the OffsiteMiddleware, so the offsite request is still filtered out. See the sketch below for a single rule that combines the two.
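
A minimal sketch of that single-rule version, replacing the two rules in __init__ above (untested against your exact setup):

# One rule instead of two: extract only links within the allowed domains,
# follow them, and run the callback on each crawled page. Because a link is
# only processed by the first rule it matches, combining callback and
# follow=True avoids the second rule being shadowed.
self.rules = [
    Rule(SgmlLinkExtractor(allow_domains=self.allowed_domains),
         callback='parse_item',
         follow=True),
]
self._compile_rules()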

Changing the name of your callback to parse_start_url seems to work, although since the test URL provided is quite small, I can't be sure this will hold up more generally. Give it a go and let me know. :)
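
For reference, a sketch of that change; parse_start_url is the CrawlSpider hook invoked for responses to the start_urls, so it runs even when every extracted link is filtered as offsite:

  def parse_start_url(self, response):
    # Called by CrawlSpider for each response from start_urls, so it fires
    # even when all extracted links are filtered by OffsiteMiddleware.
    log.msg('parse_start_url...', level=log.WARNING)
    hxs = HtmlXPathSelector(response)
    for link in hxs.select("//a/@href").extract():
      log.msg(link, level=log.WARNING)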

2 Comments

  • Unfortunately this didn't work; it would be weird if it did, since 'parse_item' isn't an implemented method in any of the Crawler parent classes and many examples online use that exact callback method name.
  • It works on mine, but I had to hardcode initial_domain and initial_subdomain to remove the tldextract references, so it's not quite the same as the code above. If you could post a non-working example that doesn't use that module, that would be better.
