Here is my Spider:

import scrapy
import urlparse
from scrapy.http import Request

class BasicSpider(scrapy.Spider):
    name = "basic2"
    allowed_domains = ["cnblogs"]
    start_urls = (
        'http://www.cnblogs.com/kylinlin/',
    )

    def parse(self, response):
        # Queue the "next page" link found in the pager.
        next_site = response.xpath(".//*[@id='nav_next_page']/a/@href")
        for url in next_site.extract():
            yield Request(urlparse.urljoin(response.url, url))

        # Queue each post-title link, to be handled by parse_item.
        item_selector = response.xpath(".//*[@class='postTitle']/a/@href")
        for url in item_selector.extract():
            yield Request(url=urlparse.urljoin(response.url, url),
                          callback=self.parse_item)

    def parse_item(self, response):
        print "+=====================>>test"

Here is the output:

2016-08-12 14:46:20 [scrapy] INFO: Spider opened
2016-08-12 14:46:20 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-08-12 14:46:20 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-08-12 14:46:20 [scrapy] DEBUG: Crawled (200) <GET http://www.cnblogs.com/robots.txt> (referer: None)
2016-08-12 14:46:20 [scrapy] DEBUG: Crawled (200) <GET http://www.cnblogs.com/kylinlin/> (referer: None)
2016-08-12 14:46:20 [scrapy] DEBUG: Filtered offsite request to 'www.cnblogs.com': <GET http://www.cnblogs.com/kylinlin/default.html?page=2>
2016-08-12 14:46:20 [scrapy] INFO: Closing spider (finished)
2016-08-12 14:46:20 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 445,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 5113,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 8, 12, 6, 46, 20, 420000),
'log_count/DEBUG': 4,
'log_count/INFO': 7,
'offsite/domains': 1,
'offsite/filtered': 11,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2016, 8, 12, 6, 46, 20, 131000)}
2016-08-12 14:46:20 [scrapy] INFO: Spider closed (finished)

Why does it say 0 pages were crawled? I can't understand why there is no output like "+=====================>>test". Could someone help me out?

1 Answer

The key is this line in your log, which shows the offsite middleware dropping the request:

2016-08-12 14:46:20 [scrapy] DEBUG: Filtered offsite request to 'www.cnblogs.com': <GET http://www.cnblogs.com/kylinlin/default.html?page=2>

and your allowed_domains is set to:

allowed_domains = ["cnblogs"]

which is not even a domain. It should be:

allowed_domains = ["cnblogs.com"]
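
With that one-line change, the page=2 request passes the offsite filter, the post links get crawled, and parse_item runs (printing your test line).

If you want to see why the filter fires, Scrapy ships a helper, scrapy.utils.url.url_is_from_any_domain, that applies the same host-matching rule the offsite middleware relies on. A quick sketch (assuming Scrapy 1.x, where this helper is available):

from scrapy.utils.url import url_is_from_any_domain

url = 'http://www.cnblogs.com/kylinlin/default.html?page=2'

# The request's host is 'www.cnblogs.com'. It neither equals 'cnblogs'
# nor ends with '.cnblogs', so the request counts as offsite and is dropped.
print(url_is_from_any_domain(url, ['cnblogs']))      # False

# 'www.cnblogs.com' ends with '.cnblogs.com', so this one matches.
print(url_is_from_any_domain(url, ['cnblogs.com']))  # True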