Scraping within a url using scrapy

Question

I am trying to scrape craigslist using scrapy and have been successful in getting the url's but now I want to go extract data from within the page in the url . Following is the code :

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from craigslist.items import CraigslistItem

class craigslist_spider(BaseSpider):
    name = "craigslist_unique"
    allowed_domains = ["craiglist.org"]
    start_urls = [
        "http://sfbay.craigslist.org/search/sof?zoomToPosting=&query=&srchType=A&addFour=part-time",
        "http://newyork.craigslist.org/search/sof?zoomToPosting=&query=&srchType=A&addThree=internship",
    "http://seattle.craigslist.org/search/sof?zoomToPosting=&query=&srchType=A&addFour=part-time"
    ]


def parse(self, response):
   hxs = HtmlXPathSelector(response)
   sites = hxs.select("//span[@class='pl']")
   items = []
   for site in sites:
       item = CraigslistItem()
       item['title'] = site.select('a/text()').extract()
       item['link'] = site.select('a/@href').extract()
   #item['desc'] = site.select('text()').extract()
       items.append(item)
   hxs = HtmlXPathSelector(response)
   #print title, link        
   return items

I am new to scrapy and unable to figure out as to how to actually hit the url (href) and get data within the page of that url and doing that for all the urls.

Since you're crawling, use the CrawlSpider. Read the documentation for a few examples. — Blender
– Blender, Commented May 27, 2013 at 0:06

akhter wahab · Accepted Answer · 2013-05-27 07:31:51Z

Response of start_urls Received one by one in parse method

if you just want to grab information from that start_urls responses your code is almost ok. but your parse method should be in your craigslist_spider class not out side of that class.

def parse(self, response):
   hxs = HtmlXPathSelector(response)
   sites = hxs.select("//span[@class='pl']")
   items = []
   for site in sites:
       item = CraigslistItem()
       item['title'] = site.select('a/text()').extract()
       item['link'] = site.select('a/@href').extract()
       items.append(item)
   #print title, link
   return items

what if you want to get half information from start_urls and half from the anchor that is present on start_urls response ?

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select("//span[@class='pl']")
    for site in sites:
        item = CraigslistItem()
        item['title'] = site.select('a/text()').extract()
        item['link'] = site.select('a/@href').extract()
        if item['link']:
            if 'http://' not in item['link']:
                item['link'] = urljoin(response.url, item['link'])
            yield Request(item['link'],
                          meta={'item': item},
                          callback=self.anchor_page)


def anchor_page(self, response):
    hxs = HtmlXPathSelector(response)
    old_item = response.request.meta['item'] # Receiving parse Method item that was in Request meta
    # parse some more values
    #place them in old_item
    #e.g
    old_item['bla_bla']=hxs.select("bla bla").extract()
    yield old_item

you just needs to yield Request in parse method and ship your old item using meta of Request

then extract old_item in a anchor_page add new values in it and simply yield it.

alecxe · Accepted Answer · 2013-05-27 07:34:29Z

There is a problem in your xpaths - they should be relative. Here's the code:

from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class CraigslistItem(Item):
    title = Field()
    link = Field()


class CraigslistSpider(BaseSpider):
    name = "craigslist_unique"
    allowed_domains = ["craiglist.org"]
    start_urls = [
        "http://sfbay.craigslist.org/search/sof?zoomToPosting=&query=&srchType=A&addFour=part-time",
        "http://newyork.craigslist.org/search/sof?zoomToPosting=&query=&srchType=A&addThree=internship",
        "http://seattle.craigslist.org/search/sof?zoomToPosting=&query=&srchType=A&addFour=part-time"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select("//span[@class='pl']")
        items = []
        for site in sites:
            item = CraigslistItem()
            item['title'] = site.select('.//a/text()').extract()[0]
            item['link'] = site.select('.//a/@href').extract()[0]
            items.append(item)
        return items

If run it via:

scrapy runspider spider.py -o output.json

you'll see in output.json:

{"link": "/sby/sof/3824966457.html", "title": "HR Admin/Tech Recruiter"}
{"link": "/eby/sof/3824932209.html", "title": "Entry Level Web Developer"}
{"link": "/sfc/sof/3824500262.html", "title": "Sr. Ruby on Rails Contractor @ Funded Startup"}
...

Hope that helps.

Collectives™ on Stack Overflow

Scraping within a url using scrapy

2 Answers 2

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related