
I have been tasked with building a scraper for a property site whose results will be stored for later processing. The site is a national one that will not return all of its content in a single search: it expects you to specify a region before it shows any results. To get around this I have built a Scrapy spider with multiple start URLs, each of which takes me directly to a region I'm interested in. The site is also dynamically populated, so I'm using Selenium to render the JavaScript on each page and then following the 'next' button until the spider has finished with that region.

This works well with a single start URL, but as soon as there is more than one I run into a problem. The scraper starts fine, but before the webdriver has finished following the 'next' button to the end of a region (there may be 20 pages to follow for a single region, for example) the scraper moves on to the next region (start URL), so the first region's content is only partially scraped. I've searched extensively for a solution but have yet to find anyone with this particular issue. Any suggestions would be most welcome. Example code below:

from scrapy.spiders                 import CrawlSpider
from scrapy.http                    import TextResponse
from scrapy.selector                import HtmlXPathSelector
from selenium                       import webdriver
from selenium.webdriver.common.by   import By
from selenium.webdriver.support.ui  import WebDriverWait
from selenium.webdriver.support     import expected_conditions as EC
from selenium.common.exceptions     import TimeoutException
from selenium_spider.items          import DemoSpiderItem
import time
import sys

class DemoSpider(CrawlSpider):
    name = "Demo"
    allowed_domains = ['example.com']
    start_urls = ["http://www.example.co.uk/locationIdentifier=REGION    1234",
                  "http://www.example.co.uk/property-for-sale/locationIdentifier=REGION    5678"]

    def __init__(self):
        super(DemoSpider, self).__init__()
        self.driver = webdriver.Firefox()

    def __del__(self):
        self.driver.quit()
    def parse(self, response):
        self.driver.get(response.url)

        result = response.xpath('//*[@class="l-searchResults"]')
        source = 'aTest'
        while True:
            try:
                # wait until the 'next' pagination button is clickable
                element = WebDriverWait(self.driver, 10).until(
                    EC.element_to_be_clickable((By.CSS_SELECTOR, ".pagination-button.pagination-direction.pagination-direction--next"))
                )
                print "Scraping new site --------------->", result
                print "This is the result----------->", result
                for properties in result:
                    saleOrRent = properties.xpath('//*[@class = "property-title"]/text()').extract()
                    addresses = properties.xpath('//*[@class="property-address"]/text()').extract()
                    if saleOrRent:
                        saleOrRent = saleOrRent[0]
                        if 'for sale' in saleOrRent:
                            saleOrRent = 'For Sale'
                        elif 'to rent' in saleOrRent:
                            saleOrRent = 'To Rent'
                    for a in addresses:
                        item = DemoSpiderItem()
                        item["saleOrRent"] = saleOrRent
                        item["source"] = source
                        item["address"] = a
                        item["response"] = response
                        yield item
                element.click()
            except TimeoutException:
                break
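
For completeness, the DemoSpiderItem referenced above isn't shown in the question; here is a minimal sketch of how it could be declared in selenium_spider/items.py, inferred purely from the fields populated in parse():

import scrapy

class DemoSpiderItem(scrapy.Item):
    # fields inferred from how the item is filled in the spider above
    saleOrRent = scrapy.Field()
    source = scrapy.Field()
    address = scrapy.Field()
    response = scrapy.Field()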
  • I have the exact same problem! Have you found a solution yet? I'm currently looking as well; if I come across something I'll let you know. Commented Jun 23, 2016 at 21:54

1 Answer


I've actually just played around a bit and it turns out to be easier than I thought. You pass only one initial URL in start_urls and keep the remaining URLs in a separate manual list. Whenever a region runs out of results, you yield a manual Request for the next entry in that list, with the parse method as the callback, and use a counter to track which index in manual_urls to request next.

This way you decide yourself when the next URL is loaded, e.g. once you get no more results. The only downside is that it's sequential, but well... :-)

See code:

import scrapy
from scrapy.http.request import Request
from selenium import webdriver
from scrapy.selector import Selector
from products_scraper.items import ProductItem

class ProductsSpider(scrapy.Spider):
    name = "products_spider"
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/first']

    global manual_urls
    manual_urls = [
    'http://www.example.com/second',
    'http://www.example.com/third'
    ]

    global manual_url_index 
    manual_url_index = 0

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        global manual_url_index

        self.driver.get(response.url)

        hasPostings = True

        while hasPostings:
            try:
                # find and click 'next' inside the try so the loop ends cleanly
                # when there is no next page
                next = self.driver.find_element_by_xpath('//dd[@class="next-page"]/a')
                next.click()
                self.driver.set_script_timeout(30)
                products = self.driver.find_elements_by_css_selector('.products-list article')

                if len(products) == 0:
                    # no more results for this region: queue the next manual URL, if any
                    if manual_url_index < len(manual_urls):
                        yield Request(manual_urls[manual_url_index],
                                      callback=self.parse)
                        manual_url_index += 1

                    hasPostings = False

                for product in products:
                    item = ProductItem()
                    # store product info here
                    yield item 

            except Exception, e:
                print str(e)
                break



    # note: hook this method up to the spider_closed signal (or rename it to
    # closed(self, reason), which Scrapy calls automatically) so the browser
    # is shut down when the spider finishes
    def spider_closed(self, spider):
        self.driver.quit()
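
As a follow-up on the design choice: the module-level globals work, but the same bookkeeping can also be kept as instance state on the spider, which avoids the global statements entirely. Below is a minimal, hypothetical sketch of that variant (spider name and URLs are placeholders, and the Selenium pagination / item-yielding part is elided):

import scrapy
from scrapy.http import Request
from selenium import webdriver

class ProductsSpiderAlt(scrapy.Spider):
    # hypothetical variant of the spider above, keeping the URL queue
    # on the instance instead of in module-level globals
    name = "products_spider_alt"
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/first']
    manual_urls = [
        'http://www.example.com/second',
        'http://www.example.com/third',
    ]

    def __init__(self, *args, **kwargs):
        super(ProductsSpiderAlt, self).__init__(*args, **kwargs)
        self.driver = webdriver.Firefox()
        self.manual_url_index = 0  # next entry of manual_urls to request

    def parse(self, response):
        self.driver.get(response.url)

        # ... paginate with Selenium and yield items for this region here ...

        # once this region is exhausted, queue the next manual URL (if any)
        if self.manual_url_index < len(self.manual_urls):
            url = self.manual_urls[self.manual_url_index]
            self.manual_url_index += 1
            yield Request(url, callback=self.parse)

    def closed(self, reason):
        # called automatically by Scrapy when the spider finishes
        self.driver.quit()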