I have been tasked with building a scraper for a property site, where the results will be stored for later processing. The site is a national one that will not return all of its content in a single search; it expects you to choose a region before showing any results. To get around this I have built a Scrapy spider with multiple start URLs, each of which takes me directly to a region I'm interested in. The site is also dynamically populated, so I am using Selenium to render the JavaScript on each page and then follow the 'next' button until the scraper has finished with a region.

This works well when there is a single start URL, but as soon as there is more than one I run into a problem. The scraper starts off fine, but before the webdriver has finished following the 'next' button to the end of a region (there may be 20 pages to follow for a single region), the scraper moves on to the next region (start URL), having only partially scraped the first region's content.

I have looked extensively for a solution, but I have yet to see anyone with this particular issue. Any suggestions would be most welcome. Example code below:
from scrapy.spiders import CrawlSpider
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium_spider.items import DemoSpiderItem

class DemoSpider(CrawlSpider):
    name = "Demo"
    allowed_domains = ['example.com']
    start_urls = ["http://www.example.co.uk/locationIdentifier=REGION 1234",
                  "http://www.example.co.uk/property-for-sale/locationIdentifier=REGION 5678"]

    def __init__(self, *args, **kwargs):
        super(DemoSpider, self).__init__(*args, **kwargs)
        # A single Firefox instance is shared by every region (start URL).
        self.driver = webdriver.Firefox()

    def __del__(self):
        self.driver.quit()
    def parse(self, response):
        # Load the region's start URL in the shared Firefox instance.
        self.driver.get(response.url)
        # The result set comes from the Scrapy response, not the Selenium-rendered page.
        result = response.xpath('//*[@class="l-searchResults"]')
        source = 'aTest'
        while True:
            try:
                # Wait for the 'next' pagination button to become clickable.
                element = WebDriverWait(self.driver, 10).until(
                    EC.element_to_be_clickable((By.CSS_SELECTOR, ".pagination-button.pagination-direction.pagination-direction--next"))
                )
                print "Scraping new site --------------->", result
                print "This is the result----------->", result
                for properties in result:
                    saleOrRent = properties.xpath('//*[@class="property-title"]/text()').extract()
                    addresses = properties.xpath('//*[@class="property-address"]/text()').extract()
                    if saleOrRent:
                        saleOrRent = saleOrRent[0]
                        if 'for sale' in saleOrRent:
                            saleOrRent = 'For Sale'
                        elif 'to rent' in saleOrRent:
                            saleOrRent = 'To Rent'
                    for a in addresses:
                        item = DemoSpiderItem()
                        item["saleOrRent"] = saleOrRent
                        item["source"] = source
                        item["address"] = a
                        item["response"] = response
                        yield item
                # Follow the 'next' button to the region's next page.
                element.click()
            except TimeoutException:
                # No clickable 'next' button within 10 seconds -- assume the last page.
                break
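
If it helps to see what I mean by the overlap, I believe a couple of temporary debug lines at the top of parse would show the shared Firefox instance already sitting on the second region's URL while the first region's while loop is still clicking through pages (driver.current_url is standard Selenium; these lines are purely illustrative and not part of my actual spider):

    # Illustrative only -- temporary debug lines to show that the shared driver
    # has already been navigated away by the other region's request.
    def parse(self, response):
        print "parse() called for start URL:  ", response.url
        print "shared driver is currently on: ", self.driver.current_url
        self.driver.get(response.url)
        # ... rest of parse() unchanged ...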