I am trying to scrape this website (which has multiple pages) using Scrapy. The problem is that I can't find the next page URL. Do you have any idea how to scrape a website with multiple pages with Scrapy, or how to solve the error I'm getting with my code?

I tried the code below but it's not working:

import datetime
import string

import scrapy


class AbcdspiderSpider(scrapy.Spider):
    """
    Class docstring
    """
    name = 'abcdspider'
    allowed_domains = ['abcd-terroir.smartrezo.com']

    alphabet = list(string.ascii_lowercase)
    url = "https://abcd-terroir.smartrezo.com/n31-france/annuaireABCD.html?page=1&spe=1&anIDS=31&search="
    start_urls = [url + letter for letter in alphabet]

    main_url = "https://abcd-terroir.smartrezo.com/n31-france/"


    crawl_datetime = str(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
    start_time = datetime.datetime.now()

    def parse(self, response):
        self.crawler.stats.set_value("start_time", self.start_time)
        try:
            page = response.xpath('//div[@class="pageStuff"]/span/text()').get()
            page_max = get_num_page(page)

            for index in range(page_max):
                producer_list = response.xpath('//div[@class="clearfix encart_ann"]/@onclick').getall()
                for producer in producer_list:
                    link_producer = self.main_url + producer
                    yield scrapy.Request(url=link_producer, callback=self.parse_details)

                next_page_url = "/annuaireABCD.html?page={}&spe=1&anIDS=31&search=".format(index)

                if next_page_url is not None:
                    yield scrapy.Request(response.urljoin(self.main_url + next_page_url))

        except Exception as e:
            self.crawler.stats.set_value("error", e.args)

I am getting this error:

'error': ('range() integer end argument expected, got unicode.',)

1 Answer

The error is here:

page = response.xpath('//div[@class="pageStuff"]/span/text()').get()
page_max = get_num_page(page)

The range function expects an integer (1, 2, 3, 4, etc.), not a unicode string such as 'Page 1 / 403'.
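To see the failure in isolation, here is a minimal sketch that passes a string straight to range(). The sample text is the one quoted above; the variable names are only illustrative. On Python 3 the message is worded differently from the Python 2 one in the question, but the cause is the same.

```python
page_text = 'Page 1 / 403'  # the kind of text the XPath returns

try:
    range(page_text)  # a string is not a valid argument for range()
    raised = False
except TypeError:
    raised = True

# An integer argument is accepted without complaint.
pages = range(403)
```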

My suggested fix for the range error is:

# The text looks like 'Page 1 / 403'; keep the part after the slash.
page = response.xpath('//div[@class="pageStuff"]/span/text()').get().split('/ ')[1]

for index in range(int(page)):
    pass  # your actions
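As a slightly more defensive variant, the page count could be pulled out with a regular expression and then used to build one listing URL per page. This is only a sketch under assumptions: parse_page_max and page_urls are hypothetical helper names, the fallback to a single page when the text is missing is my choice, and numbering pages from 1 is inferred from the page=1 in the question's start URL.

```python
import re

# Listing URL taken from the question's start URL.
BASE_URL = "https://abcd-terroir.smartrezo.com/n31-france/annuaireABCD.html"


def parse_page_max(text):
    """Return the total page count from text like 'Page 1 / 403'.

    Falls back to 1 (an assumption) when the text is missing or malformed.
    """
    if not text:
        return 1
    match = re.search(r'/\s*(\d+)', text)
    return int(match.group(1)) if match else 1


def page_urls(page_max, search):
    """Build one listing URL per page, numbering pages from 1 (assumed)."""
    template = BASE_URL + "?page={}&spe=1&anIDS=31&search={}"
    return [template.format(n, search) for n in range(1, page_max + 1)]
```

Inside parse, each URL from page_urls could then be yielded as scrapy.Request(url, callback=...), ideally with a separate callback for the listing pages so that every page does not schedule the whole range again. Note that the search letter is kept in each URL here, whereas the next_page_url in the question drops it.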