
I am new to Scrapy and Python. I was trying to retrieve data from https://in.bookmyshow.com/movies, since I need the information of all the movies listed there. But something is wrong with my code, and I would like to know where I have gone wrong.

rules = (Rule(SgmlLinkExtractor(allow=(r'https://in\.bookmyshow\.com/movies/.*',)), callback="parse_items", follow=True),)


def parse_items(self, response):
    for sel in response.xpath('//div[contains(@class, "movie-card")]'):
        item = Ex1Item()
        item['Moviename'] = sel.xpath('.//a[@class="__movie-name"]/text()').extract()
        item['Language'] = sel.xpath('/html/body/div[1]/div[2]/div/div[1]/div[2]/section[1]/div/div[2]/div[1]/div[1]/div/div/div[2]/div[2]/ul/li/text()').extract()
        item['Info'] = sel.xpath('.//div[@class="__rounded-box __genre"]/text()').extract()
        item['Synopsis'] = sel.xpath('/html/body/div[1]/div[2]/div[1]/div[2]/div[4]/div[2]/div[2]/blockquote/text()').extract()
        item['Release'] = sel.xpath('.//span[@class="__release-date"]/text()').extract()
        yield item

1 Answer


Your code seems to be fine. Perhaps the problem is outside of the part you posted here.

This worked for me:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
# SgmlLinkExtractor is deprecated; in Scrapy >= 1.0 you can use
# from scrapy.linkextractors import LinkExtractor instead
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
# Ex1Item is your project's item class; import it from your items module,
# e.g. from <yourproject>.items import Ex1Item


class BookmyshowSpider(CrawlSpider):
    name = "bookmyshow"
    start_urls = ['https://in.bookmyshow.com/movies']
    allowed_domains = ['bookmyshow.com']
    rules = (Rule(SgmlLinkExtractor(allow=(r'https://in\.bookmyshow\.com/movies/.*',)), callback="parse_items", follow=True),)

    def parse_items(self, response):
        for sel in response.xpath('//div[contains(@class, "movie-card")]'):
            item = Ex1Item()
            item['Moviename'] = sel.xpath('.//a[@class="__movie-name"]/text()').extract()
            item['Language'] = sel.xpath('/html/body/div[1]/div[2]/div/div[1]/div[2]/section[1]/div/div[2]/div[1]/div[1]/div/div/div[2]/div[2]/ul/li/text()').extract()
            item['Info'] = sel.xpath('.//div[@class="__rounded-box __genre"]/text()').extract()
            item['Synopsis'] = sel.xpath('/html/body/div[1]/div[2]/div[1]/div[2]/div[4]/div[2]/div[2]/blockquote/text()').extract()
            item['Release'] = sel.xpath('.//span[@class="__release-date"]/text()').extract()
            yield item

EDIT: A version using the standard spider class scrapy.Spider:

import scrapy

class BookmyshowSpider(scrapy.Spider):
    name = "bookmyshow"
    start_urls = ['https://in.bookmyshow.com/movies']
    allowed_domains = ['bookmyshow.com']

    def parse(self, response):
        # collect every href that looks like a link to an individual movie page
        links = response.xpath('//a/@href').re(r'movies/[^/]+/.*$')
        for url in set(links):  # set() removes duplicate links
            url = response.urljoin(url)
            yield scrapy.Request(url, callback=self.parse_movie)

    def parse_movie(self, response):
        # each movie page contains a "movie-card" block with the details we want
        for sel in response.xpath('//div[contains(@class, "movie-card")]'):
            item = {}
            item['Moviename'] = sel.xpath('.//a[@class="__movie-name"]/text()').extract()
            item['Language'] = sel.xpath('/html/body/div[1]/div[2]/div/div[1]/div[2]/section[1]/div/div[2]/div[1]/div[1]/div/div/div[2]/div[2]/ul/li/text()').extract()
            item['Info'] = sel.xpath('.//div[@class="__rounded-box __genre"]/text()').extract()
            item['Synopsis'] = sel.xpath('/html/body/div[1]/div[2]/div[1]/div[2]/div[4]/div[2]/div[2]/blockquote/text()').extract()
            item['Release'] = sel.xpath('.//span[@class="__release-date"]/text()').extract()
            yield item

parse() extracts all links to movie pages from the start page. parse_movie() is used as the callback for the requests to the individual movie pages. With this version you certainly have more control over the spider's behavior.
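If you want to try the spider outside a full Scrapy project, here is a minimal runner sketch. The filename bookmyshow_spider.py is a hypothetical module name, and the feed settings shown match older Scrapy releases (from 2.1 on, the single FEEDS setting replaces them):

from scrapy.crawler import CrawlerProcess
# hypothetical module name; adjust to wherever you saved the spider
from bookmyshow_spider import BookmyshowSpider

process = CrawlerProcess(settings={
    'FEED_URI': 'movies.json',   # Scrapy >= 2.1 uses the FEEDS setting instead
    'FEED_FORMAT': 'json',
})
process.crawl(BookmyshowSpider)
process.start()  # blocks until the crawl finishes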


3 Comments

The code is iterating multiple times and also seems to run in an infinite loop. I am able to extract the data, but not in a proper order. Anyway, thank you for going through my question.
You are using Rules and a spider of type CrawlSpider. This type of spider starts at the start_urls and follows all links matching your rule. The spider in my example won't get stuck in an infinite loop, though it takes some time, as it crawls more than 1000 movie pages. Is this the behavior you really want? If not, please describe a little more precisely what you want, and I can tell you which other spider type would be a better fit.
I want to take the information of each movie from its page. It is more like fetching the movie name from the main page and its info from another page, and this has to be done for every movie present on the main page. I don't think I have to use CrawlSpider for this. However, thanks for the help.
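For what the last comment describes, a plain scrapy.Spider that picks the movie name up on the listing page and carries it to each detail page via Request.meta would fit. A minimal sketch, assuming the listing-page links carry the same __movie-name class used in the answer above (the spider name and XPaths are illustrative and may need adjusting against the live page):

import scrapy

class MovieInfoSpider(scrapy.Spider):  # illustrative spider name
    name = "movieinfo"
    start_urls = ['https://in.bookmyshow.com/movies']
    allowed_domains = ['bookmyshow.com']

    def parse(self, response):
        # take the movie name from the main page and pass it along with the request
        for link in response.xpath('//a[@class="__movie-name"]'):
            name = link.xpath('text()').extract_first()
            url = response.urljoin(link.xpath('@href').extract_first())
            yield scrapy.Request(url, callback=self.parse_movie,
                                 meta={'Moviename': name})

    def parse_movie(self, response):
        # the name came from the main page; the rest comes from the movie page
        item = {'Moviename': response.meta['Moviename']}
        item['Info'] = response.xpath('//div[@class="__rounded-box __genre"]/text()').extract()
        item['Release'] = response.xpath('//span[@class="__release-date"]/text()').extract()
        yield item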
