
I am new to Scrapy and Python. I was trying to retrieve data from https://in.bookmyshow.com/movies, since I need the information of all the movies listed there. But something is wrong with my code, and I would like to know where I have gone wrong.

rules = (Rule(SgmlLinkExtractor(allow=(r'https://in\.bookmyshow\.com/movies/.*',)), callback="parse_items", follow=True),)


def parse_items(self, response):
    for sel in response.xpath('//div[contains(@class, "movie-card")]'):
        item = Ex1Item()
        item['Moviename'] = sel.xpath('.//a[@class="__movie-name"]/text()').extract()
        item['Language'] = sel.xpath('/html/body/div[1]/div[2]/div/div[1]/div[2]/section[1]/div/div[2]/div[1]/div[1]/div/div/div[2]/div[2]/ul/li/text()').extract()
        item['Info'] = sel.xpath('.//div[@class="__rounded-box __genre"]/text()').extract()
        item['Synopsis'] = sel.xpath('/html/body/div[1]/div[2]/div[1]/div[2]/div[4]/div[2]/div[2]/blockquote/text()').extract()
        item['Release'] = sel.xpath('.//span[@class="__release-date"]/text()').extract()
        yield item

1 Answer


Your code seems to be fine. Perhaps the problem is outside of the part you posted here.

This worked for me:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
# SgmlLinkExtractor is deprecated; in Scrapy >= 1.0 you can use
# from scrapy.linkextractors import LinkExtractor instead
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
# Ex1Item is your project's item class; import it from your items module,
# e.g. from <yourproject>.items import Ex1Item


class BookmyshowSpider(CrawlSpider):
    name = "bookmyshow"
    start_urls = ['https://in.bookmyshow.com/movies']
    allowed_domains = ['bookmyshow.com']
    rules = (Rule(SgmlLinkExtractor(allow=(r'https://in\.bookmyshow\.com/movies/.*',)), callback="parse_items", follow=True),)

    def parse_items(self, response):
        for sel in response.xpath('//div[contains(@class, "movie-card")]'):
            item = Ex1Item()
            item['Moviename'] = sel.xpath('.//a[@class="__movie-name"]/text()').extract()
            item['Language'] = sel.xpath('/html/body/div[1]/div[2]/div/div[1]/div[2]/section[1]/div/div[2]/div[1]/div[1]/div/div/div[2]/div[2]/ul/li/text()').extract()
            item['Info'] = sel.xpath('.//div[@class="__rounded-box __genre"]/text()').extract()
            item['Synopsis'] = sel.xpath('/html/body/div[1]/div[2]/div[1]/div[2]/div[4]/div[2]/div[2]/blockquote/text()').extract()
            item['Release'] = sel.xpath('.//span[@class="__release-date"]/text()').extract()
            yield item

EDIT: A version using the standard spider class scrapy.Spider:

import scrapy

class BookmyshowSpider(scrapy.Spider):
    name = "bookmyshow"
    start_urls = ['https://in.bookmyshow.com/movies']
    allowed_domains = ['bookmyshow.com']

    def parse(self, response):
        # collect every href that looks like a link to an individual movie page
        links = response.xpath('//a/@href').re(r'movies/[^/]+/.*$')
        for url in set(links):  # set() removes duplicate links
            url = response.urljoin(url)
            yield scrapy.Request(url, callback=self.parse_movie)

    def parse_movie(self, response):
        # each movie page contains a "movie-card" block with the details we want
        for sel in response.xpath('//div[contains(@class, "movie-card")]'):
            item = {}
            item['Moviename'] = sel.xpath('.//a[@class="__movie-name"]/text()').extract()
            item['Language'] = sel.xpath('/html/body/div[1]/div[2]/div/div[1]/div[2]/section[1]/div/div[2]/div[1]/div[1]/div/div/div[2]/div[2]/ul/li/text()').extract()
            item['Info'] = sel.xpath('.//div[@class="__rounded-box __genre"]/text()').extract()
            item['Synopsis'] = sel.xpath('/html/body/div[1]/div[2]/div[1]/div[2]/div[4]/div[2]/div[2]/blockquote/text()').extract()
            item['Release'] = sel.xpath('.//span[@class="__release-date"]/text()').extract()
            yield item

parse() extracts all links to movie pages from the start page. parse_movie() is used as the callback for the requests to the individual movie pages. With this version you certainly have more control over the spider's behavior.
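If you want to try the spider outside a full Scrapy project, here is a minimal runner sketch. The filename bookmyshow_spider.py is a hypothetical module name, and the feed settings shown match older Scrapy releases (from 2.1 on, the single FEEDS setting replaces them):

from scrapy.crawler import CrawlerProcess
# hypothetical module name; adjust to wherever you saved the spider
from bookmyshow_spider import BookmyshowSpider

process = CrawlerProcess(settings={
    'FEED_URI': 'movies.json',   # Scrapy >= 2.1 uses the FEEDS setting instead
    'FEED_FORMAT': 'json',
})
process.crawl(BookmyshowSpider)
process.start()  # blocks until the crawl finishes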


3 Comments

The code is iterating multiple times and also seems to run in an infinite loop. I am able to extract the data, but not in a proper order. Anyway, thank you for going through my question.
You are using Rules and a spider of type CrawlSpider. This type of spider starts at the start_urls and follows all links matching your rule. The spider in my example won't get stuck in an infinite loop, though it takes some time, as it crawls more than 1000 movie pages. Is this the behavior you really want? If not, please describe a little more precisely what you want, and I can tell you which other spider type would be a better fit.
I want to take the information of each movie from its page. It is more like fetching the movie name from the main page and its info from another page, and this has to be done for every movie present on the main page. I don't think I have to use CrawlSpider for this. However, thanks for the help.
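For what the last comment describes, a plain scrapy.Spider that picks the movie name up on the listing page and carries it to each detail page via Request.meta would fit. A minimal sketch, assuming the listing-page links carry the same __movie-name class used in the answer above (the spider name and XPaths are illustrative and may need adjusting against the live page):

import scrapy

class MovieInfoSpider(scrapy.Spider):  # illustrative spider name
    name = "movieinfo"
    start_urls = ['https://in.bookmyshow.com/movies']
    allowed_domains = ['bookmyshow.com']

    def parse(self, response):
        # take the movie name from the main page and pass it along with the request
        for link in response.xpath('//a[@class="__movie-name"]'):
            name = link.xpath('text()').extract_first()
            url = response.urljoin(link.xpath('@href').extract_first())
            yield scrapy.Request(url, callback=self.parse_movie,
                                 meta={'Moviename': name})

    def parse_movie(self, response):
        # the name came from the main page; the rest comes from the movie page
        item = {'Moviename': response.meta['Moviename']}
        item['Info'] = response.xpath('//div[@class="__rounded-box __genre"]/text()').extract()
        item['Release'] = response.xpath('//span[@class="__release-date"]/text()').extract()
        yield item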
