1

I am fairly new to scraping pages using Scrapy. While trying to scrape quotes along with details of each author from the respective links for them I encountered a problem.

import scrapy

class QuotesProject(scrapy.Spider):
    name = 'quote'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        item = {}
        for x in response.css('.quote'):
            item['quote'] = x.css('.text::text').get()
            item['author'] = x.css('.author::text').get()
            item['href'] = response.urljoin(x.css('a::attr(href)').get())

            yield scrapy.Request(item['href'], callback=self.parse_inside, meta={'item': item})

    def parse_inside(self, response):
        item = response.meta['item']
        item['aauthor'] = response.css('h3::text').get()
        return item

The desired output for each quote is as follows, where author and aauthor should have the same value(but aauthor is fetched from another page):

{'quote': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'href': 'http://quotes.toscrape.com/author/Steve-Martin', 'aauthor': 'Steve Martin'}

However I'm getting quite unexpected output

2019-04-04 15:45:52 [scrapy.core.engine] INFO: Spider opened
2019-04-04 15:45:52 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-04 15:45:52 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-04-04 15:45:53 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2019-04-04 15:45:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
2019-04-04 15:45:53 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://quotes.toscrape.com/author/Albert-Einstein> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2019-04-04 15:45:53 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://quotes.toscrape.com/author/Andre-Gide/> from <GET http://quotes.toscrape.com/author/Andre-Gide>
2019-04-04 15:45:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Andre-Gide/> (referer: http://quotes.toscrape.com/)
2019-04-04 15:45:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://quotes.toscrape.com/author/Albert-Einstein/> from <GET http://quotes.toscrape.com/author/Albert-Einstein>
2019-04-04 15:45:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://quotes.toscrape.com/author/Marilyn-Monroe/> from <GET http://quotes.toscrape.com/author/Marilyn-Monroe>
2019-04-04 15:45:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://quotes.toscrape.com/author/J-K-Rowling/> from <GET http://quotes.toscrape.com/author/J-K-Rowling>
2019-04-04 15:45:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://quotes.toscrape.com/author/Eleanor-Roosevelt/> from <GET http://quotes.toscrape.com/author/Eleanor-Roosevelt>
2019-04-04 15:45:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://quotes.toscrape.com/author/Steve-Martin/> from <GET http://quotes.toscrape.com/author/Steve-Martin>
2019-04-04 15:45:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://quotes.toscrape.com/author/Jane-Austen/> from <GET http://quotes.toscrape.com/author/Jane-Austen>
2019-04-04 15:45:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://quotes.toscrape.com/author/Thomas-A-Edison/> from <GET http://quotes.toscrape.com/author/Thomas-A-Edison>
2019-04-04 15:45:54 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Andre-Gide/>
{'quote': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'href': 'http://quotes.toscrape.com/author/Steve-Martin', 'aauthor': 'André Gide\n    '}
2019-04-04 15:45:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/J-K-Rowling/> (referer: http://quotes.toscrape.com/)
2019-04-04 15:45:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Jane-Austen/> (referer: http://quotes.toscrape.com/)
2019-04-04 15:45:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Eleanor-Roosevelt/> (referer: http://quotes.toscrape.com/)
2019-04-04 15:45:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Albert-Einstein/> (referer: http://quotes.toscrape.com/)
2019-04-04 15:45:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Marilyn-Monroe/> (referer: http://quotes.toscrape.com/)
2019-04-04 15:45:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Steve-Martin/> (referer: http://quotes.toscrape.com/)
2019-04-04 15:45:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Thomas-A-Edison/> (referer: http://quotes.toscrape.com/)
2019-04-04 15:45:54 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/J-K-Rowling/>
{'quote': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'href': 'http://quotes.toscrape.com/author/Steve-Martin', 'aauthor': 'J.K. Rowling\n    '}
2019-04-04 15:45:54 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Jane-Austen/>
{'quote': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'href': 'http://quotes.toscrape.com/author/Steve-Martin', 'aauthor': 'Eleanor Roosevelt\n    '}
2019-04-04 15:45:54 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Eleanor-Roosevelt/>
{'quote': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'href': 'http://quotes.toscrape.com/author/Steve-Martin', 'aauthor': 'Marilyn Monroe\n    '}
2019-04-04 15:45:54 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Albert-Einstein/>
{'quote': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'href': 'http://quotes.toscrape.com/author/Steve-Martin', 'aauthor': 'Steve Martin\n    '}
2019-04-04 15:45:54 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Marilyn-Monroe/>
{'quote': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'href': 'http://quotes.toscrape.com/author/Steve-Martin', 'aauthor': 'Steve Martin\n    '}
2019-04-04 15:45:54 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Steve-Martin/>
{'quote': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'href': 'http://quotes.toscrape.com/author/Steve-Martin', 'aauthor': 'Steve Martin\n    '}
2019-04-04 15:45:54 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Thomas-A-Edison/>
{'quote': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'href': 'http://quotes.toscrape.com/author/Steve-Martin', 'aauthor': 'Thomas A. Edison\n    '}

It seems to complete all iterations of the parse() method and use the last item dictionary for later links. But if that's the case, all the aauthor values should have been the same. I searched a lot for the solution, but everything was beyond what I could understand at this point.Also, the requests seem to be asynchronous.

Would appreciate if someone explains the problem along with a working solution

1 Answer 1

1

Your code is good, just move item creation to cycle, otherwise this is the same object with the same data:

def parse(self, response):
    for x in response.css('.quote'):
        item = {}
        item['quote'] = x.css('.text::text').get()
        item['author'] = x.css('.author::text').get()
        item['href'] = response.urljoin(x.css('a::attr(href)').get())
        yield scrapy.Request(item['href'], callback=self.parse_inside, meta={'item': item})
Sign up to request clarification or add additional context in comments.

2 Comments

Thanks. It worked!! I know that it's a basic Python question, but should not the values get overridden each time the loop runs and fetches a new value?
As far as I understand, you have a "link" to one object, so since going through loop is faster than async calling new requests, you have just last version of data in this object. Correct me, if I am wrong.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.