Scraping data from current as well as nested links at the same time using Scrapy

Question

I am fairly new to scraping pages using Scrapy. While trying to scrape quotes along with details of each author from the respective links for them I encountered a problem.

import scrapy

class QuotesProject(scrapy.Spider):
    name = 'quote'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        item = {}
        for x in response.css('.quote'):
            item['quote'] = x.css('.text::text').get()
            item['author'] = x.css('.author::text').get()
            item['href'] = response.urljoin(x.css('a::attr(href)').get())

            yield scrapy.Request(item['href'], callback=self.parse_inside, meta={'item': item})

    def parse_inside(self, response):
        item = response.meta['item']
        item['aauthor'] = response.css('h3::text').get()
        return item

The desired output for each quote is as follows, where author and aauthor should have the same value(but aauthor is fetched from another page):

{'quote': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'href': 'http://quotes.toscrape.com/author/Steve-Martin', 'aauthor': 'Steve Martin'}

However I'm getting quite unexpected output

2019-04-04 15:45:52 [scrapy.core.engine] INFO: Spider opened
2019-04-04 15:45:52 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-04 15:45:52 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-04-04 15:45:53 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2019-04-04 15:45:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
2019-04-04 15:45:53 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://quotes.toscrape.com/author/Albert-Einstein> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2019-04-04 15:45:53 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://quotes.toscrape.com/author/Andre-Gide/> from <GET http://quotes.toscrape.com/author/Andre-Gide>
2019-04-04 15:45:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Andre-Gide/> (referer: http://quotes.toscrape.com/)
2019-04-04 15:45:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://quotes.toscrape.com/author/Albert-Einstein/> from <GET http://quotes.toscrape.com/author/Albert-Einstein>
2019-04-04 15:45:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://quotes.toscrape.com/author/Marilyn-Monroe/> from <GET http://quotes.toscrape.com/author/Marilyn-Monroe>
2019-04-04 15:45:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://quotes.toscrape.com/author/J-K-Rowling/> from <GET http://quotes.toscrape.com/author/J-K-Rowling>
2019-04-04 15:45:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://quotes.toscrape.com/author/Eleanor-Roosevelt/> from <GET http://quotes.toscrape.com/author/Eleanor-Roosevelt>
2019-04-04 15:45:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://quotes.toscrape.com/author/Steve-Martin/> from <GET http://quotes.toscrape.com/author/Steve-Martin>
2019-04-04 15:45:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://quotes.toscrape.com/author/Jane-Austen/> from <GET http://quotes.toscrape.com/author/Jane-Austen>
2019-04-04 15:45:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://quotes.toscrape.com/author/Thomas-A-Edison/> from <GET http://quotes.toscrape.com/author/Thomas-A-Edison>
2019-04-04 15:45:54 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Andre-Gide/>
{'quote': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'href': 'http://quotes.toscrape.com/author/Steve-Martin', 'aauthor': 'André Gide\n    '}
2019-04-04 15:45:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/J-K-Rowling/> (referer: http://quotes.toscrape.com/)
2019-04-04 15:45:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Jane-Austen/> (referer: http://quotes.toscrape.com/)
2019-04-04 15:45:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Eleanor-Roosevelt/> (referer: http://quotes.toscrape.com/)
2019-04-04 15:45:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Albert-Einstein/> (referer: http://quotes.toscrape.com/)
2019-04-04 15:45:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Marilyn-Monroe/> (referer: http://quotes.toscrape.com/)
2019-04-04 15:45:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Steve-Martin/> (referer: http://quotes.toscrape.com/)
2019-04-04 15:45:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Thomas-A-Edison/> (referer: http://quotes.toscrape.com/)
2019-04-04 15:45:54 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/J-K-Rowling/>
{'quote': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'href': 'http://quotes.toscrape.com/author/Steve-Martin', 'aauthor': 'J.K. Rowling\n    '}
2019-04-04 15:45:54 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Jane-Austen/>
{'quote': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'href': 'http://quotes.toscrape.com/author/Steve-Martin', 'aauthor': 'Eleanor Roosevelt\n    '}
2019-04-04 15:45:54 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Eleanor-Roosevelt/>
{'quote': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'href': 'http://quotes.toscrape.com/author/Steve-Martin', 'aauthor': 'Marilyn Monroe\n    '}
2019-04-04 15:45:54 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Albert-Einstein/>
{'quote': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'href': 'http://quotes.toscrape.com/author/Steve-Martin', 'aauthor': 'Steve Martin\n    '}
2019-04-04 15:45:54 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Marilyn-Monroe/>
{'quote': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'href': 'http://quotes.toscrape.com/author/Steve-Martin', 'aauthor': 'Steve Martin\n    '}
2019-04-04 15:45:54 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Steve-Martin/>
{'quote': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'href': 'http://quotes.toscrape.com/author/Steve-Martin', 'aauthor': 'Steve Martin\n    '}
2019-04-04 15:45:54 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Thomas-A-Edison/>
{'quote': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'href': 'http://quotes.toscrape.com/author/Steve-Martin', 'aauthor': 'Thomas A. Edison\n    '}

It seems to complete all iterations of the parse() method and use the last item dictionary for later links. But if that's the case, all the aauthor values should have been the same. I searched a lot for the solution, but everything was beyond what I could understand at this point.Also, the requests seem to be asynchronous.

Would appreciate if someone explains the problem along with a working solution

vezunchik · Accepted Answer · 2019-04-04 10:55:01Z

1

Your code is good, just move item creation to cycle, otherwise this is the same object with the same data:

def parse(self, response):
    for x in response.css('.quote'):
        item = {}
        item['quote'] = x.css('.text::text').get()
        item['author'] = x.css('.author::text').get()
        item['href'] = response.urljoin(x.css('a::attr(href)').get())
        yield scrapy.Request(item['href'], callback=self.parse_inside, meta={'item': item})

answered Apr 4, 2019 at 10:55

vezunchik

3,7173 gold badges20 silver badges26 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

dkhost07 Over a year ago

Thanks. It worked!! I know that it's a basic Python question, but should not the values get overridden each time the loop runs and fetches a new value?

vezunchik Over a year ago

As far as I understand, you have a "link" to one object, so since going through loop is faster than async calling new requests, you have just last version of data in this object. Correct me, if I am wrong.

Collectives™ on Stack Overflow

Scraping data from current as well as nested links at the same time using Scrapy

1 Answer 1

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related