I'm learning how to work with Scrapy while refreshing the Python and general coding knowledge I picked up at school.

Currently, I'm playing around with the IMDb Top 250 list, but I'm struggling with the JSON output file.

My current code is:

# -*- coding: utf-8 -*-
import scrapy

from top250imdb.items import Top250ImdbItem


class ActorsSpider(scrapy.Spider):
    name = "actors"
    allowed_domains = ["imdb.com"]
    start_urls = ['http://www.imdb.com/chart/top']

    # Parsing each movie and preparing the url for the actors list
    def parse(self, response):
        for film in response.css('.titleColumn'):
            url = film.css('a::attr(href)').extract_first()
            actors_url = 'http://imdb.com' + url[:17] + 'fullcredits?ref_=tt_cl_sm#cast'
            yield scrapy.Request(actors_url, self.parse_actor)

    # Finding all actors and storing them on item
    # Refer to items.py
    def parse_actor(self, response):
        final_list = []
        item = Top250ImdbItem()
        item['poster'] = response.css('#main img::attr(src)').extract_first()
        item['title'] = response.css('h3[itemprop~=name] a::text').extract()
        item['photo'] = response.css('#fullcredits_content .loadlate::attr(loadlate)').extract()
        item['actors'] = response.css('td[itemprop~=actor] span::text').extract()

        final_list.append(item)

        updated_list = []

        for item in final_list:
            for i in range(len(item['title'])):
                sub_item = {}
                sub_item['movie'] = {}
                sub_item['movie']['poster'] = [item['poster']]
                sub_item['movie']['title'] = [item['title'][i]]
                sub_item['movie']['photo'] = [item['photo']]
                sub_item['movie']['actors'] = [item['actors']]
                updated_list.append(sub_item)
            return updated_list

My output file comes out with this JSON structure:

[
  {
    "movie": {
      "poster": ["https://images-na.ssl-images-amazon.com/poster..."],
      "title": ["The Shawshank Redemption"],
      "photo": [["https://images-na.ssl-images-amazon.com/photo..."]],
      "actors": [["Tim Robbins","Morgan Freeman",...]]
    }
  },{
    "movie": {
      "poster": ["https://images-na.ssl-images-amazon.com/poster..."],
      "title": ["The Godfather"],
      "photo": [["https://images-na.ssl-images-amazon.com/photo..."]],
      "actors": [["Alexandre Rodrigues", "Leandro Firmino", "Phellipe Haagensen",...]]
    }
  }
]

but I'm looking to achieve this:

{
  "movies": [{
    "poster": "https://images-na.ssl-images-amazon.com/poster...",
    "title": "The Shawshank Redemption",
    "actors": [
      {"photo": "https://images-na.ssl-images-amazon.com/photo...",
      "name": "Tim Robbins"},
      {"photo": "https://images-na.ssl-images-amazon.com/photo...",
      "name": "Morgan Freeman"},...
    ]
  },{
    "poster": "https://images-na.ssl-images-amazon.com/poster...",
    "title": "The Godfather",
    "actors": [
      {"photo": "https://images-na.ssl-images-amazon.com/photo...",
      "name": "Marlon Brando"},
      {"photo": "https://images-na.ssl-images-amazon.com/photo...",
      "name": "Al Pacino"},...
    ]
  }]
}

In my items.py file I have the following:

import scrapy


class Top250ImdbItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # Items from actors.py
    poster = scrapy.Field()
    title = scrapy.Field()
    photo = scrapy.Field()
    actors = scrapy.Field()
    movie = scrapy.Field()
    pass

I'm aware of the following things:

  1. My results don't come out in order: the first movie on the web page is always the first movie in my output file, but the rest are not. I'm still working on that.

  2. I could do the same thing using Top250ImdbItem(); I'm still reading up on how that is done in more detail.

  3. This might not be the perfect layout for my JSON. Suggestions are welcome, or if it is fine, let me know, even though I know there is no perfect or "only" way.

  4. Some actors don't have a photo, and the page then uses a different CSS selector for them. For now I'd like to avoid fetching the "no picture thumbnail", so it's OK to leave those items empty.

Example:

{"photo": "", "name": "Al Pacino"}
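
One way to get that shape is to build each actor entry from a single cast row instead of from two separate flat lists, so a missing photo cannot shift the entries that follow it. The following is only a minimal sketch in plain Python: `pair_actors_with_photos` and the `rows` input are hypothetical names, and the example URL is a placeholder. In the spider, each `(name, photo)` pair would presumably come from running the selectors on one table row at a time, where `extract_first()` returns `None` when nothing matches.

```python
def pair_actors_with_photos(rows):
    """Build the "actors" list from per-row data.

    `rows` is a hypothetical list of (name, photo_url_or_None) tuples,
    one tuple per cast row on the page.
    """
    actors = []
    for name, photo in rows:
        actors.append({
            # Fall back to an empty string when the row has no thumbnail
            "photo": photo or "",
            # Scraped text often carries surrounding whitespace
            "name": name.strip(),
        })
    return actors


# Placeholder data standing in for two scraped cast rows
rows = [
    ("Marlon Brando", "https://example.com/brando.jpg"),
    (" Al Pacino\n", None),
]
print(pair_actors_with_photos(rows))
```

Because name and photo travel together per row, the two values stay in sync even when some rows have no picture.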
  • Don't use scrapy.Item; use a dict and start with movies: []. Commented Jul 18, 2017 at 13:07
  • Hey @stovfl, can you elaborate a little more? Commented Jul 19, 2017 at 1:16

1 Answer

Question: ... struggling with a JSON output file


Note: I can't run your ActorsSpider; I get the error "Pseudo-elements are not supported".

# Define a `dict` **once**, at module level
top250ImdbItem = {'movies': []}

def parse_actor(self, response):
    # Selectors are left elided here; use the ones from your own spider
    poster = response.css(...)
    title = response.css(...)
    photos = response.css(...)
    actors = response.css(...)

    # Assuming the list of actors is in sync with the list of photos
    actors_list = []
    for actor, photo in zip(actors, photos):
        actors_list.append({"name": actor, "photo": photo})

    one_movie = {"poster": poster,
                 "title": title,
                 "actors": actors_list
                }

    # Append one movie to the Top 250 'movies' list
    top250ImdbItem['movies'].append(one_movie)
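
The dict above is only collected in memory, and Scrapy issues requests concurrently, so the callbacks may finish in any order. As a rough sketch of how the collected movies could be put back in chart order and serialized once at the end, the helpers below use plain Python only. `build_movie`, `dump_movies` and the `rank` key are hypothetical names introduced for illustration; the rank would have to be passed along from `parse` (for example with `enumerate` and the request's `meta`), and the final dump could be called from the spider's `closed()` method.

```python
import json


def build_movie(rank, poster, title, names, photos):
    """Return one movie dict; `rank` is a helper key used only for sorting."""
    if len(names) != len(photos):
        # Refuse to pair misaligned lists rather than mix up actors and photos
        raise ValueError("names and photos must have the same length")
    return {
        "rank": rank,
        "poster": poster,
        "title": title,
        "actors": [{"photo": p, "name": n} for n, p in zip(names, photos)],
    }


def dump_movies(movies):
    """Sort movies by rank, drop the helper key, and return a JSON string."""
    ordered = sorted(movies, key=lambda m: m["rank"])
    cleaned = [{k: v for k, v in m.items() if k != "rank"} for m in ordered]
    return json.dumps({"movies": cleaned}, indent=2)


# Movies arriving out of order, as they might from concurrent callbacks
collected = [
    build_movie(2, "poster-2", "The Godfather", ["Marlon Brando"], ["photo-b"]),
    build_movie(1, "poster-1", "The Shawshank Redemption", ["Tim Robbins"], [""]),
]
print(dump_movies(collected))
```

Writing the file once, after all callbacks have run, gives a single top-level object with a "movies" list, which matches the layout asked for in the question.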

3 Comments

OK, I'll check that. It's kind of weird that you can't run it, since I'm still using the exact same code. I'll look into that problem too and update the question so you can run it. I'll try those suggestions. And no, the photos and actors are not in sync yet; I'm still figuring out how to do that, but your help is great.
Should I post my modified working code as a comment here, edit the current one or just leave it as it is?
Edit your question and add only the changed parts.
