Scrapy - Creating nested JSON Object

Question

I'm learning how to work with Scrapy while refreshing the Python/coding knowledge I picked up in school.

Currently, I'm playing around with the IMDb Top 250 list but struggling with the JSON output file.

My current code is:

# -*- coding: utf-8 -*-
import scrapy

from top250imdb.items import Top250ImdbItem


class ActorsSpider(scrapy.Spider):
    name = "actors"
    allowed_domains = ["imdb.com"]
    start_urls = ['http://www.imdb.com/chart/top']

    # Parsing each movie and preparing the url for the actors list
    def parse(self, response):
        for film in response.css('.titleColumn'):
            url = film.css('a::attr(href)').extract_first()
            actors_url = 'http://imdb.com' + url[:17] + 'fullcredits?ref_=tt_cl_sm#cast'
            yield scrapy.Request(actors_url, self.parse_actor)

    # Finding all actors and storing them on item
    # Refer to items.py
    def parse_actor(self, response):
        final_list = []
        item = Top250ImdbItem()
        item['poster'] = response.css('#main img::attr(src)').extract_first()
        item['title'] = response.css('h3[itemprop~=name] a::text').extract()
        item['photo'] = response.css('#fullcredits_content .loadlate::attr(loadlate)').extract()
        item['actors'] = response.css('td[itemprop~=actor] span::text').extract()

        final_list.append(item)

        updated_list = []

        for item in final_list:
            for i in range(len(item['title'])):
                sub_item = {}
                sub_item['movie'] = {}
                sub_item['movie']['poster'] = [item['poster']]
                sub_item['movie']['title'] = [item['title'][i]]
                sub_item['movie']['photo'] = [item['photo']]
                sub_item['movie']['actors'] = [item['actors']]
                updated_list.append(sub_item)
            return updated_list

and my output file has this JSON structure:

[
  {
    "movie": {
      "poster": ["https://images-na.ssl-images-amazon.com/poster..."], 
      "title": ["The Shawshank Redemption"], 
      "photo": [["https://images-na.ssl-images-amazon.com/photo..."]], 
      "actors": [["Tim Robbins","Morgan Freeman",...]]}
    },{
    "movie": {
      "poster": ["https://images-na.ssl-images-amazon.com/poster..."], 
      "title": ["The Godfather"], 
      "photo": [["https://images-na.ssl-images-amazon.com/photo..."]], 
      "actors": [["Alexandre Rodrigues", "Leandro Firmino", "Phellipe Haagensen",...]]}
  }
]

but I'm looking to achieve this:

{
  "movies": [{
    "poster": "https://images-na.ssl-images-amazon.com/poster...",
    "title": "The Shawshank Redemption",
    "actors": [
      {"photo": "https://images-na.ssl-images-amazon.com/photo...",
      "name": "Tim Robbins"},
      {"photo": "https://images-na.ssl-images-amazon.com/photo...",
      "name": "Morgan Freeman"},...
    ]
  },{
    "poster": "https://images-na.ssl-images-amazon.com/poster...",
    "title": "The Godfather",
    "actors": [
      {"photo": "https://images-na.ssl-images-amazon.com/photo...",
      "name": "Marlon Brando"},
      {"photo": "https://images-na.ssl-images-amazon.com/photo...",
      "name": "Al Pacino"},...
    ]
  }]
}

In my items.py file I have the following:

import scrapy


class Top250ImdbItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # Items from actors.py
    poster = scrapy.Field()
    title = scrapy.Field()
    photo = scrapy.Field()
    actors = scrapy.Field()
    movie = scrapy.Field()
    pass

I'm aware of the following things:

My results are not coming out in order: the first movie on the web page is always the first movie in my output file, but the rest are not in order. I'm still working on that.
I could do the same thing with Top250ImdbItem(); I'm still reading up on how that is done in more detail.
This might not be the perfect layout for my JSON. Suggestions are welcome, or if it is fine, let me know, even though I know there is no perfect way or "the only way".
Some actors don't have a photo, and their rows actually use a different CSS selector. For now, I would like to avoid fetching the "no picture" thumbnail, so it's OK to leave the photo empty for those actors.

example:

{"photo": "", "name": "Al Pacino"}
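For the last point above, one possible approach is to pair each actor's name with their photo per cast row instead of zipping two separately extracted lists, so that a missing photo cannot shift the remaining entries. The sketch below assumes the spider has already produced one (name, photo) pair per row, with photo set to None when the row has no image (Scrapy's extract_first() returns None when nothing matches). build_actor_entries is a hypothetical helper name, not part of Scrapy, and the URL is a placeholder.

```python
def build_actor_entries(rows):
    """Turn (name, photo) pairs into [{"photo": ..., "name": ...}, ...].

    `rows` is assumed to hold one pair per cast row; `photo` may be None
    or empty when the actor has no picture.
    """
    entries = []
    for name, photo in rows:
        if not name or not name.strip():
            continue  # skip rows without a usable name
        # A missing photo becomes an empty string, as in the example above
        entries.append({"photo": photo or "", "name": name.strip()})
    return entries


rows = [
    ("Marlon Brando", "https://example.com/brando.jpg"),  # placeholder URL
    (" Al Pacino\n", None),  # this row has no photo
]
print(build_actor_entries(rows))
```

Building the pairs row by row keeps each name next to its own photo, which is the part that two independent lists cannot guarantee.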

Don't use scrapy.Item; use a dict and start with movies: [].
– stovfl, Commented Jul 18, 2017 at 13:07

stovfl · Accepted Answer · 2017-07-19 17:11:26Z


Question: ... struggling with a JSON output file

Note: I can't run your ActorsSpider; I get the error: Pseudo-elements are not supported.

# Define the dict once, at module level, so every callback appends to it
top250ImdbItem = {'movies': []}

def parse_actor(self, response):
    # Selectors and extract calls are elided here; use the ones from the question
    poster = response.css(...)
    title = response.css(...)
    photos = response.css(...)
    actors = response.css(...)

    # Assuming the list of actors is in sync with the list of photos
    actors_list = []
    for i, actor in enumerate(actors):
        actors_list.append({"name": actor, "photo": photos[i]})

    one_movie = {"poster": poster,
                 "title": title,
                 "actors": actors_list}

    # Append this movie to the Top 250 'movies' list
    top250ImdbItem['movies'].append(one_movie)
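Two things the snippet above leaves open are how the collected dict reaches a file and how to keep the chart order, since Scrapy schedules requests asynchronously and responses can arrive in any order. A minimal sketch, under stated assumptions: each movie is stored together with its chart position (which could be passed from parse to parse_actor, for example through the request's meta dict), and the list is sorted once before being serialized. MovieCollector and its methods are hypothetical names for illustration, not Scrapy API, and the poster values are placeholders.

```python
import json


class MovieCollector:
    """Collects movies with their chart rank and emits {"movies": [...]}."""

    def __init__(self):
        self._ranked = []  # list of (rank, movie_dict) tuples

    def add(self, rank, poster, title, actors):
        movie = {"poster": poster, "title": title, "actors": list(actors)}
        self._ranked.append((rank, movie))

    def as_dict(self):
        # Sort on the rank only, so the movie dicts are never compared
        ordered = sorted(self._ranked, key=lambda pair: pair[0])
        return {"movies": [movie for _, movie in ordered]}

    def to_json(self):
        return json.dumps(self.as_dict(), indent=2)


collector = MovieCollector()
# Added out of order on purpose, the way responses may arrive
collector.add(2, "poster-2", "The Godfather",
              [{"photo": "", "name": "Al Pacino"}])
collector.add(1, "poster-1", "The Shawshank Redemption",
              [{"photo": "", "name": "Tim Robbins"}])
print(collector.to_json())
```

In a spider, one place to call to_json() and write the result to disk would be the spider's closed(reason) method, which Scrapy calls when the spider finishes.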


3 Comments

ricardoNava Over a year ago

OK, I'll check that. It's kind of weird that you can't run it; I'm still using the exact same code. I'll look into that problem too and update the question so you can run it. I'll try those suggestions. And no, the photos and actors are actually not in sync yet; I'm still figuring out how to do that. Your help is great, though.

ricardoNava Over a year ago

Should I post my modified working code as a comment here, edit the current one or just leave it as it is?

stovfl Over a year ago

Edit your question and add only the changed parts.
