Parsing JSON web scraper output

Question

I am practicing web scraping using the requests and BeautifulSoup modules on the following website:

https://www.imdb.com/title/tt0080684/

My code so far correctly prints the JSON in question. I'd like help extracting only the name and description from that JSON into a response dictionary.

Code

# Send HTTP requests
import requests

import json

from bs4 import BeautifulSoup


class WebScraper:

    def send_http_request():

        # Obtain the URL via user input
        url = input('Input the URL:\n')

        # Get the webpage
        r = requests.get(url)

        soup = BeautifulSoup(r.content, 'html.parser')

        # Check response object's status code
        if r:
            p = json.loads("".join(soup.find('script', {'type':'application/ld+json'}).contents))
            print(p)
        else:
            print('\nInvalid movie page!')


WebScraper.send_http_request()

Desired Output

{"title": "Star Wars: Episode V - The Empire Strikes Back", "description": "After the Rebels are brutally overpowered by the Empire on the ice planet Hoth, Luke Skywalker begins Jedi training with Yoda, while his friends are pursued by Darth Vader and a bounty hunter named Boba Fett all over the galaxy."}

GAP2002 · Accepted Answer · 2021-03-04 20:39:23Z

You can pick the two values out of the parsed dictionary and then print them as a new JSON object using the json.dumps method:

# Send HTTP requests
import requests

import json

from bs4 import BeautifulSoup


class WebScraper:

    def send_http_request():

        # Obtain the URL via user input
        url = input('Input the URL:\n')

        # Get the webpage
        r = requests.get(url)

        soup = BeautifulSoup(r.content, 'html.parser')

        # Check response object's status code
        if r:
            p = json.loads("".join(soup.find('script', {'type':'application/ld+json'}).contents))
            output = json.dumps({"title": p["name"], "description": p["description"]})
            print(output)
        else:
            print('\nInvalid movie page!')


WebScraper.send_http_request()

Output:

{"title": "Star Wars: Episode V - The Empire Strikes Back", "description": "Star Wars: Episode V - The Empire Strikes Back is a movie starring Mark Hamill, Harrison Ford, and Carrie Fisher. After the Rebels are brutally overpowered by the Empire on the ice planet Hoth, Luke Skywalker begins Jedi training..."}
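If the page's metadata ever lacks one of the two keys, indexing with p["name"] would raise a KeyError. A minimal sketch of a more forgiving variant is below; the helper name summarize and the sample dictionary are hypothetical stand-ins for illustration, not part of the original code or the real IMDb data.

```python
import json


def summarize(metadata):
    """Return a JSON string holding only the title and description.

    dict.get returns None for a missing key instead of raising KeyError.
    """
    return json.dumps({
        "title": metadata.get("name"),
        "description": metadata.get("description"),
    })


# Hypothetical stand-in for the parsed ld+json data; the real page has many more keys.
sample = {
    "@type": "Movie",
    "name": "Example Movie",
    "description": "An example plot.",
    "genre": ["Action"],
}

print(summarize(sample))
# {"title": "Example Movie", "description": "An example plot."}
```

In the scraper above, this would amount to calling summarize(p) in place of the json.dumps line.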

Thang Pham · Accepted Answer · 2021-03-04 20:52:29Z

You just need to create a new dictionary from p using two of its keys, name and description.

        # Check response object's status code
        if r:
            p = json.loads("".join(soup.find('script', {'type':'application/ld+json'}).contents))
            desired_output = {"title": p["name"], "description": p["description"]}
            print(desired_output)
        else:
            print('\nInvalid movie page!')

Output:

{'title': 'Star Wars: Episode V - The Empire Strikes Back', 'description': 'Star Wars: Episode V - The Empire Strikes Back is a movie starring Mark Hamill, Harrison Ford, and Carrie Fisher. After the Rebels are brutally overpowered by the Empire on the ice planet Hoth, Luke Skywalker begins Jedi training...'}
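Note that printing a dictionary directly shows Python's repr, with single quotes, which is not valid JSON. If the double-quoted form from the question is needed, the dictionary can be passed through json.dumps. A small sketch of the difference, using made-up placeholder values rather than the real page data:

```python
import json

# Placeholder values for illustration only.
desired_output = {"title": "Example Movie", "description": "An example plot."}

as_repr = str(desired_output)         # what print(desired_output) displays
as_json = json.dumps(desired_output)  # a valid JSON string

print(as_repr)  # {'title': 'Example Movie', 'description': 'An example plot.'}
print(as_json)  # {"title": "Example Movie", "description": "An example plot."}
```

Keeping the dictionary is the better choice if the values are used later in the program; converting to a JSON string is mainly useful for output or for sending the data elsewhere.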

edited Mar 4, 2021 at 20:52

answered Mar 4, 2021 at 20:36

Thang Pham


2 Comments

GAP2002 Over a year ago

p is already type dict so there is no need to use the dict() function!

Thang Pham Over a year ago

Updated. Thanks!
