1

I'm a bit new to Python. I've been trying to scrape a page using Beautiful Soup and output the results in JSON format using simplejson.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup
import json as simplejson 

webpages = (
    "page1.html",
    "page2.html",
    "page3.html"
)

my_dict = {}

for webpage in webpages:
    soup = BeautifulSoup(open(webpage))
    title = soup.title.string
    body = soup.find(id="bodyText")
    my_dict['title'] = title
    my_dict['body']= str(body)

print simplejson.dumps(my_dict,indent=4)

I'm only getting the results of the last page. Can someone tell me where I'm going wrong?

4 Answers

3

You are overwriting your dictionary each time through the loop. Indent the print statement so that it is inside the for loop:

for webpage in webpages:
    soup = BeautifulSoup(open(webpage))
    title = soup.title.string
    body = soup.find(id="bodyText")
    my_dict['title'] = title
    my_dict['body']= str(body)

    print simplejson.dumps(my_dict,indent=4)
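
To see why only the last page survives, here is a minimal sketch of the overwrite itself. It is written for Python 3 and uses made-up title strings in place of parsed pages, so the titles are assumptions and not output from the original files:

```python
import json

# Hypothetical titles standing in for soup.title.string on three pages.
titles = ("First page", "Second page", "Third page")

my_dict = {}
snapshots = []
for title in titles:
    # Same key on every pass, so the previous value is replaced.
    my_dict["title"] = title
    # Serializing inside the loop captures each page before it is overwritten.
    snapshots.append(json.dumps(my_dict))

# Serializing after the loop only sees whatever was assigned last.
final = json.dumps(my_dict)

for snapshot in snapshots:
    print(snapshot)
print(final)
```

The dictionary never holds more than one title at a time, which is why printing inside the loop shows every page while printing after it shows only the last.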

3 Comments

Yup; OP, you're destroying title and body each time through the loop.
Thanks, I was actually trying to get it into one dictionary, but I now realize that the output is the same anyway!
If you want to put them all in one dictionary, one way is to make a dictionary of dictionaries, where the key for the subdictionary is the name of the page.
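
A rough sketch of that dictionary-of-dictionaries idea, in Python 3. The `extracted` tuples and the `collect_by_page` helper are hypothetical stand-ins for what Beautiful Soup would pull out of each file; they are not part of the original code:

```python
import json

# Hypothetical stand-ins for (file name, title text, body markup as a string).
extracted = [
    ("page1.html", "First page", '<div id="bodyText">one</div>'),
    ("page2.html", "Second page", '<div id="bodyText">two</div>'),
    ("page3.html", "Third page", '<div id="bodyText">three</div>'),
]


def collect_by_page(pages):
    """Return {page name: {"title": ..., "body": ...}}, one inner dict per page."""
    results = {}
    for name, title, body in pages:
        # A fresh inner dictionary per page, keyed by the file name.
        results[name] = {"title": title, "body": body}
    return results


by_page = collect_by_page(extracted)
print(json.dumps(by_page, indent=4))
```

Because file names are unique, no entry overwrites another, and everything can be dumped once after the loop.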
1
results = [] # you need a list to collect all dictionaries

for webpage in webpages:

    soup = BeautifulSoup(open(webpage))
    this_dict = {}
    this_dict['title'] = soup.title.string
    this_dict['body'] = str(soup.find(id="bodyText"))  # convert the element to a string so it can be serialized
    results.append(this_dict)

print simplejson.dumps(results, indent=4)

I have a feeling, however, that what you want is a dictionary where the keys are the page titles and the values are the bodies:

results = {}

for webpage in webpages:
    soup = BeautifulSoup(open(webpage))
    results[soup.title.string] = str(soup.find(id='bodyText'))

print simplejson.dumps(results, indent=4)

Or using comprehensions:

soups = (BeautifulSoup(open(webpage)) for webpage in webpages)
results = {soup.title.string: str(soup.find(id='bodyText')) for soup in soups}
print simplejson.dumps(results, indent=4)
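
One thing to watch: json.dumps only handles basic types (dicts, lists, strings, numbers and so on), so the element returned by soup.find needs to be turned into a string first, as the question does with str(body). A small Python 3 sketch of that, using a made-up `FakeElement` class as a stand-in and not a real Beautiful Soup object:

```python
import json


class FakeElement:
    """Hypothetical stand-in for a parsed HTML element (not a Beautiful Soup class)."""

    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        return self.markup


element = FakeElement('<div id="bodyText">Hello</div>')

try:
    json.dumps({"body": element})
    raw_failed = False
except TypeError:
    # Arbitrary objects are rejected unless they are converted first.
    raw_failed = True

# Option 1: convert explicitly before building the dictionary.
as_text = json.dumps({"body": str(element)})

# Option 2: let json.dumps call str() on anything it cannot serialize.
via_default = json.dumps({"body": element}, default=str)

print(raw_failed, as_text == via_default)
```

Converting explicitly keeps the data structure plain; `default=str` is convenient but will silently stringify anything else that slips in.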

PS. Please forgive any mistakes; I am writing from a phone...

0

Since you are overwriting title and body in each iteration, there are two ways of handling it:

  1. Create a list of all dictionaries as:

    all_dict = []
    for webpage in webpages:
        soup = BeautifulSoup(open(webpage))
        title = soup.title.string
        body = soup.find(id="bodyText")
        my_dict = {}  # a new dictionary for each page
        my_dict['title'] = title
        my_dict['body'] = str(body)
        all_dict.append(my_dict)

    for my_dict in all_dict:
        print simplejson.dumps(my_dict,indent=4)
    
  2. Use the iteration number from enumerate() to create distinct key names such as title1, body1, title2, body2, etc. This way every title and body is preserved under its own key in the same dictionary:

    my_dict = {}
    for i,webpage in enumerate(webpages, 1):  # start counting at 1 to get title1, body1, ...
        soup = BeautifulSoup(open(webpage))
        title = soup.title.string
        body = soup.find(id="bodyText")
        my_dict['title'+str(i)] = title
        my_dict['body'+str(i)]= str(body)

    print simplejson.dumps(my_dict,indent=4)
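
A note on option 1: each page needs its own dictionary. Appending one shared dictionary over and over stores the same object several times, so every list entry ends up showing the last page. A short Python 3 sketch with made-up titles (assumptions, not data from the original pages):

```python
# Hypothetical titles standing in for the parsed pages.
titles = ["First page", "Second page", "Third page"]

# Reusing one dictionary: the list holds three references to the same object.
shared = {}
aliased = []
for title in titles:
    shared["title"] = title
    aliased.append(shared)

# Creating a new dictionary per page: each entry keeps its own data.
fresh = []
for title in titles:
    fresh.append({"title": title})

print([d["title"] for d in aliased])
print([d["title"] for d in fresh])
```

Creating the dictionary inside the loop is the simplest way to avoid the aliasing.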
    

-2

Indentation can work wonders in Python. Only the last line needed to be indented so that it sits inside the for loop:

from bs4 import BeautifulSoup
import json as simplejson

webpages = (
    "page1.html",
    "page2.html",
    "page3.html"
)

my_dict = {}

for webpage in webpages:
    soup = BeautifulSoup(open(webpage))
    title = soup.title.string
    body = soup.find(id="bodyText")
    my_dict['title'] = title
    my_dict['body']= str(body)

    print simplejson.dumps(my_dict,indent=4)

Or, if you really want all the data in one dictionary, you could try:

my_dict['title'] = my_dict.get("title","")+","+title
my_dict['body']= my_dict.get("body","")+","+str(body)

So the code may look like:

from bs4 import BeautifulSoup
import json as simplejson

webpages = (
    "page1.html",
    "page2.html",
    "page3.html"
)

my_dict = {}

for webpage in webpages:
    soup = BeautifulSoup(open(webpage))
    title = soup.title.string
    body = soup.find(id="bodyText")

    # list.append returns None, so do not assign its result back;
    # setdefault creates the list on first use and returns it
    my_dict.setdefault("title", []).append(title)
    my_dict.setdefault("body", []).append(str(body))

print simplejson.dumps(my_dict,indent=4)

5 Comments

What if, for example, the 'title' tag contains a comma?
The delimiter should be chosen by the user; it can be modified accordingly.
The title said "appending", which is why I came up with this suggestion.
Yes, I was trying to get it into one dictionary. But I've now realised that the output is the same anyway! Thanks for your help.
The delimiter for JSON is already specified. If title and body should be lists, he should initialize his data structure as my_dict = {'title': [], 'body': []} and call my_dict['title'].append(title) in each iteration.
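
A minimal Python 3 sketch of that last suggestion, with made-up (title, body) pairs standing in for the parsed pages (the sample strings are assumptions):

```python
import json

# Hypothetical (title, body markup) pairs standing in for the parsed pages.
pages = [
    ("First page", "<p>one</p>"),
    ("Second, with a comma", "<p>two</p>"),
    ("Third page", "<p>three</p>"),
]

# Initialize both lists up front, then append in each iteration.
my_dict = {"title": [], "body": []}
for title, body in pages:
    my_dict["title"].append(title)
    my_dict["body"].append(body)

# list.append mutates the list in place and returns None,
# which is why its result should not be assigned back to the key.
returned = [].append("x")

print(json.dumps(my_dict, indent=4))
```

Keeping the values in lists means a comma inside a title is not a problem, since JSON already delimits the items.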
