Python: Appending to a dictionary with a for loop to output as JSON

Question

I'm a bit new to Python. I'm trying to scrape a page using Beautiful Soup and output the results in JSON format using SimpleJson.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup
import json as simplejson 

webpages = (
    "page1.html",
    "page2.html",
    "page3.html"
)

my_dict = {}

for webpage in webpages:
    soup = BeautifulSoup(open(webpage))
    title = soup.title.string
    body = soup.find(id="bodyText")
    my_dict['title'] = title
    my_dict['body']= str(body)

print simplejson.dumps(my_dict,indent=4)

I'm only getting the results of the last page. Can someone tell me where I'm going wrong?

Gillespie · Accepted Answer · 2014-12-22 14:43:32Z

3

You are overwriting your dictionary each time through the loop. Tab the print statement over so it is included in the for loop:

for webpage in webpages:

    soup = BeautifulSoup(open(webpage))

    title = soup.title.string

    body = soup.find(id="bodyText")

    my_dict['title'] = title
    my_dict['body']= str(body)

    print simplejson.dumps(my_dict,indent=4)

answered Dec 22, 2014 at 14:43

Gillespie


3 Comments

FrobberOfBits Over a year ago

Yup; OP, you're destroying title and body each time through the loop.

Martin Wright Over a year ago

Thanks, I was actually trying to get it into one dictionary, but I now realize that the output is the same anyway!

Gillespie Over a year ago

If you want to put them all in one dictionary, one way is to make a dictionary of dictionaries, where the key for the subdictionary is the name of the page.
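
A minimal sketch of the dictionary-of-dictionaries idea from the comment above. To keep it runnable without Beautiful Soup or the HTML files, the hypothetical `parsed_pages` mapping stands in for the title and body extracted from each page; in the real script those values would come from `soup.title.string` and `str(soup.find(id="bodyText"))`.

```python
import json

# Hypothetical stand-in for the parsed pages: file name -> (title, body).
parsed_pages = {
    "page1.html": ("First page", '<div id="bodyText">one</div>'),
    "page2.html": ("Second page", '<div id="bodyText">two</div>'),
}

my_dict = {}
for page_name, (title, body) in sorted(parsed_pages.items()):
    # Each page gets its own sub-dictionary, so nothing is overwritten.
    my_dict[page_name] = {"title": title, "body": body}

output = json.dumps(my_dict, indent=4, sort_keys=True)
print(output)
```

Keying by file name assumes the names are unique; two pages with the same key would overwrite each other just as in the question.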

m.wasowski · Accepted Answer · 2014-12-22 15:13:00Z

results = []  # you need a list to collect all dictionaries

for webpage in webpages:
    soup = BeautifulSoup(open(webpage))
    this_dict = {}
    this_dict['title'] = soup.title.string
    # find() returns a Tag, which is not JSON serializable, so convert it to a string
    this_dict['body'] = str(soup.find(id="bodyText"))
    results.append(this_dict)

print simplejson.dumps(results, indent=4)

I have a feeling, however, that what you want is a dictionary where the keys are page titles and the values are page bodies:

results = {}

for webpage in webpages:
    soup = BeautifulSoup(open(webpage))
    # str() turns the Tag into text that json can serialize
    results[soup.title.string] = str(soup.find(id='bodyText'))

print simplejson.dumps(results, indent=4)

Or using comprehensions:

soups = (BeautifulSoup(open(webpage)) for webpage in webpages)
# str() turns each Tag into text that json can serialize
results = {soup.title.string: str(soup.find(id='bodyText')) for soup in soups}
print simplejson.dumps(results, indent=4)

PS. Please forgive any mistakes; I am writing from a phone...
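
One thing to watch with the snippets in this answer: `soup.find` returns a Tag object rather than a string, and `json.dumps` raises `TypeError` for objects it does not know how to serialize. Converting the value with `str()` first (as the question's code does) avoids that; passing `default=str` to `json.dumps` is another option. The sketch below uses a hypothetical `FakeTag` class in place of a real bs4 Tag so that it runs without Beautiful Soup installed.

```python
import json


class FakeTag(object):
    """Hypothetical stand-in for a bs4 Tag: not a str, but it has a string form."""

    def __init__(self, html):
        self.html = html

    def __str__(self):
        return self.html


results = {"Some title": FakeTag('<div id="bodyText">hello</div>')}

try:
    json.dumps(results)
    failed = False
except TypeError:
    failed = True  # the object is not JSON serializable on its own

# default=str is called for any value json cannot serialize by itself.
output = json.dumps(results, indent=4, default=str)
print(output)
```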

Irshad Bhat · Accepted Answer · 2014-12-22 15:11:57Z

Since you are destroying title and body in each iteration, there are two ways of handling it:

Create a list of all dictionaries as:

all_dict = []
for webpage in webpages:
    soup = BeautifulSoup(open(webpage))
    title = soup.title.string
    body = soup.find(id="bodyText")
    # create a new dictionary per page; reusing one would fill the list with the same object
    my_dict = {}
    my_dict['title'] = title
    my_dict['body'] = str(body)
    all_dict.append(my_dict)

for my_dict in all_dict:
    print simplejson.dumps(my_dict, indent=4)

Use the iteration number from enumerate() to create distinct key names such as title1, body1, title2, body2, etc. This way each title and body is preserved under its own key in the same dictionary:

my_dict = {}
# start counting at 1 so the keys are title1, body1, title2, body2, ...
for i, webpage in enumerate(webpages, 1):
    soup = BeautifulSoup(open(webpage))
    title = soup.title.string
    body = soup.find(id="bodyText")
    my_dict['title' + str(i)] = title
    my_dict['body' + str(i)] = str(body)

print simplejson.dumps(my_dict, indent=4)

ZdaR · Accepted Answer · 2014-12-22 15:04:47Z

-2

Indentation can work wonders in Python: only the last line needed to be indented so that it sits inside the for loop.

from bs4 import BeautifulSoup
import json as simplejson 

webpages = (

"page1.html",
"page2.html",
"page3.html"

)

my_dict = {}

for webpage in webpages:

    soup = BeautifulSoup(open(webpage))

    title = soup.title.string

    body = soup.find(id="bodyText")

    my_dict['title'] = title
    my_dict['body']= str(body)

    print simplejson.dumps(my_dict,indent=4)

Or, if you really want all the data in one dictionary, you could try:

my_dict['title'] = my_dict.get("title", "") + "," + title
my_dict['body'] = my_dict.get("body", "") + "," + str(body)

So the code may look like:

from bs4 import BeautifulSoup
import json as simplejson 

webpages = (

"page1.html",
"page2.html",
"page3.html"

)

my_dict = {}

for webpage in webpages:

    soup = BeautifulSoup(open(webpage))

    title = soup.title.string

    body = soup.find(id="bodyText")

    # list.append() returns None, so do not assign its result; setdefault creates the list on first use
    my_dict.setdefault('title', []).append(title)
    my_dict.setdefault('body', []).append(str(body))

print simplejson.dumps(my_dict,indent=4)

edited Dec 22, 2014 at 15:04

answered Dec 22, 2014 at 14:45

ZdaR


5 Comments

m.wasowski Over a year ago

What if, for example, the 'title' tag contains a comma?

ZdaR Over a year ago

The delimiter should be specified by the user; it can be modified accordingly.

ZdaR Over a year ago

Because the title said "appending", that's why I came up with this suggestion.

Martin Wright Over a year ago

Yes, I was trying to get it into one dictionary. But I've now realised that the output is the same anyway! Thanks for your help.

m.wasowski Over a year ago

The delimiter for JSON is already specified. If title and body should be lists, he should have his data structure initialized as my_dict = {'title': [], 'body': []} and do my_dict['title'].append(title) in each iteration.
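
The suggestion in the last comment can be sketched as follows. The hypothetical `scraped` list stands in for the title and body pulled from each page, so the example runs without Beautiful Soup; it is an illustration under that assumption, not the original poster's code.

```python
import json

# Hypothetical stand-in for the (title, body) pairs scraped from each page.
scraped = [
    ("Title, with a comma", "<p>first</p>"),
    ("Second title", "<p>second</p>"),
]

# Initialize both lists once, before the loop.
my_dict = {"title": [], "body": []}
for title, body in scraped:
    my_dict["title"].append(title)
    my_dict["body"].append(body)

output = json.dumps(my_dict, indent=4)
print(output)
```

Because each title is a separate list element, a comma inside a title causes no ambiguity, which is the problem raised about joining the values into one comma-separated string.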
