The code below writes JSON with the wrong structure to the file:

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import json

urls = {}
urls['Av'] = {'Áa', 'Bb'}

data = {}
for key, value in urls.items(): 
    for x in value: 

        url = 'https://www.google.pt/search?q=' + key + '%20' + x
        driver = webdriver.Chrome()
        driver.get(url)
        html = driver.page_source

        soup = BeautifulSoup(html, 'html.parser')
        a = soup.find("body")

        for child in a.find_all("div", {'class': 'g'}):
            h2 = child.find("span", {'class': 'Q8LRLc'})
            div = child.find("a", {'class': 'Fx4vi'})

        data[key] = []
        data[key].append({'h2': h2, 'div': div})
        print(data)

        with open("data_file.json", "a") as write_file: 
            json.dump(data, write_file, indent=4)

        driver.quit()
  • Define "wrong structure", then define "good structure". Then give us example data that you would like to store and what it currently stores. If you have any errors, please post the full stack trace in a code block. Commented May 30, 2020 at 20:40
  • It outputs this: { "Av": [ { "h2": null, "div": null } ] }{ "Av": [ { "h2": null, "div": null } ] } Commented May 30, 2020 at 20:41
  • Please include this as an edit in the post in a code block, not as a comment. Commented May 30, 2020 at 20:42
  • By the way, be very careful with open("data_file.json", "a"). This means that you are appending to the file, each time writing a new version of data. This will result in a technically invalid .json file. Did you mean to have this after the end of the for loop? Commented May 30, 2020 at 20:43
  • @Alvaro You still haven't moved these details from the comment section into your question. Voting to close this question until more details have been added. Commented May 30, 2020 at 20:47

1 Answer

I see several issues. Most involve statements being inside a loop when they should be outside, or outside when they should be inside.

  • You set your variables h2 and div inside the loop for child in a.find_all("div", {'class': 'g'}):, but you add them to data outside that loop, so only the last values are added.
  • You initialize data[key] inside the for x in value loop, so the list is reset for every search term. It should be initialized once per key, in the outer loop.
  • You open the file in append mode and dump all of data on every iteration. I'd write it just once.
  • You create a new driver on every iteration. One driver can be reused for all requests.
  • requests and selenium.webdriver.chrome.options.Options are both unused imports.
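To make the append problem concrete, here is a minimal sketch that needs no browser. It mimics the question by appending a full dump of data twice, then tries to parse the file. The scratch path and the hard-coded data are assumptions for the demonstration only.

```python
import json
import os
import tempfile

# Same shape as the output reported in the question's comments.
data = {"Av": [{"h2": None, "div": None}]}

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "data_file.json")

# Mimic the question: append a full dump of `data` on each of two iterations.
for _ in range(2):
    with open(path, "a") as write_file:
        json.dump(data, write_file)

with open(path) as read_file:
    raw = read_file.read()

# Two JSON objects written back to back are not one valid JSON document.
try:
    json.loads(raw)
    appended_is_valid = True
except json.JSONDecodeError:
    appended_is_valid = False

print(raw)
print(appended_is_valid)  # False
```

The file ends up holding two objects in a row, which is the `{...}{...}` shape seen in the reported output, and the parser rejects it.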

So, I'd change it like this:

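import json

from bs4 import BeautifulSoup
from selenium import webdriver

urls = {}
urls['Av'] = {'Áa', 'Bb'}

data = {}
driver = webdriver.Chrome()  # create the driver once and reuse it for every URL
# "w" instead of "a": re-running the script then replaces the file instead of
# appending a second JSON document to it
with open("data_file.json", "w") as write_file:
    for key, value in urls.items():
        data[key] = []  # initialize only once per key

        for x in value:
            url = 'https://www.google.pt/search?q=' + key + '%20' + x
            driver.get(url)
            html = driver.page_source
            soup = BeautifulSoup(html, 'html.parser')
            a = soup.find("body")

            for child in a.find_all("div", {'class': 'g'}):
                h2 = child.find("span", {'class': 'Q8LRLc'})
                div = child.find("a", {'class': 'Fx4vi'})
                # Tag objects are not JSON-serializable, so store their text
                # (assuming the text is what you want); None if nothing was found
                data[key].append({  # update data for every h2/div found
                    'h2': h2.get_text(strip=True) if h2 is not None else None,
                    'div': div.get_text(strip=True) if div is not None else None,
                })

    json.dump(data, write_file, indent=4)  # This write can be done once, outside all loops!

driver.quit()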

A little hard for me to test, but hope that helps! Happy Coding!

1 Comment

I think you should first load the JSON file, update it with the fetched data, and then write it back to the file with dump at the end.
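
A rough sketch of that load, update, dump idea, with the scraping left out. The helper name update_json_file, the scratch path, and the "first"/"second" values are made up for illustration. It assumes the file is either missing or already holds a single JSON object whose values are lists.

```python
import json
import os
import tempfile


def update_json_file(path, new_data):
    """Merge new_data into the JSON object stored at path and rewrite the file."""
    try:
        with open(path) as read_file:
            stored = json.load(read_file)
    except FileNotFoundError:
        stored = {}  # first run: nothing has been saved yet

    # Extend the list under each key instead of replacing it.
    for key, rows in new_data.items():
        stored.setdefault(key, []).extend(rows)

    # "w" replaces the old contents, so the file always holds one document.
    with open(path, "w") as write_file:
        json.dump(stored, write_file, indent=4)
    return stored


# Hypothetical scratch file and placeholder rows, for demonstration only.
path = os.path.join(tempfile.mkdtemp(), "data_file.json")
update_json_file(path, {"Av": [{"h2": "first", "div": None}]})
result = update_json_file(path, {"Av": [{"h2": "second", "div": None}]})
print(len(result["Av"]))  # 2
```

Because the file is rewritten in full each time, it stays parseable across runs, at the cost of reading the whole file back before every update.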
