I'm trying to work with real estate data and, after failing on my own, managed to borrow some code that pulls part of the data. Unfortunately I have no idea how to parse the rest, as the JSON structure is very confusing to me. This is not my area of expertise, so if anyone has ideas on how to approach this I would greatly appreciate it. If needed I can post the entire JSON, but it's very long.
import requests
from bs4 import BeautifulSoup
import json
import pandas as pd
import pprint
#-------------------------------------------------------------------------------------------------------------------------#
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'upgrade-insecure-requests': '1'
}
#-------------------------------------------------------------------------------------------------------------------------#
def get_soup(address):
    page_request = requests.get(address, headers=HEADERS)
    return BeautifulSoup(page_request.text, "lxml")
#-------------------------------------------------------------------------------------------------------------------------#
def fetch_content(soup, verbose=False):
    item = soup.select_one("script#hdpApolloPreloadedData").text
    d = json.loads(item)['apiCache']
    return json.loads(d)
#-------------------------------------------------------------------------------------------------------------------------#
def process_fetched_content(raw_dictionary=None):
    if raw_dictionary is not None:
        keys = [k for k in raw_dictionary.keys() if k.startswith('VariantQuery{"zpid":')]
        property_info = dict((k.split(':')[-1].replace('}',''), raw_dictionary.get(k).get('property', None)) for k in keys)
        return property_info
    else:
        return None
#-------------------------------------------------------------------------------------------------------------------------#
if __name__ == "__main__":
    link = 'https://www.zillow.com/homedetails/2408-Comstock-Ct-Naperville-IL-60564/5367006_zpid/'
    soup = get_soup(link)
    results = process_fetched_content(raw_dictionary = fetch_content(soup, verbose=False))
    pprint.pprint(results)
Sidenote: I know Zillow doesn't take kindly to scraping, but I'm not trying to pull data at a large scale, so I'm not too concerned.
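One way to get a handle on a large, deeply nested JSON structure is to flatten it into "path = value" pairs and skim the paths for the fields you care about. Below is a rough sketch of that idea, not a definitive approach. The helper name iter_paths and the sample data (including its field names) are made up for illustration; in the script above you would pass in results instead of sample.

```python
def iter_paths(node, prefix=""):
    """Yield (path, value) pairs for every leaf of a nested dict/list structure."""
    if isinstance(node, dict):
        for key, value in node.items():
            # Join nested dictionary keys with dots: "outer.inner"
            child = f"{prefix}.{key}" if prefix else str(key)
            yield from iter_paths(value, child)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            # Mark list positions with brackets: "items[0]"
            yield from iter_paths(value, f"{prefix}[{index}]")
    else:
        # Anything that is not a dict or list is treated as a leaf value
        yield prefix, node


# Invented sample shaped like a nested property record (not real Zillow fields)
sample = {
    "5367006": {
        "address": {"city": "Naperville", "state": "IL"},
        "priceHistory": [{"event": "Sold", "price": 100}],
    }
}

flat = dict(iter_paths(sample))
for path, value in flat.items():
    print(path, "=", value)
```

Once you know the path to a value, you can pull it out with ordinary indexing, for example sample["5367006"]["address"]["city"] in the made-up data above.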
item variable that contains json in the fetch_content() function? You seem to be decoding just fine with d = json.loads(..., except you don't want to repeat that. Did you mean return d?
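On the repeated json.loads: the posted code appears to assume that the value under apiCache is itself a JSON-encoded string sitting inside the outer JSON. If that assumption holds, the second call is needed, and return d would hand back a plain string rather than a dictionary. A small sketch with made-up data shows the difference; decode_nested is a hypothetical helper, not part of the original script.

```python
import json


def decode_nested(value):
    """Keep decoding while the value is still a string holding valid JSON."""
    while isinstance(value, str):
        try:
            value = json.loads(value)
        except json.JSONDecodeError:
            # A plain string that is not JSON: stop and return it unchanged
            break
    return value


# Made-up stand-in for the page's script contents: JSON embedded in JSON
inner = {'VariantQuery{"zpid":5367006}': {"property": {"zpid": 5367006}}}
page_text = json.dumps({"apiCache": json.dumps(inner)})

outer = json.loads(page_text)
print(type(outer["apiCache"]))   # still a str after the first decode

cache = decode_nested(outer["apiCache"])
print(type(cache))               # a dict after the second decode
```

If apiCache ever arrives as a dictionary already, decode_nested returns it untouched, so the helper covers both cases.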