I currently have a Python 2.7 script which scrapes Facebook and captures some JSON data from each page. The JSON data contains personal information. A sample of the JSON data is below:-
{
"id": "4",
"name": "Mark Zuckerberg",
"first_name": "Mark",
"last_name": "Zuckerberg",
"link": "http://www.facebook.com/zuck",
"username": "zuck",
"gender": "male",
"locale": "en_US"
}
The JSON values can vary from page to page. The above example lists all the possibles but sometimes, a value such as 'username' may not exist and I may encounter JSON data such as:-
{
"id": "6",
"name": "Billy Smith",
"first_name": "Billy",
"last_name": "Smith",
"gender": "male",
"locale": "en_US"
}
With this data, I want to populate a database table. As such, my code is as below:-
results_json = simplejson.loads(scraperwiki.scrape(profile_url))
for result in results_json:
profile = dict()
try:
profile['id'] = int(results_json['id'])
except:
profile['id'] = ""
try:
profile['name'] = results_json['name']
except:
profile['name'] = ""
try:
profile['first_name'] = results_json['first_name']
except:
profile['first_name'] = ""
try:
profile['last_name'] = results_json['last_name']
except:
profile['last_name'] = ""
try:
profile['link'] = results_json['link']
except:
profile['link'] = ""
try:
profile['username'] = results_json['username']
except:
profile['username'] = ""
try:
profile['gender'] = results_json['gender']
except:
profile['gender'] = ""
try:
profile['locale'] = results_json['locale']
except:
profile['locale'] = ""
The reason I have so many try/excepts is to account for when the key value doesn't exist on the webpage. Nonetheless, this seems to be a really clumpsy and messy way to handle this issue.
If I remove these try / exception clauses, should my scraper encounter a missing key, it returns a KeyError such as "KeyError: 'username'" and my script stops running.
Any suggestions on a much smarter and improved way to handle these errors so that, should a missing key be encountered, the script continues.
I've tried creating a list of the JSON values and looked to iterate through them with an IF clause but I just can't figure it out.