I currently have a Python 2.7 script which scrapes Facebook and captures some JSON data from each page. The JSON data contains personal information. A sample of the JSON data is below:-

{
   "id": "4",
   "name": "Mark Zuckerberg",
   "first_name": "Mark",
   "last_name": "Zuckerberg",
   "link": "http://www.facebook.com/zuck",
   "username": "zuck",
   "gender": "male",
   "locale": "en_US"
}

The JSON keys can vary from page to page. The above example lists all the possible keys, but sometimes a key such as 'username' may not exist and I may encounter JSON data such as:-

{
   "id": "6",
   "name": "Billy Smith",
   "first_name": "Billy",
   "last_name": "Smith",
   "gender": "male",
   "locale": "en_US"
}

With this data, I want to populate a database table. As such, my code is as below:-

results_json = simplejson.loads(scraperwiki.scrape(profile_url))
for result in results_json:
    profile = dict()
    try:
        profile['id'] = int(results_json['id'])
    except:
        profile['id'] = ""
    try:
        profile['name'] = results_json['name']
    except:
        profile['name'] = ""
    try:
        profile['first_name'] = results_json['first_name']
    except:
        profile['first_name'] = ""
    try:
        profile['last_name'] = results_json['last_name']
    except:
        profile['last_name'] = ""
    try:
        profile['link'] = results_json['link']
    except:
        profile['link'] = ""
    try:
        profile['username'] = results_json['username']
    except:
        profile['username'] = ""
    try:
        profile['gender'] = results_json['gender']
    except:
        profile['gender'] = ""
    try:
        profile['locale'] = results_json['locale']
    except:
        profile['locale'] = ""

The reason I have so many try/excepts is to account for when a key doesn't exist on the webpage. Nonetheless, this seems to be a really clumsy and messy way to handle this issue.

If I remove these try/except clauses and my scraper encounters a missing key, it raises a KeyError such as "KeyError: 'username'" and my script stops running.

Any suggestions on a smarter way to handle these errors so that, should a missing key be encountered, the script continues?

I've tried creating a list of the JSON keys and iterating through them with an if clause, but I just can't figure it out.

1 Answer

Use the .get() method instead:

>>> a = {'bar': 'eggs'}
>>> a['foo']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'foo'
>>> a.get('foo', 'default value')
'default value'
>>> a.get('bar', 'default value')
'eggs'

The .get() method returns the value for the requested key, or the default value if the key is missing.
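
Applied to the question's code, this might look roughly like the sketch below. It assumes results_json is the parsed dictionary for a single profile (a hard-coded sample stands in for the scraped data here), and FIELDS is a name introduced purely for illustration.

```python
# Sample standing in for simplejson.loads(scraperwiki.scrape(profile_url)).
results_json = {
    "id": "6",
    "name": "Billy Smith",
    "first_name": "Billy",
    "last_name": "Smith",
    "gender": "male",
    "locale": "en_US",
}

# Hypothetical list of the columns wanted in the database row.
FIELDS = ("id", "name", "first_name", "last_name",
          "link", "username", "gender", "locale")

# .get() returns '' for any key that is missing, so no try/except is needed.
profile = dict((key, results_json.get(key, "")) for key in FIELDS)

print(profile["name"])            # Billy Smith
print(repr(profile["username"]))  # ''
```

Only the keys named in FIELDS end up in profile, so any extra keys in the JSON are ignored.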

Or you can create a new dict with empty strings for each key and use .update() on it:

profile = dict.fromkeys('id name first_name last_name link username gender locale'.split(), '')
profile.update(results_json)

dict.fromkeys() creates a dictionary with all the keys you request set to a given default value ('' in the above example). .update() then copies all keys and values from the parsed JSON dictionary (results_json in the question), replacing anything already there.
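
If the JSON may carry keys that should not reach the table, .update() can be given an iterable of (key, value) pairs instead of the whole dictionary. The following is a minimal sketch under that assumption. The sample data, the unexpected_key entry, and the FIELDS name are made up for illustration.

```python
# Sample standing in for the parsed JSON; "unexpected_key" is a made-up extra.
results_json = {
    "id": "6",
    "name": "Billy Smith",
    "locale": "en_US",
    "unexpected_key": "not wanted in the table",
}

# Hypothetical list of the columns wanted in the database row.
FIELDS = "id name first_name last_name link username gender locale".split()

# Start with every wanted key set to the default value ''.
profile = dict.fromkeys(FIELDS, "")

# .update() also accepts an iterable of (key, value) pairs, so a generator
# expression can skip any key that is not already in profile.
profile.update((key, value) for key, value in results_json.items()
               if key in profile)

print(sorted(profile) == sorted(FIELDS))
```

Because only keys already present in profile are copied, the result always has exactly the keys listed in FIELDS.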

6 Comments

Sadly, as much as I love the fromkeys/update solution, I don't think it works in the OP's case. At least one of the values has to be transformed with int, and the JSON object may have extra keys that he doesn't want copied into the profile. But the get answer you also gave obviously doesn't have any such problems, and it's already enough for the OP to make his code 4x more readable and maintainable, which is a pretty huge win.
@abarnert: the .fromkeys()/.update() setup works fine, because nothing in the source JSON is using int either. Don't assume that the id is to be treated as an int unless explicitly stated. :-) If there are too many keys it's easy enough to write a generator expression for the update (.update() accepts an iterable of key/value pairs as well as a dict).
In the OP's code, he's got profile['id'] = int(results_json['id']). So, presumably, the JSON has the int rendered as a string, but the profile has to have an actual integer.
@MartijnPieters. Perfect - your solution worked as I wished and it's reduced my cost by 50%. Thanks
@abarnert: Hrm, not certain how I missed that one. :) Still, a profile['id'] = int(profile['id']) isn't that hard to add. :-P
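
One caveat on the int() conversion discussed above: if 'id' was missing and defaulted to '', int('') raises a ValueError. A narrow try/except around just that conversion keeps the question's original behaviour of falling back to an empty string. This is only a sketch, and to_int_or_blank is a hypothetical helper name, not something from the thread.

```python
def to_int_or_blank(value):
    """Return int(value), or '' when value is missing or not numeric."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return ""

# A profile whose id arrived as a string, as in the sample JSON.
profile = {"id": "4", "name": "Mark Zuckerberg"}
profile["id"] = to_int_or_blank(profile.get("id", ""))

# A profile with no id at all falls back to '' instead of raising.
missing = {"name": "No Id"}
missing["id"] = to_int_or_blank(missing.get("id", ""))

print(profile["id"])       # 4
print(repr(missing["id"]))  # ''
```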