
TL;DR: If fields loaded into a Pandas DataFrame are themselves JSON documents (dicts), how can they be worked with in a Pandas-like fashion?

Currently I'm dumping JSON/dictionary results from a Twitter library (twython) directly into a Mongo collection (called users here).

from twython import Twython
from pymongo import MongoClient

tw = Twython(...<auth>...)

# Using mongo as object storage 
client = MongoClient()
db = client.twitter
user_coll = db.users

user_batch = ... # collection of user ids
user_dict_batch = tw.lookup_user(user_id=user_batch)

for user_dict in user_dict_batch:
    # only insert users we haven't already stored
    if user_coll.find_one({"id": user_dict['id']}) is None:
        user_coll.insert(user_dict)

After populating this database I read the documents into Pandas:

import pandas

# Pull straight from mongo to pandas
cursor = user_coll.find()
df = pandas.DataFrame(list(cursor))

Which works like magic:

[screenshot of the resulting DataFrame: "Pandas is magic"]

I'd like to be able to manipulate the 'status' field Pandas-style (directly accessing its attributes). Is there a way?

[screenshot: the 'status' column, each cell holding a status dict]

EDIT: Something like df['status:text']. Status has fields like 'text' and 'created_at'. One option could be flattening/normalizing this JSON field, like the pull request Wes McKinney was working on.
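Roughly the shape I have in mind (just a sketch of the end state, assuming every user document actually carries a status dict with keys like 'text' and 'favorited'):

# flatten each status dict into its own prefixed columns
# ('status:' prefix is only illustrative, to avoid name collisions)
status_cols = df['status'].apply(pandas.Series).add_prefix('status:')
flat = pandas.concat([df.drop('status', axis=1), status_cols], axis=1)

# nested fields then become ordinary columns
texts = flat['status:text']
unfavorited = flat[flat['status:favorited'] == False]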

7 Comments
  • Can you give an example of what you actually want to do? You showed the df['status'] column, but what do you want to do with it? Commented Sep 6, 2013 at 19:54
  • FWIW There's a PR in the works for this: github.com/pydata/pandas/pull/4007 Commented Sep 6, 2013 at 19:55
  • Are there nested records in the elements of df.status? Commented Sep 6, 2013 at 20:02
  • @BrenBarn - I was hoping to be able to select within those fields, somewhat like df[df['status']['favorited'] == False]. Commented Sep 6, 2013 at 20:09
  • @PhillipCloud - Good to see that PR! Additionally, looks like someone else was doing the same type of thing with the Twitter API in this issue: github.com/pydata/pandas/issues/1067. Commented Sep 6, 2013 at 20:12

1 Answer


One solution is just to smash it with the Series constructor:

In [1]: df = pd.DataFrame([[1, {'a': 2}], [2, {'a': 1, 'b': 3}]])

In [2]: df
Out[2]: 
   0                   1
0  1           {u'a': 2}
1  2  {u'a': 1, u'b': 3}

In [3]: df[1].apply(pd.Series)
Out[3]: 
   a   b
0  2 NaN
1  1   3

In some cases you'll want to concat this onto the DataFrame in place of the dict column:

In [4]: dict_col = df.pop(1)  # here 1 is the column name

In [5]: pd.concat([df, dict_col.apply(pd.Series)], axis=1)
Out[5]: 
   0  a   b
0  1  2 NaN
1  2  1   3

If the nesting goes deeper, you can do this a few times...
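For example, a quick sketch of one extra level (assuming the value under 'c' is itself a dict, and using add_prefix just to keep the expanded column names distinct):

In [6]: df = pd.DataFrame([[1, {'a': 2, 'c': {'x': 1}}], [2, {'a': 1, 'c': {'x': 2}}]])

In [7]: expanded = df.pop(1).apply(pd.Series)       # first pass: outer dicts -> columns

In [8]: inner = expanded.pop('c').apply(pd.Series)  # second pass: nested dicts -> columns

In [9]: pd.concat([df, expanded, inner.add_prefix('c.')], axis=1)
Out[9]: 
   0  a  c.x
0  1  2    1
1  2  1    2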


4 Comments

Rad. This worked well, so long as there weren't null entries.
Also needed to merge the statuses in, adding a suffix so name collisions got decent names:
df2 = df[df.status.notnull()]
statuses = df2.status.apply(pandas.Series)
df2 = df2.merge(statuses, left_index=True, right_index=True, suffixes=("", "_status"))
Dang, it's annoying that you have to special-case NaN; another solution for that part is to fillna({}) first.
Oh that would have worked too, but I didn't need the empty results in this case.
