
TL;DR: If fields loaded into a Pandas DataFrame are themselves JSON documents (dicts), how can they be worked with in a Pandas-like fashion?

Currently I'm dumping JSON/dictionary results from a Twitter library (twython) directly into a Mongo collection (called users here).

from twython import Twython
from pymongo import MongoClient

tw = Twython(...<auth>...)

# Using mongo as object storage 
client = MongoClient()
db = client.twitter
user_coll = db.users

user_batch = ... # collection of user ids
user_dict_batch = tw.lookup_user(user_id=user_batch)

for user_dict in user_dict_batch:
    # only insert users we haven't already stored
    if user_coll.find_one({"id": user_dict['id']}) is None:
        user_coll.insert(user_dict)

After populating this database I read the documents into Pandas:

import pandas

# Pull straight from mongo to pandas
cursor = user_coll.find()
df = pandas.DataFrame(list(cursor))

Which works like magic:

[screenshot of the resulting DataFrame: "Pandas is magic"]

I'd like to be able to manipulate the 'status' field Pandas-style (directly accessing its attributes). Is there a way?

[screenshot: the 'status' column, each cell holding a status dict]

EDIT: Something like df['status:text']. Status has fields like 'text' and 'created_at'. One option could be flattening/normalizing this JSON field, like the pull request Wes McKinney was working on.
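Roughly the shape I have in mind (just a sketch of the end state, assuming every user document actually carries a status dict with keys like 'text' and 'favorited'):

# flatten each status dict into its own prefixed columns
# ('status:' prefix is only illustrative, to avoid name collisions)
status_cols = df['status'].apply(pandas.Series).add_prefix('status:')
flat = pandas.concat([df.drop('status', axis=1), status_cols], axis=1)

# nested fields then become ordinary columns
texts = flat['status:text']
unfavorited = flat[flat['status:favorited'] == False]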

7 Comments
  • Can you give an example of what you actually want to do? You showed the df['status'] column, but what do you want to do with it? Commented Sep 6, 2013 at 19:54
  • FWIW There's a PR in the works for this: github.com/pydata/pandas/pull/4007 Commented Sep 6, 2013 at 19:55
  • Are there nested records in the elements of df.status? Commented Sep 6, 2013 at 20:02
  • @BrenBarn - I was hoping to be able to select within those fields, somewhat like df[df['status']['favorited'] == False]. Commented Sep 6, 2013 at 20:09
  • @PhillipCloud - Good to see that PR! Additionally, looks like someone else was doing the same type of thing with the Twitter API in this issue: github.com/pydata/pandas/issues/1067. Commented Sep 6, 2013 at 20:12

1 Answer


One solution is just to smash it with the Series constructor:

In [1]: df = pd.DataFrame([[1, {'a': 2}], [2, {'a': 1, 'b': 3}]])

In [2]: df
Out[2]: 
   0                   1
0  1           {u'a': 2}
1  2  {u'a': 1, u'b': 3}

In [3]: df[1].apply(pd.Series)
Out[3]: 
   a   b
0  2 NaN
1  1   3

In some cases you'll want to concat this onto the DataFrame in place of the dict column:

In [4]: dict_col = df.pop(1)  # here 1 is the column name

In [5]: pd.concat([df, dict_col.apply(pd.Series)], axis=1)
Out[5]: 
   0  a   b
0  1  2 NaN
1  2  1   3

If the nesting goes deeper, you can do this a few times...
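For example, a quick sketch of one extra level (assuming the value under 'c' is itself a dict, and using add_prefix just to keep the expanded column names distinct):

In [6]: df = pd.DataFrame([[1, {'a': 2, 'c': {'x': 1}}], [2, {'a': 1, 'c': {'x': 2}}]])

In [7]: expanded = df.pop(1).apply(pd.Series)       # first pass: outer dicts -> columns

In [8]: inner = expanded.pop('c').apply(pd.Series)  # second pass: nested dicts -> columns

In [9]: pd.concat([df, expanded, inner.add_prefix('c.')], axis=1)
Out[9]: 
   0  a  c.x
0  1  2    1
1  2  1    2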


4 Comments

Rad. This worked well, so long as there weren't null entries.
Also needed to merge the statuses in, adding a suffix so name collisions got decent names:
df2 = df[df.status.notnull()]
statuses = df2.status.apply(pandas.Series)
df2 = df2.merge(statuses, left_index=True, right_index=True, suffixes=("", "_status"))
Dang, it's annoying that you have to special-case NaN; another solution for that part is to fillna({}) first.
Oh that would have worked too, but I didn't need the empty results in this case.
