Add column of DataFrame based on a nested JSON in column

Question

I have the following Pandas Dataframe (named dx):

                                    _id  user_type
0  {'$oid': '5b9058462f38434ab0d85cd3'}          1
1  {'$oid': '5b9058462f38434ab0d85ce9'}          1
2  {'$oid': '5b9058462f38434ab0d85ced'}          1
3  {'$oid': '5b9058462f38434ab0d85cee'}          1
4  {'$oid': '5b9058462f38434ab0d85cef'}          1

It came from a large JSON (roughly 500,000 rows and more columns than in this example). As you can see, the _id column contains nested JSON.

For this example, my goal is to add a new column named oid containing the oid code:

                                    _id  user_type                       oid
0  {'$oid': '5b9058462f38434ab0d85cd3'}          1  5b9058462f38434ab0d85cd3
1  {'$oid': '5b9058462f38434ab0d85ce9'}          1  5b9058462f38434ab0d85ce9
2  {'$oid': '5b9058462f38434ab0d85ced'}          1  5b9058462f38434ab0d85ced
3  {'$oid': '5b9058462f38434ab0d85cee'}          1  5b9058462f38434ab0d85cee
4  {'$oid': '5b9058462f38434ab0d85cef'}          1  5b9058462f38434ab0d85cef

I achieved the results using the following snippet:

dx['oid']=None
for i in dx.index:
     dx['oid'][i]=dx.at[i,'_id']['$oid']

This gives me what I am looking for, but it is very slow. It takes 3-4 minutes to populate the column of the example, which has only 5 rows!

How can I optimize the creation of a new column based on other columns with JSON values?

I can't use regex because the nested JSON is more complex than the one in the example.

BENY · Accepted Answer · 2018-12-27 22:39:31Z

3

The column has object dtype, and each cell holds a single dict, so you can map over it:

df['$oid'] = df['_id'].map(lambda x: x['$oid'])

Or expand each dict into its own columns:

s=df['_id'].apply(pd.Series)
s
#df=pd.concat([df,s],axis=1)
Out[601]: 
                       $oid
0  5b9058462f38434ab0d85cd3
1  5b9058462f38434ab0d85ce9
2  5b9058462f38434ab0d85ced
3  5b9058462f38434ab0d85cee
4  5b9058462f38434ab0d85cef
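
For reference, here is a minimal runnable sketch of both variants side by side. It assumes the frame is called dx as in the question and rebuilds only the first three rows of the example; the names expanded and joined are just illustrative.

```python
import pandas as pd

# Sample frame mirroring the question's data (first three rows only).
dx = pd.DataFrame({
    "_id": [
        {"$oid": "5b9058462f38434ab0d85cd3"},
        {"$oid": "5b9058462f38434ab0d85ce9"},
        {"$oid": "5b9058462f38434ab0d85ced"},
    ],
    "user_type": [1, 1, 1],
})

# Variant 1: pull a single key out of each dict into a new column.
dx["oid"] = dx["_id"].map(lambda d: d["$oid"])

# Variant 2: expand every dict into its own columns, then attach them.
# Every key of the dict becomes a column, so this can be slow on large frames.
expanded = dx["_id"].apply(pd.Series)
joined = pd.concat([dx, expanded], axis=1)

print(joined)
```

Variant 1 is usually the better fit when only one key is needed; variant 2 is convenient when most of the keys should end up as columns.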


jpp · Accepted Answer · 2018-12-27 23:01:14Z

2

You can use operator.itemgetter. No vectorised solution is possible since your input series has object dtype.

from operator import itemgetter

field_name = '$oid'
df[field_name] = list(map(itemgetter(field_name), df['_id']))

More generalised solutions are available; see "Splitting dictionary/list inside a Pandas Column into Separate Columns".
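
As one possible generalisation for deeper nesting, pd.json_normalize (available as a top-level function in pandas 1.0 and later) flattens nested dicts into dot-separated column names. The sketch below is only an illustration: the "meta" sub-dict is invented, since the question does not show the real structure.

```python
import pandas as pd

# Hypothetical data: the "meta" sub-dict is made up for illustration.
dx = pd.DataFrame({
    "_id": [
        {"$oid": "5b9058462f38434ab0d85cd3", "meta": {"source": "web"}},
        {"$oid": "5b9058462f38434ab0d85ce9", "meta": {"source": "app"}},
    ],
    "user_type": [1, 1],
})

# Flatten the list of dicts; nested keys become "parent.child" columns.
flat = pd.json_normalize(dx["_id"].tolist())

# json_normalize returns a fresh RangeIndex, so realign it before joining.
flat.index = dx.index
result = dx.join(flat)

print(result.columns.tolist())
```

Realigning the index matters when dx has been filtered or re-indexed; without it, the join could pair rows incorrectly.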
