I have the following pandas DataFrame (named dx):

                                    _id  user_type
0  {'$oid': '5b9058462f38434ab0d85cd3'}          1
1  {'$oid': '5b9058462f38434ab0d85ce9'}          1
2  {'$oid': '5b9058462f38434ab0d85ced'}          1
3  {'$oid': '5b9058462f38434ab0d85cee'}          1
4  {'$oid': '5b9058462f38434ab0d85cef'}          1

It came from a large JSON dataset (roughly 500,000 rows and more columns than in this example). As you can see, the _id column contains nested JSON.

For this example, my target is to get a new column named oid with the oid code:

                                    _id  user_type                       oid
0  {'$oid': '5b9058462f38434ab0d85cd3'}          1  5b9058462f38434ab0d85cd3
1  {'$oid': '5b9058462f38434ab0d85ce9'}          1  5b9058462f38434ab0d85ce9
2  {'$oid': '5b9058462f38434ab0d85ced'}          1  5b9058462f38434ab0d85ced
3  {'$oid': '5b9058462f38434ab0d85cee'}          1  5b9058462f38434ab0d85cee
4  {'$oid': '5b9058462f38434ab0d85cef'}          1  5b9058462f38434ab0d85cef

I achieved the results using the following snippet:

dx['oid'] = None
for i in dx.index:
    dx['oid'][i] = dx.at[i, '_id']['$oid']

This gives me what I am looking for, but it is very slow: it takes 3-4 minutes to populate the column in the example, which has only 5 rows!

How can I optimize the creation of a new column based on other columns with JSON values?

I can't use regex because the real nested JSON is more complex than the one in the example.

2 Answers

The column has object dtype and each cell holds a single dict, so you can map over it:

df['$oid']=df['_id'].map(lambda x : x['$oid'])

Or

s=df['_id'].apply(pd.Series)
s
#df=pd.concat([df,s],axis=1)
Out[601]: 
                       $oid
0  5b9058462f38434ab0d85cd3
1  5b9058462f38434ab0d85ce9
2  5b9058462f38434ab0d85ced
3  5b9058462f38434ab0d85cee
4  5b9058462f38434ab0d85cef
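
For reference, here is a self-contained sketch of the map approach that rebuilds the five sample rows from the question. The `extract_oid` helper and its handling of cells that are not dicts are assumptions added for illustration; the question does not say whether such cells occur in the real data.

```python
import pandas as pd

# Rebuild the sample from the question (the real frame is much larger).
df = pd.DataFrame({
    "_id": [
        {"$oid": "5b9058462f38434ab0d85cd3"},
        {"$oid": "5b9058462f38434ab0d85ce9"},
        {"$oid": "5b9058462f38434ab0d85ced"},
        {"$oid": "5b9058462f38434ab0d85cee"},
        {"$oid": "5b9058462f38434ab0d85cef"},
    ],
    "user_type": [1, 1, 1, 1, 1],
})


def extract_oid(cell):
    """Hypothetical helper: return the '$oid' value, or None if the cell is not a dict."""
    return cell.get("$oid") if isinstance(cell, dict) else None


# One pass over the column, assigning the whole result at once
# instead of writing cell by cell inside a Python loop.
df["oid"] = df["_id"].map(extract_oid)
print(df[["user_type", "oid"]])
```

Assigning the whole column in one statement avoids the per-row chained assignment used in the question's loop.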
You can use operator.itemgetter. No vectorised solution is possible since your input series has object dtype.

from operator import itemgetter

field_name = '$oid'
df[field_name] = list(map(itemgetter(field_name), df['_id']))

More generalised solutions are available; see Splitting dictionary/list inside a Pandas Column into Separate Columns.
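
As one possible example of a more general route, `pd.json_normalize` can flatten a column of nested dicts into separate columns. This is only a sketch: the `meta` / `source` keys below are made-up data standing in for the "more complex" nesting mentioned in the question, and it assumes every cell in `_id` is a dict.

```python
import pandas as pd

# Hypothetical data: '_id' cells with more nesting than the question's sample.
df = pd.DataFrame({
    "_id": [
        {"$oid": "5b9058462f38434ab0d85cd3", "meta": {"source": "a"}},
        {"$oid": "5b9058462f38434ab0d85ce9", "meta": {"source": "b"}},
    ],
    "user_type": [1, 1],
})

# Flatten each dict; nested keys are joined with '.' by default,
# so {'meta': {'source': ...}} becomes a 'meta.source' column.
flat = pd.json_normalize(df["_id"].tolist())

# Align the flattened frame with the original index before joining.
flat.index = df.index
result = df.join(flat)
print(result[["user_type", "$oid", "meta.source"]])
```

If only one key is needed, the `itemgetter` version above does less work; flattening pays off when several nested fields should become columns.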
