1

New to Python and Pandas, working on getting the hang of jsons. Any help appreciated.

Via an API I'm pulling a nested json. The structure of the json is below. The fields I'm after are in view, labeled user_id and message, and then under the nested field replies the subfields user_id and message. The desired fields are labeled below with <<<

  ],
  "view": [
    {
      "id": 109205,
      "user_id": 6354, # <<<< this field
      "parent_id": null,
      "created_at": "2020-11-03T23:32:49Z",
      "updated_at": "2020-11-03T23:32:49Z",
      "rating_count": null,
      "rating_sum": null,
      "message": "message text1", # <<< this field
      "replies": [
        {
          "id": 109298,
          "user_id": 5457, # <<< this field
          "parent_id": 109205,
          "created_at": "2020-11-04T19:42:59Z",
          "updated_at": "2020-11-04T19:42:59Z",
          "rating_count": null,
          "rating_sum": null,
          "message": "message text2" # <<< this field
        },
        {
         #json continues

I can successfully pull the top level fields under view, but I'm having difficulty flattening the nested json field replies with json_normalize. Here's my working code:

import pandas as pd

d = r.json() # json pulled from API

df = pd.json_normalize(d['view'], record_path=['replies'])

print(df)

Which results in the following KeyError:

Traceback (most recent call last):
  File "C:\Users\danie\AppData\Local\Temp\atom_script_tempfiles\2021720-13268-1xuqx61.3oh2g", line 53, in <module>
    df = pd.json_normalize(d['view'], record_path=['replies'])
  File "C:\Users\danie\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\json\_normalize.py", line 336, in _json_normalize
    _recursive_extract(data, record_path, {}, level=0)
  File "C:\Users\danie\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\json\_normalize.py", line 309, in _recursive_extract
    recs = _pull_records(obj, path[0])
  File "C:\Users\danie\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\json\_normalize.py", line 248, in _pull_records
    result = _pull_field(js, spec)
  File "C:\Users\danie\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\json\_normalize.py", line 239, in _pull_field
    result = result[spec]
KeyError: 'replies'

What am I missing here? All suggestions welcome and appreciated.

1 Answer 1

3

You’re trying to flatten 2 different “depths” in the json file, which can’t be done in a single json_normalize call. You could simply use 2 pd.json_normalize since all entries contain ids to match all the parsed data later:

>>> pd.json_normalize(d, record_path='view')
       id  user_id parent_id            created_at            updated_at rating_count rating_sum        message                                            replies
0  109205     6354      None  2020-11-03T23:32:49Z  2020-11-03T23:32:49Z         None       None  message text1  [{'id': 109298, 'user_id': 5457, 'parent_id': ...
>>> pd.json_normalize(d, record_path=['view', 'replies'])
       id  user_id  parent_id            created_at            updated_at rating_count rating_sum        message
0  109298     5457     109205  2020-11-04T19:42:59Z  2020-11-04T19:42:59Z         None       None  message text2
1  109299     5457     109205  2020-11-04T19:42:59Z  2020-11-04T19:42:59Z         None       None  message text3

(I’ve added as second reply to your example with same data and id incremented by 1 so we can see what happens for several replies per view.)

Alternately, you can use your second pd.json_normalize on the replies column of your previous result, which is probably less work. This is more interesting if you .explode() the column first to get one row per reply:

>>> pd.json_normalize(view['replies'].explode())
       id  user_id  parent_id            created_at            updated_at rating_count rating_sum        message
0  109298     5457     109205  2020-11-04T19:42:59Z  2020-11-04T19:42:59Z         None       None  message text2
1  109299     5457     109205  2020-11-04T19:42:59Z  2020-11-04T19:42:59Z         None       None  message text3

So here’s a way to construct a single dataframe with all the info:

>>> view = pd.json_normalize(d, record_path='view')
>>> df = pd.merge(
...     view.drop(columns=['replies']),
...     pd.json_normalize(view['replies'].explode()),
...     left_on='id', right_on='parent_id', how='right',
...     suffixes=('_view', '_reply')
... )
>>> df
   id_view  user_id_view parent_id_view       created_at_view       updated_at_view rating_count_view rating_sum_view   message_view  id_reply  user_id_reply  parent_id_reply      created_at_reply      updated_at_reply rating_count_reply rating_sum_reply  message_reply
0   109205          6354           None  2020-11-03T23:32:49Z  2020-11-03T23:32:49Z              None            None  message text1    109298           5457           109205  2020-11-04T19:42:59Z  2020-11-04T19:42:59Z               None             None  message text2
1   109205          6354           None  2020-11-03T23:32:49Z  2020-11-03T23:32:49Z              None            None  message text1    109299           5457           109205  2020-11-04T19:42:59Z  2020-11-04T19:42:59Z               None             None  message text3
>>> df[['user_id_view', 'message_view', 'user_id_reply', 'message_reply']]
   user_id_view   message_view  user_id_reply  message_reply
0          6354  message text1           5457  message text2
1          6354  message text1           5457  message text3
Sign up to request clarification or add additional context in comments.

2 Comments

personally I'd go with the two tiered approach - two sep dataframes, then then aggregate if necessary and merge.
This was a big help! Greatly appreciated. I'm able to pull out the nested fields now, although some null values required adding dropna. Thanks again!

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.