Transform complex/flattened JSON into DataFrame

Question

I have a complex/nested JSON that I need to transform into a DataFrame (Python). I managed the first part, but I'm struggling with the second.

import requests
import pandas as pd
import json

url = 'url'

headers = {'api-key':'key'}

resp = requests.get(url, headers = headers)
print(resp.status_code)

r = resp.content
r

responses = json.loads(r.decode('utf-8'))
responses

Output (responses)

{'count': 855,
 'requestAt': '2020-07-15T13:13:26.646+00:00',
 'data': {'00b3dc3a-b71e-4547-8910-44691a09cd53': {'registerId': '00b3dc3a-b71e-4547-8910-44691a09cd53',
   'count': 10,
   'milho_germoplasma': {'feedbackScore': 'good',
    'firstVisitAt': '2020-06-11T11:10:42.929-03:00',
    'lastVisitAt': '2020-06-15T15:36:43.027-03:00',
    'videosCompletedAt': '2020-06-11T11:19:58.753-03:00',
    'videosState': [{'completedAt': '2020-06-11T11:19:58.753-03:00',
      'completedCount': 1,
      'duration': 544.811,
      'firstPlayAt': '2020-06-11T11:10:50.170-03:00',
      'percent': 0.281,
      'playCount': 3,
      'seconds': 152.85,
      'updatedAt': '2020-06-15T15:38:13.711-03:00',
      'videoSrc': 'https://vimeo.com/420453289/b7c455699a'}],
    'visitsCount': 3,
    'stationId': 'milho_germoplasma'},
   'milho_plantio': {'feedbackScore': 'good',
    'firstVisitAt': '2020-06-11T10:37:42.509-03:00',
    'lastVisitAt': '2020-06-11T12:28:21.105-03:00',
    'videosCompletedAt': '2020-06-11T10:49:43.082-03:00',
    'videosState': [{'completedAt': '2020-06-11T10:49:43.082-03:00',
      'completedCount': 1,
      'duration': 700.459,
      'firstPlayAt': '2020-06-11T10:37:50.465-03:00',
      'percent': 0.042,
      'playCount': 2,
      'seconds': 29.18,
      'updatedAt': '2020-06-11T10:50:18.717-03:00',
      'videoSrc': 'https://player.vimeo.com/video/412760474'}],
    'visitsCount': 2,
    'stationId': 'milho_plantio'}}}}

I tried adapting some answers from Stack Overflow, but I could only solve part of it without errors:

response_list = []
for register_id in responses['data']:

    # get the keys of interest
    data = {k: v for k, v in responses['data'][register_id].items() if k in ['registerId', 'count']}

    response_list.append(data)

print(pd.DataFrame(response_list))

Output:

+--------------------------------------+-------+
|             registerId               | count |
+--------------------------------------+-------+
| 00b3dc3a-b71e-4547-8910-44691a09cd53 |    10 |
+--------------------------------------+-------+

I need to go one level deeper into this JSON and turn it into a DataFrame: each milho_germoplasma/milho_plantio/etc. entry should create a new row for the same registerId, with the data it contains.

Expected Output:

+--------------------------------------+-------+---------------+-------------------------------+----------------------------------+-------------------+
|              registerId              | count | feedbackScore |         firstVisitAt          |           lastVisitAt            |  …(last column)   |
+--------------------------------------+-------+---------------+-------------------------------+----------------------------------+-------------------+
| 00b3dc3a-b71e-4547-8910-44691a09cd53 |    10 | good          | 2020-06-11T11:10:42.929-03:00 | 2020-06-15T15:36:43.027-03:00    | milho_germoplasma |
| 00b3dc3a-b71e-4547-8910-44691a09cd53 |    10 | good          | 2020-06-11T10:37:42.509-03:00 | 2020-06-11T12:28:21.105-03:00    | milho_plantio     |
+--------------------------------------+-------+---------------+-------------------------------+----------------------------------+-------------------+

kaihami · Accepted Answer · 2020-07-15 16:15:05Z


Unpacking nested JSON is not trivial; in general, you can use a recursive approach to solve this problem.

If you have a fixed JSON structure, like the one you showed, a simpler approach is the following.

import pandas as pd

def unpack(data):
    # Flatten one station dict into {column name: [values]}
    f = {}
    for k, v in data.items():
        if isinstance(v, (int, float, str)):
            # scalar: one column named after the key
            if k in f:
                f[k].append(v)
            else:
                f[k] = [v]
        elif isinstance(v, (list, tuple)):
            # list of dicts (e.g. videosState): prefix inner keys with the outer key
            for ele in v:
                if isinstance(ele, dict):
                    for k2, val in ele.items():
                        key = f'{k}_{k2}'
                        if key in f:
                            f[key].append(val)
                        else:
                            f[key] = [val]
    return f

# responses is the parsed JSON from the question
dfs = []  # created once, so the frames from every id are kept
for _id in responses['data']:
    data = responses['data'].get(_id)
    # pop the top-level fields so only the station dicts remain in data
    # (note: this mutates responses)
    registerID = data.pop('registerId', None)
    count = data.pop('count', 0)
    for specie in data.keys():
        f = unpack(data.get(specie))
        aux_df = pd.DataFrame(f)
        aux_df['registerID'] = registerID
        aux_df['count'] = count
        dfs.append(aux_df)

df = pd.concat(dfs)
print(df)

Result:

  feedbackScore                   firstVisitAt                    lastVisitAt  \
0          good  2020-06-11T11:10:42.929-03:00  2020-06-15T15:36:43.027-03:00   
0          good  2020-06-11T10:37:42.509-03:00  2020-06-11T12:28:21.105-03:00   

               videosCompletedAt        videosState_completedAt  \
0  2020-06-11T11:19:58.753-03:00  2020-06-11T11:19:58.753-03:00   
0  2020-06-11T10:49:43.082-03:00  2020-06-11T10:49:43.082-03:00   

   videosState_completedCount  videosState_duration  \
0                           1               544.811   
0                           1               700.459   

         videosState_firstPlayAt  videosState_percent  videosState_playCount  \
0  2020-06-11T11:10:50.170-03:00                0.281                      3   
0  2020-06-11T10:37:50.465-03:00                0.042                      2   

   videosState_seconds          videosState_updatedAt  \
0               152.85  2020-06-15T15:38:13.711-03:00   
0                29.18  2020-06-11T10:50:18.717-03:00   

                       videosState_videoSrc  visitsCount          stationId  \
0    https://vimeo.com/420453289/b7c455699a            3  milho_germoplasma   
0  https://player.vimeo.com/video/412760474            2      milho_plantio   

                             registerID  count  
0  00b3dc3a-b71e-4547-8910-44691a09cd53     10  
0  00b3dc3a-b71e-4547-8910-44691a09cd53     10
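
The recursive approach mentioned at the top of this answer could look roughly like the sketch below. It is not the code that produced the result above: `flatten` and `to_frame` are hypothetical helper names, and `sample` is a shortened, made-up stand-in for the response in the question. It assumes every station is a dict sitting next to the scalar fields `registerId` and `count`, and it builds one dict per row so that keys missing from some stations simply end up as NaN.

```python
import pandas as pd


def flatten(value, prefix=""):
    """Recursively turn a nested value into a list of flat row dicts."""
    if isinstance(value, dict):
        rows = [{}]
        for key, child in value.items():
            name = f"{prefix}_{key}" if prefix else key
            child_rows = flatten(child, name)
            # Combine what we have so far with every row produced by the child.
            rows = [{**row, **extra} for row in rows for extra in child_rows]
        return rows
    if isinstance(value, (list, tuple)):
        # Each list item becomes its own row (or rows); an empty list adds nothing.
        rows = [row for item in value for row in flatten(item, prefix)]
        return rows or [{}]
    # Scalar (str, number, bool, None): a single column named after the key path.
    return [{prefix: value}]


def to_frame(responses):
    """Build one row per station (and per list item inside it) for every register."""
    rows = []
    for register in responses["data"].values():
        # Top-level non-dict fields such as registerId and count repeat on every row.
        top = {k: v for k, v in register.items() if not isinstance(v, dict)}
        for station in register.values():
            if isinstance(station, dict):
                rows.extend({**top, **row} for row in flatten(station))
    return pd.DataFrame(rows)


# Hypothetical, shortened data in the same shape as the question's response.
sample = {
    "data": {
        "00b3dc3a": {
            "registerId": "00b3dc3a",
            "count": 10,
            "milho_germoplasma": {
                "feedbackScore": "good",
                "videosState": [{"playCount": 3}, {"playCount": 1}],
                "stationId": "milho_germoplasma",
            },
            "milho_plantio": {
                # no feedbackScore here, and no videos
                "videosState": [],
                "stationId": "milho_plantio",
            },
        }
    }
}

df = to_frame(sample)
print(df)
```

Unlike `unpack`, this does not modify the input, and a station with several `videosState` entries yields one row per entry. One caveat of this design: if a single dict holds two different lists, the rows are the cross product of both, which may or may not be what you want.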


3 Comments

kaihami Over a year ago

Due to time constraints I haven't finished it, but it solves your problem. A more elegant solution would be a recursive approach that expands the unpack function.

kaihami Over a year ago

Also, good to know you're from Brazil and work @ DASA. If you want to keep in touch, you can DM me on LinkedIn (Gilberto Kaihami).

Felipe Ribeiro Over a year ago

Hi Gilberto, thanks for the answer! Your code works well with the example I gave here, but unfortunately it doesn't work when I try it with the full JSON. I didn't expect that, but I don't receive a fixed JSON structure. Some users don't answer "feedbackScore", for example, so it doesn't show up. I'm going to send you a contact request on LinkedIn, thanks!
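
The comment does not say which error the full JSON produced, so the following is only a guess at one likely cause. A dict-of-lists like the one `unpack` returns needs every column to have the same length, which stops holding as soon as a station has more than one `videosState` entry. Building a list of row dicts instead tolerates missing keys such as `feedbackScore`. The values below are made up for illustration.

```python
import pandas as pd

# Dict-of-lists: one scalar field next to a two-item list gives columns of
# unequal length, which pandas refuses to turn into a frame.
columns = {"stationId": ["milho_plantio"], "videosState_playCount": [2, 5]}
try:
    pd.DataFrame(columns)
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc
print(mismatch_error)

# List-of-dicts: each dict is one row, and a key missing from a row becomes NaN.
rows = [
    {"stationId": "milho_germoplasma", "feedbackScore": "good"},
    {"stationId": "milho_plantio"},  # this user gave no feedbackScore
]
df = pd.DataFrame(rows)
print(df)
```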
