I have a complex, nested JSON that I need to transform into a pandas DataFrame (Python). I got the first part working, but I'm struggling with the second part.

import requests
import pandas as pd
import json

url = 'url'

headers = {'api-key':'key'}

resp = requests.get(url, headers = headers)
print(resp.status_code)

r = resp.content
r

responses = json.loads(r.decode('utf-8'))
responses

Output (responses)

{'count': 855,
 'requestAt': '2020-07-15T13:13:26.646+00:00',
 'data': {'00b3dc3a-b71e-4547-8910-44691a09cd53': {'registerId': '00b3dc3a-b71e-4547-8910-44691a09cd53',
   'count': 10,
   'milho_germoplasma': {'feedbackScore': 'good',
    'firstVisitAt': '2020-06-11T11:10:42.929-03:00',
    'lastVisitAt': '2020-06-15T15:36:43.027-03:00',
    'videosCompletedAt': '2020-06-11T11:19:58.753-03:00',
    'videosState': [{'completedAt': '2020-06-11T11:19:58.753-03:00',
      'completedCount': 1,
      'duration': 544.811,
      'firstPlayAt': '2020-06-11T11:10:50.170-03:00',
      'percent': 0.281,
      'playCount': 3,
      'seconds': 152.85,
      'updatedAt': '2020-06-15T15:38:13.711-03:00',
      'videoSrc': 'https://vimeo.com/420453289/b7c455699a'}],
    'visitsCount': 3,
    'stationId': 'milho_germoplasma'},
   'milho_plantio': {'feedbackScore': 'good',
    'firstVisitAt': '2020-06-11T10:37:42.509-03:00',
    'lastVisitAt': '2020-06-11T12:28:21.105-03:00',
    'videosCompletedAt': '2020-06-11T10:49:43.082-03:00',
    'videosState': [{'completedAt': '2020-06-11T10:49:43.082-03:00',
      'completedCount': 1,
      'duration': 700.459,
      'firstPlayAt': '2020-06-11T10:37:50.465-03:00',
      'percent': 0.042,
      'playCount': 2,
      'seconds': 29.18,
      'updatedAt': '2020-06-11T10:50:18.717-03:00',
      'videoSrc': 'https://player.vimeo.com/video/412760474'}],
    'visitsCount': 2,
    'stationId': 'milho_plantio'}}}}

I adapted some answers from Stack Overflow, but I could only solve part of it without errors:

response_list = []
for register_id in responses['data']:

    # keep only the keys of interest
    data = {k: v for k, v in responses['data'][register_id].items() if k in ['registerId', 'count']}

    response_list.append(data)

print(pd.DataFrame(response_list))

Output:

+--------------------------------------+-------+
|             registerId               | count |
+--------------------------------------+-------+
| 00b3dc3a-b71e-4547-8910-44691a09cd53 |    10 |
+--------------------------------------+-------+

I need to go one level deeper into this JSON and turn it into a DataFrame: each milho_germoplasma/milho_plantio/etc. entry should create a new row for the same registerId, holding the data inside that entry.

Expected Output:

+--------------------------------------+-------+---------------+-------------------------------+----------------------------------+-------------------+
|              registerId              | count | feedbackScore |         firstVisitAt          |           lastVisitAt            |  …(last column)   |
+--------------------------------------+-------+---------------+-------------------------------+----------------------------------+-------------------+
| 00b3dc3a-b71e-4547-8910-44691a09cd53 |    10 | good          | 2020-06-11T11:10:42.929-03:00 | 2020-06-15T15:36:43.027-03:00    | milho_germoplasma |
| 00b3dc3a-b71e-4547-8910-44691a09cd53 |    10 | good          | 2020-06-11T10:37:42.509-03:00 | 2020-06-11T12:28:21.105-03:00    | milho_plantio     |
+--------------------------------------+-------+---------------+-------------------------------+----------------------------------+-------------------+

1 Answer

Unpacking a nested JSON is not trivial. In general, you can solve this with a recursive approach.

If the JSON structure is fixed, as in the example you showed, a simpler approach is the following:

import pandas as pd

def unpack(data):
    """Flatten one station dict into a dict of column name -> list of values."""
    f = {}
    for k, v in data.items():
        if isinstance(v, (int, float, str)):
            # scalar values become a one-element column
            f.setdefault(k, []).append(v)
        elif isinstance(v, (list, tuple)):
            # lists of dicts (e.g. videosState) become prefixed columns
            for ele in v:
                if isinstance(ele, dict):
                    for k2, val in ele.items():
                        f.setdefault(f'{k}_{k2}', []).append(val)
    return f

# responses is the parsed JSON from the question
dfs = []  # created once, so the frames of every register are kept
for _id in responses['data']:
    data = dict(responses['data'][_id])  # shallow copy, so pop() leaves responses intact
    registerID = data.pop('registerId', None)
    count = data.pop('count', 0)
    for specie in data:
        f = unpack(data[specie])
        aux_df = pd.DataFrame(f)
        aux_df['registerID'] = registerID
        aux_df['count'] = count
        dfs.append(aux_df)

df = pd.concat(dfs)
print(df)

Result:

  feedbackScore                   firstVisitAt                    lastVisitAt  \
0          good  2020-06-11T11:10:42.929-03:00  2020-06-15T15:36:43.027-03:00   
0          good  2020-06-11T10:37:42.509-03:00  2020-06-11T12:28:21.105-03:00   

               videosCompletedAt        videosState_completedAt  \
0  2020-06-11T11:19:58.753-03:00  2020-06-11T11:19:58.753-03:00   
0  2020-06-11T10:49:43.082-03:00  2020-06-11T10:49:43.082-03:00   

   videosState_completedCount  videosState_duration  \
0                           1               544.811   
0                           1               700.459   

         videosState_firstPlayAt  videosState_percent  videosState_playCount  \
0  2020-06-11T11:10:50.170-03:00                0.281                      3   
0  2020-06-11T10:37:50.465-03:00                0.042                      2   

   videosState_seconds          videosState_updatedAt  \
0               152.85  2020-06-15T15:38:13.711-03:00   
0                29.18  2020-06-11T10:50:18.717-03:00   

                       videosState_videoSrc  visitsCount          stationId  \
0    https://vimeo.com/420453289/b7c455699a            3  milho_germoplasma   
0  https://player.vimeo.com/video/412760474            2      milho_plantio   

                             registerID  count  
0  00b3dc3a-b71e-4547-8910-44691a09cd53     10  
0  00b3dc3a-b71e-4547-8910-44691a09cd53     10  
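
The recursive approach mentioned at the top could look roughly like the sketch below. It is only an illustration, not a finished solution: the helper names flatten and to_rows and the small sample dict are made up for this example, and it assumes that every station is a dict and that videosState, when present, is a list of dicts. Each row is built as a plain dict, so a key that is missing for some station (for example feedbackScore) simply ends up as NaN in that row.

```python
import pandas as pd

def flatten(obj, prefix=""):
    """Recursively flatten nested dicts into one level; lists are left as they are."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat

def to_rows(responses, list_key="videosState"):
    """Build one row per station and per entry of station[list_key]."""
    rows = []
    for register in responses["data"].values():
        # scalar fields of the register (registerId, count) are shared by all its rows
        shared = {k: v for k, v in register.items() if not isinstance(v, dict)}
        for station in register.values():
            if not isinstance(station, dict):
                continue
            base = flatten({k: v for k, v in station.items() if k != list_key})
            # a station without videos still yields one row
            for video in station.get(list_key) or [{}]:
                rows.append({**shared, **base, **flatten(video, prefix=f"{list_key}_")})
    return rows

# Made-up sample: the second station has no feedbackScore and two videos
sample = {"data": {"id-1": {
    "registerId": "id-1",
    "count": 10,
    "milho_germoplasma": {"feedbackScore": "good",
                          "videosState": [{"playCount": 3}],
                          "visitsCount": 3,
                          "stationId": "milho_germoplasma"},
    "milho_plantio": {"videosState": [{"playCount": 2}, {"playCount": 1}],
                      "visitsCount": 2,
                      "stationId": "milho_plantio"},
}}}

rows = to_rows(sample)
df = pd.DataFrame(rows)  # keys missing from a row are filled with NaN
print(df)
```

With the real data you would call to_rows(responses) instead of to_rows(sample). Note that a station with several videos produces several rows, which may or may not be what you want.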

3 Comments

I ran out of time to finish it, but this solves your problem. A more elegant solution would be a recursive one that expands the unpack function.
Also, good to know you're from Brazil and work at DASA. If you want to keep in touch, you can DM me on LinkedIn (Gilberto Kaihami).
Hi Gilberto, thanks for the answer! Your code works well with the example I gave here, but unfortunately it doesn't work with the full JSON. I didn't expect this, but the JSON I receive doesn't have a fixed structure: some users don't answer "feedbackScore", for example, so that key doesn't show up. I'm going to send you a contact request on LinkedIn, thanks!
