Python: Extracting a data frame from a highly nested JSON file

Question

I have a JSON file with many nested dictionaries and lists containing excess information that I do not want in my data frame. I have either deleted the unnecessary fluff or replaced it with '---'.

{'ID': 1,
 'SPEC': {'Name': 'STOCK_VAL',
  '---': '---',
  '---': '---',
  'Info': {'---': [{'---': '---', '---': '---', '---': '---'}],
   '---': [{'---': '---', '---': '---', '---': '---'}]},
  '---': '---',
  'RELEVANT_AFTER_ALL': [{'---': '---',
    'Max': 140.00,
    'Min': 100.00,
    '---': '---',
    'Name': 'Calculated',
    'Units': 'USD/D',
    '---': '---',
    'Entries': [{'Timestamp': '2022-03-16T23:00:00Z', 'Value': 100.00},
     {'Timestamp': '2022-03-17T23:00:00Z', 'Value': 120.00},
     {'Timestamp': '2022-03-18T23:00:00Z', 'Value': 140.00}],
    '---': '---'},
   {'---': '---',
    'Max': 160.00,
    'Min': 80.00,
    '---': '---',
    'Name': 'Realised',
    'Units': 'USD/D',
    '---': '---',
    'Entries': [{'Timestamp': '2022-03-16T23:00:00Z', 'Value': 160.00},
     {'Timestamp': '2022-03-17T23:00:00Z', 'Value': 120.00},
     {'Timestamp': '2022-03-18T23:00:00Z', 'Value': 80.00}],
    '---': '---'}]}}

From the data above I want to create the following data frame:

Timestamp	STOCK_VAL Calculated	STOCK_VAL Realised
2022-03-16T23:00:00Z	100.00	160.00
2022-03-17T23:00:00Z	120.00	120.00
2022-03-18T23:00:00Z	140.00	80.00

I have tried using pandas.json_normalize(), but I could not find an efficient way to extract the table in the shape shown above.

Thanks in advance for anyone who knows better!

It looks like the JSON data you shared is not valid. Could you please verify that the data you pasted here is valid JSON? You can use the following website: codebeautify.org/jsonviewer
– Baris Ozensel, Commented Mar 29, 2022 at 8:26
You are correct. This is already a formatted JSON extract. I'll check how to get raw JSON then.
– idiot_at_work, Commented Mar 29, 2022 at 8:33
The unfortunately named json_normalize does not, in fact, take JSON, but "unserialized JSON objects", so posting a Python structure instead of JSON is not an issue here. The biggest problems with the data you posted are that it is not complete (it is missing ]}} at the end) and that you anonymised one of the keys that is necessary to access the data you want.
– Amadan, Commented Mar 29, 2022 at 8:38

Amadan · Accepted Answer · 2022-03-29 09:05:37Z

One of the strings you replaced with '---' is relevant after all.

First we find the array where the data is located. Each item of this array becomes one series (one column), and from these series we can build the data frame.

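import pandas as pd

# `data` is the parsed structure from the question (a Python dict).
table_data = data['SPEC']['RELEVANT_AFTER_ALL']

# One column per item: index its entries by Timestamp, then squeeze the
# remaining single 'Value' column into a Series.
x = pd.DataFrame({
    f"STOCK_VAL {item['Name']}": pd.DataFrame(item['Entries']).set_index('Timestamp').squeeze()
    for item in table_data
})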
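
For anyone who wants to run this end to end, here is a minimal self-contained sketch. It assumes a trimmed copy of the question's structure (the '---' filler keys are left out) and makes two small changes that are not part of the answer itself: it reads the column prefix from data['SPEC']['Name'] instead of hard-coding it, and it calls squeeze("columns") so that an item with a single entry still yields a Series rather than a scalar.

```python
import pandas as pd

# Trimmed copy of the structure from the question; '---' filler keys omitted.
data = {
    "ID": 1,
    "SPEC": {
        "Name": "STOCK_VAL",
        "RELEVANT_AFTER_ALL": [
            {
                "Name": "Calculated",
                "Units": "USD/D",
                "Entries": [
                    {"Timestamp": "2022-03-16T23:00:00Z", "Value": 100.00},
                    {"Timestamp": "2022-03-17T23:00:00Z", "Value": 120.00},
                    {"Timestamp": "2022-03-18T23:00:00Z", "Value": 140.00},
                ],
            },
            {
                "Name": "Realised",
                "Units": "USD/D",
                "Entries": [
                    {"Timestamp": "2022-03-16T23:00:00Z", "Value": 160.00},
                    {"Timestamp": "2022-03-17T23:00:00Z", "Value": 120.00},
                    {"Timestamp": "2022-03-18T23:00:00Z", "Value": 80.00},
                ],
            },
        ],
    },
}

# Assumption: the column prefix is taken from the data rather than hard-coded.
prefix = data["SPEC"]["Name"]

# Build one Timestamp-indexed Series per item; pandas aligns them on the index.
result = pd.DataFrame({
    f"{prefix} {item['Name']}": (
        pd.DataFrame(item["Entries"])
        .set_index("Timestamp")
        .squeeze("columns")  # single 'Value' column -> Series
    )
    for item in data["SPEC"]["RELEVANT_AFTER_ALL"]
})

print(result)
```

Because the Series are aligned on Timestamp, a timestamp that is missing from one item would show up as NaN in that column instead of raising an error.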

EDIT: Replaced pd.json_normalize with pd.DataFrame, which suffices in this scenario.

EDIT 2: Added STOCK_VAL to the column names.
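
Since the first edit note mentions pd.json_normalize, here is a hedged sketch of how that route could look; it is an alternative, not what the answer above ends up using. It assumes table_data has the same shape as data['SPEC']['RELEVANT_AFTER_ALL'] (reduced here to the keys that matter), flattens the entries with record_path and meta, and then pivots to one column per series name.

```python
import pandas as pd

# Assumed stand-in for data['SPEC']['RELEVANT_AFTER_ALL'], reduced to the used keys.
table_data = [
    {
        "Name": "Calculated",
        "Entries": [
            {"Timestamp": "2022-03-16T23:00:00Z", "Value": 100.00},
            {"Timestamp": "2022-03-17T23:00:00Z", "Value": 120.00},
            {"Timestamp": "2022-03-18T23:00:00Z", "Value": 140.00},
        ],
    },
    {
        "Name": "Realised",
        "Entries": [
            {"Timestamp": "2022-03-16T23:00:00Z", "Value": 160.00},
            {"Timestamp": "2022-03-17T23:00:00Z", "Value": 120.00},
            {"Timestamp": "2022-03-18T23:00:00Z", "Value": 80.00},
        ],
    },
]

# Long format: one row per entry, with the parent item's Name carried along.
flat = pd.json_normalize(table_data, record_path="Entries", meta=["Name"])

# Wide format: one row per Timestamp, one column per series name.
wide = (
    flat.pivot(index="Timestamp", columns="Name", values="Value")
    .add_prefix("STOCK_VAL ")
)
wide.columns.name = None  # drop the leftover 'Name' label on the column axis

print(wide)
```

Note that pivot raises an error if the same Timestamp appears twice within one series, so this variant only fits data where each (Timestamp, Name) pair is unique.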

edited Mar 29, 2022 at 9:05

answered Mar 29, 2022 at 8:37

Amadan

2 Comments

idiot_at_work Over a year ago

You are correct. I will adjust where the missing link was that I foolishly dropped. Do you know by any chance how I could also get the STOCK_VAL in front of the available entries in the naming?

Amadan Over a year ago

:) Edited, please check.
