Display nested JSON as a dataframe with all values in their own column

Question

I'm processing a data set that originated from JSON and want to end up with the format below. This is my expected result:

              category                 type value  value_type  mandatory
0  business-activities  activities-hot-work  true     boolean       true
1   business-employees  employees-full-time     6     integer       true

What I'm getting is the following from df:

              category                 type value                                       validation
0  business-activities  activities-hot-work  true  [{'value_type': 'boolean'}, {'mandatory': True}]
1   business-employees  employees-full-time     6  [{'value_type': 'integer'}, {'mandatory': True}]

Here's my script

import json
import pandas as pd

# broker data received
with open(r'C:\Users\mattl\OneDrive\Everything\Documents\Data files for Python Dev\characteristic_values.json') as data_in:
    data = json.load(data_in)
data_inbound = pd.json_normalize(data, 'characteristics',  record_prefix='')

# validation data used to process validation on data received
with open(r'C:\Users\mattl\OneDrive\Everything\Documents\Data files for Python Dev\characteristic_validation.json') as data_val:
    data = json.load(data_val)
data_validation = pd.json_normalize(data, 'characteristics',  record_prefix='')

# merge validation with broker data and normalise the data
df = pd.merge(data_inbound, data_validation, on=['category','type'])

print(df)

# Show results
print('Merged and exploded Result')
dfa = df.explode('validation')
print(dfa)

My input files

Characteristic_values.json

{
 "characteristics" :[
     {
         "category" : "business-activities",
         "type" : "activities-hot-work",
         "value" : "true"
     },
     {
         "category" : "business-employees",
         "type" : "employees-full-time",
         "value" : "6"
     }
     ]
}

Characteristic_validation.json

{
    "characteristics": [
        {
            "category": "business-activities",
            "type": "activities-hot-work",
            "validation": [
                {"value_type": "boolean"},
                {"mandatory": true}
            ]
        },
        {
            "category": "business-employees",
            "type": "employees-full-time",
            "validation": [
                {"value_type": "integer"},
                {"mandatory": true}
            ]
        }
    ]
}

What have I tried already?

I adapted a call from a tutorial on handling nested JSON. It throws an error I cannot figure out, though I might be on the right track:

characteristics_data = pd.json_normalize(data=df, record_path='validation', meta=['category', 'type', 'value'])

Error Messages

  File "C:\Users\mattl\PycharmProjects\jsonValidation\main.py", line 25, in <module>
    characteristics_data = pd.json_normalize(data=df, record_path='validation',
  File "C:\Users\mattl\PycharmProjects\jsonValidation\venv\lib\site-packages\pandas\io\json\_normalize.py", line 504, in _json_normalize
    _recursive_extract(data, record_path, {}, level=0)
  File "C:\Users\mattl\PycharmProjects\jsonValidation\venv\lib\site-packages\pandas\io\json\_normalize.py", line 477, in _recursive_extract
    recs = _pull_records(obj, path[0])
  File "C:\Users\mattl\PycharmProjects\jsonValidation\venv\lib\site-packages\pandas\io\json\_normalize.py", line 399, in _pull_records
    result = _pull_field(js, spec)
  File "C:\Users\mattl\PycharmProjects\jsonValidation\venv\lib\site-packages\pandas\io\json\_normalize.py", line 390, in _pull_field
    result = result[spec]
TypeError: string indices must be integers

I hope I have provided enough information to explain my issue - thanks

You'll note I also tried dfa = df.explode('validation'), but that gives me 4 rows and the values inside the objects are still unusable. I left it in just in case that direction ends up being better than normalisation.
– Matt Lightbourn, Commented Dec 3, 2021 at 3:44
I've also tried result = pd.json_normalize(df, 'validation', ['category', 'type', 'value' ['value_type', 'mandatory']]), but it fails with TypeError: string indices must be integers again.
– Matt Lightbourn, Commented Dec 3, 2021 at 3:54

Timus · Accepted Answer · 2021-12-03 11:04:59Z

With your dataframe df

              category                 type value                                       validation
0  business-activities  activities-hot-work  true  [{'value_type': 'boolean'}, {'mandatory': True}]
1   business-employees  employees-full-time     6  [{'value_type': 'integer'}, {'mandatory': True}]

you could do

df = pd.concat(
    [
        df.drop(columns="validation"),
        # merge each row's two single-key dicts into one dict: one column per key
        pd.DataFrame({**l[0], **l[1]} for l in df.validation)
    ],
    axis=1
)

or a bit more generic

from itertools import chain

df = pd.concat(
    [
        df.drop(columns="validation"),
        # chain the items of every dict in the row's list, however many there are
        pd.DataFrame(dict(chain(*(d.items() for d in l))) for l in df.validation)
    ],
    axis=1
)

Result:

              category                 type value value_type  mandatory
0  business-activities  activities-hot-work  true    boolean       True
1   business-employees  employees-full-time     6    integer       True
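
For anyone who wants to run the generic version end to end without the JSON files, here is a minimal sketch. The frame is rebuilt inline from the printed df above rather than loaded from disk. The names flat and result are illustrative only, chain.from_iterable is used as an equivalent of chain(*...), and passing index=df.index is an added precaution (an assumption, not part of the answer as posted) so that the axis=1 concat still lines up if df does not carry the default 0..n-1 index.

```python
from itertools import chain

import pandas as pd

# Merged frame rebuilt inline from the question's printed output (no files needed).
df = pd.DataFrame(
    {
        "category": ["business-activities", "business-employees"],
        "type": ["activities-hot-work", "employees-full-time"],
        "value": ["true", "6"],
        "validation": [
            [{"value_type": "boolean"}, {"mandatory": True}],
            [{"value_type": "integer"}, {"mandatory": True}],
        ],
    }
)

# Flatten each row's list of small dicts into a single dict per row.
flat = pd.DataFrame(
    [dict(chain.from_iterable(d.items() for d in row)) for row in df["validation"]],
    index=df.index,  # assumption: reuse df's labels so concat aligns row by row
)

# Swap the nested column for the flattened columns.
result = pd.concat([df.drop(columns="validation"), flat], axis=1)
print(result)
```

If two dicts in the same list share a key, the later one wins, because dict() keeps the last value seen for a repeated key.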

2 Comments

Matt Lightbourn Over a year ago

That sounds awesome, I'll try it out later this evening and let you know. Based on the expected result, it looks perfect, thanks.

Matt Lightbourn Over a year ago

thank you, works perfectly, I used the more generic approach
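
A note on why the json_normalize attempts in the question raised TypeError: string indices must be integers. json_normalize expects a dict or a list of dicts, and iterating over a DataFrame yields its column labels (strings) rather than its rows, so the lookup of 'validation' appears to end up indexing into a column name. The sketch below shows that, then passes row dicts instead. It is only an illustration: long_form and collapsed are made-up names, and the groupby(...).first() step is one possible way to fold the rows back together, not something from the answer above.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "category": ["business-activities", "business-employees"],
        "type": ["activities-hot-work", "employees-full-time"],
        "value": ["true", "6"],
        "validation": [
            [{"value_type": "boolean"}, {"mandatory": True}],
            [{"value_type": "integer"}, {"mandatory": True}],
        ],
    }
)

# Iterating a DataFrame gives column labels, which is what json_normalize
# ends up walking when it is handed df directly.
print(list(df))

# Row dicts are input json_normalize can walk, but every dict in each
# 'validation' list becomes its own row, so this is still 4 rows.
records = df.to_dict("records")
long_form = pd.json_normalize(
    records, record_path="validation", meta=["category", "type", "value"]
)

# Fold back to one row per characteristic, taking the first non-null
# value in each column.
collapsed = long_form.groupby(["category", "type", "value"], as_index=False).first()
print(collapsed)
```

This route takes more steps than the concat approach, which is why flattening the dicts directly is the simpler fit for this data.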
