How to normalize complex JSON structure with 4 levels of nested arrays?

Ask Question

Asked 2 years, 10 months ago

Modified 2 years, 9 months ago

Viewed 185 times

As I'm fairly new to python I've tried various ways based on answers found here but can't manage to normalize my json file. As I checked in Postman it has 4 levels of nested arrays. For suppliers I want to expand all the levels of data.The problem I have is with the score_card subtree for which I want to pull out all risk_groups , then name,risk_score and all riskstogether with name,risk_score and indicators with the deepest level containing again name and risk_score. and I'm not sure whether it is possible in one go based on ambiguous names between levels. Below I'm sharing a reproducible example:

data = {"suppliers": [
            {
                "id": 1260391,
                "name": "2712270 MANITOBA LTD",
                "duns": "rm-071-7291",
                "erp_number": "15189067;15189067",
                "material_group_ids": [
            176069
                ],
                "business_unit_ids": [
            13728
        ],
                "responsible_user_ids": [
            37761,
            37760,
            37759,
            37758,
            37757,
            37756,
            36520,
            37587,
            36494,
            22060,
            37742,
            36446,
            36289
        ],
                "address": {
                "address1": "BOX 550 NE 16-20-26",
                "address2": None,
                "zip_code": "R0J 1W0",
                "city": "RUSSELL",
                "state": None,
                "country_code": "ca",
                "latitude": 50.7731176,
                "longitude": -101.2862461
        },
                "score_card": {
                "risk_score": 26.13,
                "risk_groups": [
                    {
                    "name": "Viability",
                    "risk_score": 43.33,
                    "risks": [
                        {
                            "name": "Financial stability supplier",
                            "risk_score": None,
                            "indicators": [
                                {
                                    "name": "Acquisitions",
                                    "risk_score": None
                                }
                            ]
                        }
                    ]
                },
            ]
        }
    }
 ]
       }

And here is how it should look:

expected = [[1260391,'2712270 MANITOBA LTD','rm-071-7291',176069,13728,[37761,37760,        
37759,37758,37757,37756,36520,37587,36494,22060,37742,36446,36289],
'BOX 550 NE 16-20-26','None','R0J 1W0','RUSSELL','None','ca',50.7731176,
-101.2862461,26.13,'Viability',43.33,'Financial stability supplier','None',
'Acquisitions','None']]

df = pd.DataFrame(expected,columns=['id','name','duns','material_groups_ids',
'business_unit_ids','responsible_user_ids',
'address.address1','address.address2','address.zip_code',
'address.city','address.state','address.country_code','address.latitude',
'address.longitude','risk_score','risk_groups.name',
'risk_groups.risk_score','risk_groups.risks.name',
'risk_groups.risks.risk_score','risk_groups.risks.indicators.name',
'risk_groups.risks.indicators.risk_score'])

What I tried since last 3 days is to use json_normalize:

example = pd.json_normalize(data['suppliers'],
record_path=['score_card','risk_groups','risks',
'indicators'],meta = ['id','name','duns','erp_number','material_group_ids',
'business_unit_ids','responsible_user_ids','address',['score_card','risk_score']],record_prefix='_')

but when I specify the fields for record_path which seems to be a vital param here it always return this prompt:

TypeError: {'name': 'Acquisitions', 'risk_score': None} has non list value Acquisitions for path name. Must be list or null.

What I want to get a a table with columns as presented below:

So it seems that python doesn't know how to treat most granular level which contains null values.

I've tried approach from here: Use json_normalize to normalize json with nested arrays but unsuccessfully. I would like to get the same view which I was able to 'unpack' in Power BI but this way of having the data is not satisfying me. All risk indicators are the components of risks which are part of risk groups as You can see on attached pictures. Going to the most granular level it goes this way: Risk groups -> Risks -> Indicators so I would like to unpack it all.

Any help more than desperately needed. Kindly thank You for any support here.

edited Feb 14, 2023 at 13:24

asked Feb 2, 2023 at 11:49

Rafał Pietrak

456 bronze badges

Try adding record_prefix='_' to the parameters of json_normalize

sagi
– sagi

2023-02-02 11:53:21 +00:00
Commented Feb 2, 2023 at 11:53
Thanks @sagi, at least the code did not broke however it did not return any nested fields, right now I've added name from the deepest level to pull out and it gave me an error saying: " TypeError: {'name': 'Acquisitions', 'risk_score': None} has non list value Acquisitions for path name. Must be list or null." So should I do kind of transformation ?

Rafał Pietrak
– Rafał Pietrak

2023-02-02 12:02:10 +00:00
Commented Feb 2, 2023 at 12:02
It should still return nested fields but their names won't be the same. Try debugging the result and see which fields it holds.

sagi
– sagi

2023-02-02 12:03:48 +00:00
Commented Feb 2, 2023 at 12:03
1

Your json indenting is all over the place, so I can't tell what belongs where without scrolling all over the place. Please fix the indentation. Also, minimal reproducible example suggest minimal examples. Do you really need such a long json object to demonstrate this? Same comment for your expect results. Please do not just dump long streams of data on a single line.

MatBailie
– MatBailie

2023-02-07 09:22:36 +00:00
Commented Feb 7, 2023 at 9:22
1

Why did you add 'name' to record_path param? score_card.risk_groups.risks.indicators is a list of records; score_card.risk_groups.risks.indicators.name isn't. But there are bugs (pandas-dev/pandas#34465, pandas-dev/pandas#37782) that are blockers for extracting address.* and responsible_user_ids.

aaron
– aaron

2023-02-11 00:53:20 +00:00
Commented Feb 11, 2023 at 0:53

| Show 2 more comments

0 Your Answer

Sign up or log in

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.

Collectives™ on Stack Overflow

How to normalize complex JSON structure with 4 levels of nested arrays?

0

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

0

Know someone who can answer? Share a link to this question via email, Twitter, or Facebook.

Your Answer

Sign up or log in

Post as a guest

Linked