
I am trying to take the following deeply nested JSON and eventually turn it into a CSV. Below is just a sample; the full JSON is huge (12 GB).

{'reporting_entity_name':'Blue Cross and Blue Shield of Alabama', 
'reporting_entity_type':'health insurance issuer', 
'last_updated_on':'2022-06-10',
'version':'1.1.0', 
'in_network':[
            {'negotiation_arrangement': 'ffs', 
            'name': 'xploration of Kidney', 
            'billing_code_type': 'CPT', 
            'billing_code_type_version': '2022', 
            'billing_code': '50010', 
            'description': 'Renal Exploration, Not Necessitating Other Specific Procedures', 
            'negotiated_rates': [{
                'negotiated_prices': [
                    {'negotiated_type': 'negotiated', 
                    'negotiated_rate': 993.0, 
                    'expiration_date': '2022-06-30', 
                    'service_code': ['21', '22', '24'], 
                    'billing_class': 'professional'}, 
                    {'negotiated_type': 'negotiated', 
                    'negotiated_rate': 1180.68, 
                    'expiration_date': '2022-06-30', 
                    'service_code': ['21', '22', '24'], 
                    'billing_class': 'professional'}, 
                    {'negotiated_type': 'negotiated', 
                    'negotiated_rate': 1283.95, 
                    'expiration_date': '2022-06-30', 
                    'service_code': ['21', '22', '24'], 
                    'billing_class': 'professional'}, 
                    {'negotiated_type': 'negotiated', 
                    'negotiated_rate': 1042.65, 
                    'expiration_date': '2022-06-30', 
                    'service_code': ['21', '22', '24'], 
                    'billing_class': 'professional'}, 
                    {'negotiated_type': 'negotiated', 
                    'negotiated_rate': 1290.9, 
                    'expiration_date': '2022-06-30', 
                    'service_code': ['21', '22', '24'], 
                    'billing_class': 'professional'}, 
                    {'negotiated_type': 'negotiated', 
                    'negotiated_rate': 1241.25, 
                    'expiration_date': '2022-06-30', 
                    'service_code': ['21', '22', '24'], 
                    'billing_class': 'professional'}
                    ]}]}]}

The end goal is to have a data frame, or dictionary, that I can then write to a CSV. I hope for each row to have the columns:

{'reporting_entity_name':'','reporting_entity_type':'','last_updated_on':'','version':'','negotiation_arrangement':'','name':'','billing_code_type':'','billing_code_type_version':'','billing_code':'','description':'','provider_groups':'','negotiated_type':'','negotiated_rate':'','expiration_date':'','service_code':'','billing_class':''}

So far I have tried pandas' json_normalize, flatten, and a few custom modules I found on GitHub, but none seem to normalize/flatten the data into new rows, only columns. Because this is such a huge dataset, I'm worried that doing this recursively in a bunch of nested loops will quickly eat up all my memory. Thanks in advance for any advice you can offer!
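For concreteness, the naive nested-loop version I was hoping to avoid looks like the sketch below. Written as a generator it at least yields one flat row at a time rather than materializing everything, though it still assumes the whole JSON object is already parsed (the `flatten` name and field lists are just from my sample above):

```python
def flatten(record):
    """Yield one flat dict per negotiated price, one row at a time."""
    # top-level fields repeated onto every row
    top = {k: record[k] for k in
           ("reporting_entity_name", "reporting_entity_type",
            "last_updated_on", "version")}
    for item in record["in_network"]:
        # per-procedure fields repeated onto each of its price rows
        mid = {k: item[k] for k in
               ("negotiation_arrangement", "name", "billing_code_type",
                "billing_code_type_version", "billing_code", "description")}
        for rate in item["negotiated_rates"]:
            for price in rate["negotiated_prices"]:
                yield {**top, **mid, **price}
```

Each yielded dict could then be streamed straight to disk with `csv.DictWriter` instead of being collected into a list.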

  • I am curious: how were you able to fit this single-line JSON data file into a df? What memory management did you use? Commented Jul 28, 2022 at 2:29
  • @Chetan_Vasudevan I'm fortunate to have a computer with a lot of processing power and 4TB memory, so I processed it in memory. But for more sustainability I'm migrating my code over to an Azure vm with even more memory. Commented Jul 29, 2022 at 17:42

1 Answer


First of all, I must confess, I'm the author of the library I'm going to use for this -- convtools (github).

Presuming your data is a list of samples like the one you provided:

input_data = [
    {
        "reporting_entity_name": "blue cross and blue shield of alabama",
        "reporting_entity_type": "health insurance issuer",
        "last_updated_on": "2022-06-10",
        "version": "1.1.0",
        "in_network": [
            {
                "negotiation_arrangement": "ffs",
                "name": "xploration of kidney",
                "billing_code_type": "cpt",
                "billing_code_type_version": "2022",
                "billing_code": "50010",
                "description": "renal exploration, not necessitating other specific procedures",
                "negotiated_rates": [
                    {
                        "negotiated_prices": [
                            {
                                "negotiated_type": "negotiated",
                                "negotiated_rate": 993.0,
                                "expiration_date": "2022-06-30",
                                "service_code": ["21", "22", "24"],
                                "billing_class": "professional",
                            },
                            {
                                "negotiated_type": "negotiated",
                                "negotiated_rate": 1180.68,
                                "expiration_date": "2022-06-30",
                                "service_code": ["21", "22", "24"],
                                "billing_class": "professional",
                            },
                        ]
                    }
                ],
            }
        ],
    }
]

I'm defining a config of sorts that you can tweak to add/remove the fields to be retrieved at each level. The function then recursively builds a converter, which is compiled into an ad hoc Python function that itself contains no recursion.

from convtools.contrib.tables import Table
from convtools import conversion as c

flattening_levels = [
    {
        "fields": ["reporting_entity_name", "reporting_entity_type"],
        "list_field": "in_network",
    },
    {
        "fields": ["negotiation_arrangement", "name"],
        "list_field": "negotiated_rates",
    },
    {"list_field": "negotiated_prices"},
    {
        "fields": ["negotiated_type", "negotiated_rate"],
    },
]

def build_conversion(levels, index=0):
    level_config = levels[index]

    # basic step is to iterate an input
    return c.iter(
        c.zip(
            # zip nested list
            (
                c.item(level_config["list_field"]).pipe(
                    build_conversion(levels, index + 1)
                )
                if "list_field" in level_config
                else c.naive([{}])
            ),
            # with repeated fields of the current level
            c.repeat(
                {
                    field: c.item(field)
                    for field in level_config.get("fields", ())
                }
            ),
        )
        # update inner objects with fields from the current level
        .iter_mut(c.Mut.custom(c.item(0).call_method("update", c.item(1))))
        # take inner objects, forgetting about the top level
        .iter(c.item(0))
    ).flatten()

flatten_objects = build_conversion(flattening_levels).gen_converter()

Table.from_rows(flatten_objects(input_data)).into_csv("out.csv")

out.csv looks like:

negotiated_type,negotiated_rate,negotiation_arrangement,name,reporting_entity_name,reporting_entity_type
negotiated,993.0,ffs,xploration of kidney,blue cross and blue shield of alabama,health insurance issuer
negotiated,1180.68,ffs,xploration of kidney,blue cross and blue shield of alabama,health insurance issuer
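For intuition, the zip/repeat step the converter compiles down to is roughly this plain-Python pattern (`merge_level` is a made-up name for illustration, not part of convtools):

```python
from itertools import repeat

def merge_level(children, parent, fields):
    """Push selected parent fields down into each child dict."""
    # zip the child dicts with a repeated projection of the parent's fields
    for child, extra in zip(children, repeat({f: parent[f] for f in fields})):
        child.update(extra)  # merge parent fields into the child row
        yield child

parent = {"name": "xploration of kidney", "negotiation_arrangement": "ffs"}
kids = [{"negotiated_rate": 993.0}, {"negotiated_rate": 1180.68}]
out = list(merge_level(kids, parent, ["name"]))
```

Applied level by level from the innermost list outward, this is what produces one flat row per `negotiated_prices` entry.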

Let's profile:

import tqdm

many_objects = input_data * 1000000

"""
In [151]: %time Table.from_rows(tqdm.tqdm(flatten_objects(many_objects))).into_csv("out.csv")

2000000it [00:07, 274094.52it/s]
CPU times: user 7.13 s, sys: 141 ms, total: 7.27 s
Wall time: 7.3 s
"""

# resulting file is ~210MB

This solution can be optimized so that dicts aren't updated multiple times. Please let me know if that's needed.

Added to address the missing-key issues: notice the added default arguments, which work just like dict.get(key, None):

def build_conversion(levels, index=0):
    level_config = levels[index]

    # basic step is to iterate an input
    return c.iter(
        c.zip(
            # zip nested list
            (
                c.item(level_config["list_field"], default=c([{}])).pipe(
                    build_conversion(levels, index + 1)
                )
                if "list_field" in level_config
                else c.naive([{}])
            ),
            # with repeated fields of the current level
            c.repeat(
                {
                    field: c.item(field, default=None)
                    for field in level_config.get("fields", ())
                }
            ),
        )
        # update inner objects with fields from the current level
        .iter_mut(c.Mut.custom(c.item(0).call_method("update", c.item(1))))
        # take inner objects, forgetting about the top level
        .iter(c.item(0))
    ).flatten()
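In plain-dict terms, the defaults give you this behavior:

```python
# A price record missing the 'negotiated_rate' key, as in the reported error
price = {"negotiated_type": "negotiated"}

# With defaults, a missing key yields None instead of raising KeyError;
# price.get(field) is the same as price.get(field, None)
row = {field: price.get(field)
       for field in ("negotiated_type", "negotiated_rate")}
```

So rows from incomplete records come through with empty cells rather than crashing the conversion.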


3 Comments

This is a fantastic solution, thank you!! It worked beautifully and helped this poor data wonk get past a seemingly trivial obstacle. THANK YOU!
sorry to revisit this, but I ran into a hiccup. Is there a way to modify the flattening_levels so it doesn't error out if one of the fields is missing? for example, if one of the nested json levels is missing the key 'negotiated_rate' right now, the Table.from_rows line throws an error. Thank you again!
@pbthehuman I've added an updated "build_conversion" method which includes default values for cases where keys are missing. Of course it is a bit slower, but just a bit - 7.86s vs 7.3s.
