
I am trying to take the following deeply nested JSON and eventually turn it into a CSV. Below is just a sample; the full JSON is huge (12 GB).

{'reporting_entity_name':'Blue Cross and Blue Shield of Alabama', 
'reporting_entity_type':'health insurance issuer', 
'last_updated_on':'2022-06-10',
'version':'1.1.0', 
'in_network':[
            {'negotiation_arrangement': 'ffs', 
            'name': 'xploration of Kidney', 
            'billing_code_type': 'CPT', 
            'billing_code_type_version': '2022', 
            'billing_code': '50010', 
            'description': 'Renal Exploration, Not Necessitating Other Specific Procedures', 
            'negotiated_rates': [{
                'negotiated_prices': [
                    {'negotiated_type': 'negotiated', 
                    'negotiated_rate': 993.0, 
                    'expiration_date': '2022-06-30', 
                    'service_code': ['21', '22', '24'], 
                    'billing_class': 'professional'}, 
                    {'negotiated_type': 'negotiated', 
                    'negotiated_rate': 1180.68, 
                    'expiration_date': '2022-06-30', 
                    'service_code': ['21', '22', '24'], 
                    'billing_class': 'professional'}, 
                    {'negotiated_type': 'negotiated', 
                    'negotiated_rate': 1283.95, 
                    'expiration_date': '2022-06-30', 
                    'service_code': ['21', '22', '24'], 
                    'billing_class': 'professional'}, 
                    {'negotiated_type': 'negotiated', 
                    'negotiated_rate': 1042.65, 
                    'expiration_date': '2022-06-30', 
                    'service_code': ['21', '22', '24'], 
                    'billing_class': 'professional'}, 
                    {'negotiated_type': 'negotiated', 
                    'negotiated_rate': 1290.9, 
                    'expiration_date': '2022-06-30', 
                    'service_code': ['21', '22', '24'], 
                    'billing_class': 'professional'}, 
                    {'negotiated_type': 'negotiated', 
                    'negotiated_rate': 1241.25, 
                    'expiration_date': '2022-06-30', 
                    'service_code': ['21', '22', '24'], 
                    'billing_class': 'professional'}
                    ]}]}]}

The end goal is to have a data frame, or dictionary, that I can then write to a CSV. I hope for each row to have the columns:

{'reporting_entity_name':'','reporting_entity_type':'','last_updated_on':'','version':'','negotiation_arrangement':'','name':'','billing_code_type':'','billing_code_type_version':'','billing_code':'','description':'','provider_groups':'','negotiated_type':'','negotiated_rate':'','expiration_date':'','service_code':'','billing_class':''}

So far I have tried pandas' json_normalize, flatten, and a few custom modules I found on GitHub, but none seem to normalize/flatten the data into new rows, only columns. Because this is such a huge dataset, I'm worried that doing this recursively in a bunch of nested loops will quickly eat up all my memory. Thanks in advance for any advice you can offer!
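For concreteness, the naive nested-loop version I was hoping to avoid looks like the sketch below. Written as a generator it at least yields one flat row at a time rather than materializing everything, though it still assumes the whole JSON object is already parsed (the `flatten` name and field lists are just from my sample above):

```python
def flatten(record):
    """Yield one flat dict per negotiated price, one row at a time."""
    # top-level fields repeated onto every row
    top = {k: record[k] for k in
           ("reporting_entity_name", "reporting_entity_type",
            "last_updated_on", "version")}
    for item in record["in_network"]:
        # per-procedure fields repeated onto each of its price rows
        mid = {k: item[k] for k in
               ("negotiation_arrangement", "name", "billing_code_type",
                "billing_code_type_version", "billing_code", "description")}
        for rate in item["negotiated_rates"]:
            for price in rate["negotiated_prices"]:
                yield {**top, **mid, **price}
```

Each yielded dict could then be streamed straight to disk with `csv.DictWriter` instead of being collected into a list.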

  • I am curious: how were you able to fit this single-line JSON data file into a df? What memory management did you use? Commented Jul 28, 2022 at 2:29
  • @Chetan_Vasudevan I'm fortunate to have a computer with a lot of processing power and 4TB memory, so I processed it in memory. But for more sustainability I'm migrating my code over to an Azure vm with even more memory. Commented Jul 29, 2022 at 17:42

1 Answer


First of all, I must confess, I'm the author of the library I'm going to use for this -- convtools (github).

Presuming your data is a list of samples like the one you provided:

input_data = [
    {
        "reporting_entity_name": "blue cross and blue shield of alabama",
        "reporting_entity_type": "health insurance issuer",
        "last_updated_on": "2022-06-10",
        "version": "1.1.0",
        "in_network": [
            {
                "negotiation_arrangement": "ffs",
                "name": "xploration of kidney",
                "billing_code_type": "cpt",
                "billing_code_type_version": "2022",
                "billing_code": "50010",
                "description": "renal exploration, not necessitating other specific procedures",
                "negotiated_rates": [
                    {
                        "negotiated_prices": [
                            {
                                "negotiated_type": "negotiated",
                                "negotiated_rate": 993.0,
                                "expiration_date": "2022-06-30",
                                "service_code": ["21", "22", "24"],
                                "billing_class": "professional",
                            },
                            {
                                "negotiated_type": "negotiated",
                                "negotiated_rate": 1180.68,
                                "expiration_date": "2022-06-30",
                                "service_code": ["21", "22", "24"],
                                "billing_class": "professional",
                            },
                        ]
                    }
                ],
            }
        ],
    }
]

I'm defining a config of sorts that you can tweak to add/remove the fields to be retrieved at each level. The function then recursively builds a converter, which is compiled into an ad hoc Python function that itself contains no recursion.

from convtools.contrib.tables import Table
from convtools import conversion as c

flattening_levels = [
    {
        "fields": ["reporting_entity_name", "reporting_entity_type"],
        "list_field": "in_network",
    },
    {
        "fields": ["negotiation_arrangement", "name"],
        "list_field": "negotiated_rates",
    },
    {"list_field": "negotiated_prices"},
    {
        "fields": ["negotiated_type", "negotiated_rate"],
    },
]

def build_conversion(levels, index=0):
    level_config = levels[index]

    # basic step is to iterate an input
    return c.iter(
        c.zip(
            # zip nested list
            (
                c.item(level_config["list_field"]).pipe(
                    build_conversion(levels, index + 1)
                )
                if "list_field" in level_config
                else c.naive([{}])
            ),
            # with repeated fields of the current level
            c.repeat(
                {
                    field: c.item(field)
                    for field in level_config.get("fields", ())
                }
            ),
        )
        # update inner objects with fields from the current level
        .iter_mut(c.Mut.custom(c.item(0).call_method("update", c.item(1))))
        # take inner objects, forgetting about the top level
        .iter(c.item(0))
    ).flatten()

flatten_objects = build_conversion(flattening_levels).gen_converter()

Table.from_rows(flatten_objects(input_data)).into_csv("out.csv")

out.csv looks like:

negotiated_type,negotiated_rate,negotiation_arrangement,name,reporting_entity_name,reporting_entity_type
negotiated,993.0,ffs,xploration of kidney,blue cross and blue shield of alabama,health insurance issuer
negotiated,1180.68,ffs,xploration of kidney,blue cross and blue shield of alabama,health insurance issuer
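For intuition, the zip/repeat step the converter compiles down to is roughly this plain-Python pattern (`merge_level` is a made-up name for illustration, not part of convtools):

```python
from itertools import repeat

def merge_level(children, parent, fields):
    """Push selected parent fields down into each child dict."""
    # zip the child dicts with a repeated projection of the parent's fields
    for child, extra in zip(children, repeat({f: parent[f] for f in fields})):
        child.update(extra)  # merge parent fields into the child row
        yield child

parent = {"name": "xploration of kidney", "negotiation_arrangement": "ffs"}
kids = [{"negotiated_rate": 993.0}, {"negotiated_rate": 1180.68}]
out = list(merge_level(kids, parent, ["name"]))
```

Applied level by level from the innermost list outward, this is what produces one flat row per `negotiated_prices` entry.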

Let's profile:

import tqdm

many_objects = input_data * 1000000

"""
In [151]: %time Table.from_rows(tqdm.tqdm(flatten_objects(many_objects))).into_csv("out.csv")

2000000it [00:07, 274094.52it/s]
CPU times: user 7.13 s, sys: 141 ms, total: 7.27 s
Wall time: 7.3 s
"""

# resulting file is ~210MB

This solution can be optimized so that dicts aren't updated multiple times. Please let me know if that's needed.

Added to address the missing-key issues: notice the added default arguments, which work just like dict.get(key, None):

def build_conversion(levels, index=0):
    level_config = levels[index]

    # basic step is to iterate an input
    return c.iter(
        c.zip(
            # zip nested list
            (
                c.item(level_config["list_field"], default=c([{}])).pipe(
                    build_conversion(levels, index + 1)
                )
                if "list_field" in level_config
                else c.naive([{}])
            ),
            # with repeated fields of the current level
            c.repeat(
                {
                    field: c.item(field, default=None)
                    for field in level_config.get("fields", ())
                }
            ),
        )
        # update inner objects with fields from the current level
        .iter_mut(c.Mut.custom(c.item(0).call_method("update", c.item(1))))
        # take inner objects, forgetting about the top level
        .iter(c.item(0))
    ).flatten()
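In plain-dict terms, the defaults give you this behavior:

```python
# A price record missing the 'negotiated_rate' key, as in the reported error
price = {"negotiated_type": "negotiated"}

# With defaults, a missing key yields None instead of raising KeyError;
# price.get(field) is the same as price.get(field, None)
row = {field: price.get(field)
       for field in ("negotiated_type", "negotiated_rate")}
```

So rows from incomplete records come through with empty cells rather than crashing the conversion.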


3 Comments

This is a fantastic solution, thank you!! It worked beautifully and helped this poor data wonk get past a seemingly trivial obstacle. THANK YOU!
sorry to revisit this, but I ran into a hiccup. Is there a way to modify the flattening_levels so it doesn't error out if one of the fields is missing? for example, if one of the nested json levels is missing the key 'negotiated_rate' right now, the Table.from_rows line throws an error. Thank you again!
@pbthehuman I've added an updated "build_conversion" method which includes default values for cases where keys are missing. Of course it is a bit slower, but just a bit - 7.86s vs 7.3s.
