
I have JSON in this format:

{
    "fields": {
        "tcidte": {
            "mode": "required",
            "type": "date",
            "format": "%Y%m%d"
        },
        "tcmcid": {
            "mode": "required",
            "type": "string"
        },
        "tcacbr": {
            "mode": "required",
            "type": "string"
        }
    }
}

I want it in a DataFrame where each of the three field names is a separate row. Where one row has a value for a column (e.g. "format") that the other rows lack, the missing entries should be NULL.

I have tried the flatten_json function that I found on here; it doesn't work as expected, but I'll include it anyway:

import pprint

import pandas as pd


def flatten_json(nested_json, exclude=('',)):
    """Flatten a JSON object with nested keys into a single level.

    Args:
        nested_json: A nested JSON object (dicts and lists).
        exclude: Keys to exclude from the output.
    Returns:
        The flattened dict.
    """
    out = {}

    def flatten(x, name=''):
        if isinstance(x, dict):
            for a in x:
                if a not in exclude:
                    flatten(x[a], name + a + '_')
        elif isinstance(x, list):
            for i, a in enumerate(x):
                flatten(a, name + str(i) + '_')
        else:
            out[name[:-1]] = x

    flatten(nested_json)
    return out


# Wrapping in a list avoids "If using all scalar values, you must pass
# an index", but the result is one wide row, not the three rows I want.
flatten_json_file = pd.DataFrame([flatten_json(nested_json)])
pprint.pprint(flatten_json_file)
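For reference, the function above collapses everything into a single flat dict with underscore-joined compound keys, which is why the DataFrame comes out as one wide row instead of three. A minimal demonstration (the dict literal below is exactly what flatten_json returns for the first JSON):

```python
import pandas as pd

# Output of flatten_json(data) for the first JSON above:
flat = {
    'fields_tcidte_mode': 'required',
    'fields_tcidte_type': 'date',
    'fields_tcidte_format': '%Y%m%d',
    'fields_tcmcid_mode': 'required',
    'fields_tcmcid_type': 'string',
    'fields_tcacbr_mode': 'required',
    'fields_tcacbr_type': 'string',
}

df = pd.DataFrame([flat])  # one row, seven compound-key columns
```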

Additional Complexity JSON:

{
    "fields": {
        "action": {
            "type": {
                "field_type": "string"
            },
            "mode": "required"
        },
        "upi": {
            "type": {
                "field_type": "string"
            },
            "regex": "^[0-9]{9}$",
            "mode": "required"
        },
        "firstname": {
            "type": {
                "field_type": "string"
            },
            "mode": "required"
        }
    }
}

3 Answers


With

data = {
    "fields": {
        "tcidte": {
            "mode": "required",
            "type": "date",
            "format": "%Y%m%d"
        },
        "tcmcid": {
            "mode": "required",
            "type": "string"
        },
        "tcacbr": {
            "mode": "required",
            "type": "string"
        }
    }
}

this

df = pd.DataFrame(data["fields"].values())

results in

       mode    type  format
0  required    date  %Y%m%d
1  required  string     NaN
2  required  string     NaN

Is that your goal?

If you want the keys of data["fields"] as index:

df = pd.DataFrame(data["fields"]).T

or

df = pd.DataFrame.from_dict(data["fields"], orient="index")

both result in

            mode    type  format
tcidte  required    date  %Y%m%d
tcmcid  required  string     NaN
tcacbr  required  string     NaN
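If the field name should be a regular column rather than the index, one more step gets you there (a small sketch; the column name "field" is my choice, not from the question):

```python
import pandas as pd

data = {
    "fields": {
        "tcidte": {"mode": "required", "type": "date", "format": "%Y%m%d"},
        "tcmcid": {"mode": "required", "type": "string"},
        "tcacbr": {"mode": "required", "type": "string"},
    }
}

df = pd.DataFrame.from_dict(data["fields"], orient="index")
# Promote the index (the field names) to an ordinary column
df = df.reset_index().rename(columns={"index": "field"})
# columns: field, mode, type, format; missing values are NaN
```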

With

data = {
    "fields": {
        "action": {
            "type": {
                "field_type": "string"
            },
            "mode": "required"
        },
        "upi": {
            "type": {
                "field_type": "string"
            },
            "regex": "^[0-9]{9}$",
            "mode": "required"
        },
        "firstname": {
            "type": {
                "field_type": "string"
            },
            "mode": "required"
        }
    }
}

you could either do

data = {key: {**d, **d["type"]} for key, d in data["fields"].items()}
df = pd.DataFrame.from_dict(data, orient="index").drop(columns="type")

or

df = pd.DataFrame.from_dict(data["fields"], orient="index")
df = pd.concat(
    [df, pd.DataFrame(df.type.to_list(), index=df.index)], axis=1
).drop(columns="type")

with a result like (column positions may differ)

               mode field_type       regex
action     required     string         NaN
upi        required     string  ^[0-9]{9}$
firstname  required     string         NaN
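As an aside, pandas' built-in pd.json_normalize (pandas 1.0+) can flatten the nested "type" dict in one call. A sketch: it produces a dot-separated column name, type.field_type, which you may want to rename.

```python
import pandas as pd

data = {
    "fields": {
        "action": {"type": {"field_type": "string"}, "mode": "required"},
        "upi": {"type": {"field_type": "string"}, "regex": "^[0-9]{9}$", "mode": "required"},
        "firstname": {"type": {"field_type": "string"}, "mode": "required"},
    }
}

# json_normalize flattens nested dicts into dot-separated column names
df = pd.json_normalize(list(data["fields"].values()))
df.index = list(data["fields"].keys())
df = df.rename(columns={"type.field_type": "field_type"})
```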

Comments

Cool solution, but there needs to be a column with the field name, e.g. "tcidte", rather than the numeric index.
@Elliot_J Do you mean that you want the data["fields"] keys as the index? I've just updated my answer to include that option.
Your solution is very good, nice and simple. However, can it handle the additional complexity of a further-nested dict? I have updated the original question with a new JSON.
@Elliot_J Not without further work. I've added 2 possibilities.
df = pd.read_json('test.json')  # test.json holds the first JSON from the question
df_fields = pd.DataFrame(df['fields'].values.tolist(), index=df.index)
print(df_fields)

output:

            mode    type  format
tcacbr  required  string     NaN
tcidte  required    date  %Y%m%d
tcmcid  required  string     NaN



One option is the jmespath library, which can be helpful in scenarios such as this:

# pip install jmespath
import jmespath
import pandas as pd

# think of it like a path:
# "fields" is the first key,
# there are sub-keys with varying names,
# and we are only interested in mode, type, format,
# hence the * to represent the intermediate key(s)
expression = jmespath.compile('fields.*[mode, type, format]')

pd.DataFrame(expression.search(data), columns=['mode', 'type', 'format'])

       mode    type  format
0  required    date  %Y%m%d
1  required  string    None
2  required  string    None

jmespath has a host of tools; this, however, should suffice, and it covers scenarios where keys (mode, type, format) are missing in sub-dictionaries.

Comments

I have run the pip install for jmespath, but I get the following error when I run expression = jmespath.compile('fields.*[mode, type, format]'): AttributeError: module 'jmespath' has no attribute 'compile'
I'm running on Python version 3.6
Try version 3.8
