
I have JSON in this format:

{
    "fields": {
        "tcidte": {
            "mode": "required",
            "type": "date",
            "format": "%Y%m%d"
        },
        "tcmcid": {
            "mode": "required",
            "type": "string"
        },
        "tcacbr": {
            "mode": "required",
            "type": "string"
        }
    }
}

I want it in a DataFrame where each of the three field names is a separate row. Where one row has a value for a column (e.g. "format") that the other rows lack, the missing entries should be NULL.

I have tried the flatten_json function that I found on here; it doesn't work as expected, but I'll include it anyway:

import pprint

import pandas as pd


def flatten_json(nested_json, exclude=('',)):
    """Flatten a JSON object with nested keys into a single level.

    Args:
        nested_json: A nested JSON object (dicts and lists).
        exclude: Keys to exclude from the output.
    Returns:
        The flattened dict.
    """
    out = {}

    def flatten(x, name=''):
        if isinstance(x, dict):
            for a in x:
                if a not in exclude:
                    flatten(x[a], name + a + '_')
        elif isinstance(x, list):
            for i, a in enumerate(x):
                flatten(a, name + str(i) + '_')
        else:
            out[name[:-1]] = x

    flatten(nested_json)
    return out


# Wrapping in a list avoids "If using all scalar values, you must pass
# an index", but the result is one wide row, not the three rows I want.
flatten_json_file = pd.DataFrame([flatten_json(nested_json)])
pprint.pprint(flatten_json_file)
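For reference, the function above collapses everything into a single flat dict with underscore-joined compound keys, which is why the DataFrame comes out as one wide row instead of three. A minimal demonstration (the dict literal below is exactly what flatten_json returns for the first JSON):

```python
import pandas as pd

# Output of flatten_json(data) for the first JSON above:
flat = {
    'fields_tcidte_mode': 'required',
    'fields_tcidte_type': 'date',
    'fields_tcidte_format': '%Y%m%d',
    'fields_tcmcid_mode': 'required',
    'fields_tcmcid_type': 'string',
    'fields_tcacbr_mode': 'required',
    'fields_tcacbr_type': 'string',
}

df = pd.DataFrame([flat])  # one row, seven compound-key columns
```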

Additional Complexity JSON:

{
    "fields": {
        "action": {
            "type": {
                "field_type": "string"
            },
            "mode": "required"
        },
        "upi": {
            "type": {
                "field_type": "string"
            },
            "regex": "^[0-9]{9}$",
            "mode": "required"
        },
        "firstname": {
            "type": {
                "field_type": "string"
            },
            "mode": "required"
        }
    }
}

3 Answers


With

data = {
    "fields": {
        "tcidte": {
            "mode": "required",
            "type": "date",
            "format": "%Y%m%d"
        },
        "tcmcid": {
            "mode": "required",
            "type": "string"
        },
        "tcacbr": {
            "mode": "required",
            "type": "string"
        }
    }
}

this

df = pd.DataFrame(data["fields"].values())

results in

       mode    type  format
0  required    date  %Y%m%d
1  required  string     NaN
2  required  string     NaN

Is that your goal?

If you want the keys of data["fields"] as index:

df = pd.DataFrame(data["fields"]).T

or

df = pd.DataFrame.from_dict(data["fields"], orient="index")

both result in

            mode    type  format
tcidte  required    date  %Y%m%d
tcmcid  required  string     NaN
tcacbr  required  string     NaN
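If the field name should be a regular column rather than the index, one more step gets you there (a small sketch; the column name "field" is my choice, not from the question):

```python
import pandas as pd

data = {
    "fields": {
        "tcidte": {"mode": "required", "type": "date", "format": "%Y%m%d"},
        "tcmcid": {"mode": "required", "type": "string"},
        "tcacbr": {"mode": "required", "type": "string"},
    }
}

df = pd.DataFrame.from_dict(data["fields"], orient="index")
# Promote the index (the field names) to an ordinary column
df = df.reset_index().rename(columns={"index": "field"})
# columns: field, mode, type, format; missing values are NaN
```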

With

data = {
    "fields": {
        "action": {
            "type": {
                "field_type": "string"
            },
            "mode": "required"
        },
        "upi": {
            "type": {
                "field_type": "string"
            },
            "regex": "^[0-9]{9}$",
            "mode": "required"
        },
        "firstname": {
            "type": {
                "field_type": "string"
            },
            "mode": "required"
        }
    }
}

you could either do

data = {key: {**d, **d["type"]} for key, d in data["fields"].items()}
df = pd.DataFrame.from_dict(data, orient="index").drop(columns="type")

or

df = pd.DataFrame.from_dict(data["fields"], orient="index")
df = pd.concat(
    [df, pd.DataFrame(df.type.to_list(), index=df.index)], axis=1
).drop(columns="type")

with a result like (column positions may differ)

               mode field_type       regex
action     required     string         NaN
upi        required     string  ^[0-9]{9}$
firstname  required     string         NaN
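As an aside, pandas' built-in pd.json_normalize (pandas 1.0+) can flatten the nested "type" dict in one call. A sketch: it produces a dot-separated column name, type.field_type, which you may want to rename.

```python
import pandas as pd

data = {
    "fields": {
        "action": {"type": {"field_type": "string"}, "mode": "required"},
        "upi": {"type": {"field_type": "string"}, "regex": "^[0-9]{9}$", "mode": "required"},
        "firstname": {"type": {"field_type": "string"}, "mode": "required"},
    }
}

# json_normalize flattens nested dicts into dot-separated column names
df = pd.json_normalize(list(data["fields"].values()))
df.index = list(data["fields"].keys())
df = df.rename(columns={"type.field_type": "field_type"})
```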

Comments

Cool solution, but there needs to be a column with the field name, e.g. "tcidte", rather than the numeric index.
@Elliot_J Do you mean that you want the data["fields"] keys as the index? I've just updated my answer to include that option.
Your solution is very good, nice and simple. However, can it handle the additional complexity of a further-nested dict? I have updated the original question with a new JSON.
@Elliot_J Not without further work. I've added 2 possibilities.
df = pd.read_json('test.json')  # test.json holds the first JSON from the question
df_fields = pd.DataFrame(df['fields'].values.tolist(), index=df.index)
print(df_fields)

output:

            mode    type  format
tcacbr  required  string     NaN
tcidte  required    date  %Y%m%d
tcmcid  required  string     NaN



One option is the jmespath library, which can be helpful in scenarios such as this:

# pip install jmespath
import jmespath
import pandas as pd

# think of it like a path:
# "fields" is the first key,
# there are sub-keys with varying names,
# and we are only interested in mode, type, format,
# hence the * to represent the intermediate key(s)
expression = jmespath.compile('fields.*[mode, type, format]')

pd.DataFrame(expression.search(data), columns=['mode', 'type', 'format'])

       mode    type  format
0  required    date  %Y%m%d
1  required  string    None
2  required  string    None

jmespath has a host of tools; this, however, should suffice, and it covers scenarios where keys (mode, type, format) are missing in sub-dictionaries.

Comments

I have run the pip install for jmespath, but I get the following error when I run expression = jmespath.compile('fields.*[mode, type, format]'): AttributeError: module 'jmespath' has no attribute 'compile'
I'm running on Python version 3.6
Try version 3.8
