
I have a Kafka stream to consume that carries information in JSON form.

I need to convert that JSON data into a Pandas dataframe to feed it into a data warehouse.

The problem is that the JSON data structure keeps changing depending on the event type.

Example:

The first event comes in with the structure as:

{
    "organization": "nation1",
    "job_id": 1,
    "job_name": "job1",
    "state": {
        "started": "no"
    },
    "timestamp": 1570357814930
}

Then another event comes in with this structure:

{
    "organization": "nation1",
    "job_id": 1,
    "job_name": "job1",
    "state": {
        "started": "yes",
        "attended": "yes"
    },
    "timestamp": 1570357814988
}

Notice the change in the state object above.

Assume that the lowest-level structure/hierarchy is not going to change; i.e., the state object can have at most the started and attended key-value pairs, but no more. As the first event shows, though, the state object may contain only started.

How can I make sure that I get a Pandas dataframe like the one below in such a scenario? Keep in mind that the actual JSON will have many such fields/maps with a dynamic structure like this.

[Image: the expected dataframe — columns organization, job_id, job_name, timestamp, state.started, state.attended, with missing state keys shown as NaN]


1 Answer


As @Chris A suggested, this can be solved using json_normalize. Try it like this:

import json
import pandas as pd

data = '''
[{
    "organization": "nation1",
    "job_id": 1,
    "job_name": "job1",
    "state": {
        "started": "no"
    },
    "timestamp": 1570357814930
},
{
    "organization": "nation1",
    "job_id": 1,
    "job_name": "job1",
    "state": {
        "started": "yes",
        "attended": "yes"
    },
    "timestamp": 1570357814988
}]
'''

pd.json_normalize(json.loads(data))

This gives you the dataframe below:

    organization  job_id  job_name      timestamp  state.started  state.attended
0        nation1       1      job1  1570357814930             no             NaN
1        nation1       1      job1  1570357814988            yes             yes

To add columns that are not present:

import json
import pandas as pd

data = '''
{
    "organization": "nation1",
    "job_id": 1,
    "job_name": "job1",
    "state": {
        "started": "no"
    },
    "timestamp": 1570357814930
}
'''
df = pd.json_normalize(json.loads(data))
expected_columns = {'state.started', 'state.attended'}
# Fill in any expected column this event did not carry
for column in expected_columns - set(df.columns):
    df[column] = 'null'
df
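Alternatively — a sketch assuming a pandas version that exposes pd.json_normalize — DataFrame.reindex adds every missing column in one call, filling them with NaN instead of the string 'null':

```python
import json
import pandas as pd

data = '''
{
    "organization": "nation1",
    "job_id": 1,
    "job_name": "job1",
    "state": {
        "started": "no"
    },
    "timestamp": 1570357814930
}
'''
# The full column list any event type may produce, defined once up front
expected_columns = ['organization', 'job_id', 'job_name',
                    'timestamp', 'state.started', 'state.attended']

df = pd.json_normalize(json.loads(data))
# reindex keeps existing columns and adds missing ones, filled with NaN
df = df.reindex(columns=expected_columns)
```

This also fixes the column order, which is convenient when appending rows to a warehouse table.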

2 Comments

This would work if the complete structure (all the columns) is available in at least one of the JSON messages within a single JSON file — for example, event 2 in my case contains all the columns that should come in. But since I am fetching these events from Kafka, I receive them independently: I get the first JSON and have to read it into pandas, then the second one arrives some time later. One option I could think of is to define all the columns that may come in, and then dump the incoming messages into that pandas DF?
In that case, you can add the columns that are not present in the current df with a default value of null. I'll update my answer with that change.
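For the streaming case described in the comment above, one option is to normalize each event against the same fixed column list and concatenate. This is only a sketch — the messages list stands in for whatever your Kafka consumer actually yields:

```python
import json
import pandas as pd

# All columns any event type may produce — define this once up front
EXPECTED = ['organization', 'job_id', 'job_name',
            'timestamp', 'state.started', 'state.attended']

# Stand-in for messages arriving one by one from the Kafka consumer
messages = [
    '{"organization": "nation1", "job_id": 1, "job_name": "job1",'
    ' "state": {"started": "no"}, "timestamp": 1570357814930}',
    '{"organization": "nation1", "job_id": 1, "job_name": "job1",'
    ' "state": {"started": "yes", "attended": "yes"},'
    ' "timestamp": 1570357814988}',
]

frames = []
for raw in messages:
    # Every event is forced into the same schema regardless of its shape;
    # keys absent from this event become NaN
    row = pd.json_normalize(json.loads(raw)).reindex(columns=EXPECTED)
    frames.append(row)

df = pd.concat(frames, ignore_index=True)
```

Because every per-event frame shares the same columns, the concatenated result matches the fixed schema no matter which event arrives first.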
