
I have a Kafka stream to consume that carries information in JSON form.

I need to convert that JSON data into a Pandas dataframe to feed it into a data warehouse.

The problem is that the JSON data structure keeps changing depending on the event type.

Example:

The first event comes in with the structure as:

{
    "organization": "nation1",
    "job_id": 1,
    "job_name": "job1",
    "state": {
        "started": "no"
    },
    "timestamp": 1570357814930
}

Then another event comes in with this structure:

{
    "organization": "nation1",
    "job_id": 1,
    "job_name": "job1",
    "state": {
        "started": "yes",
        "attended": "yes"
    },
    "timestamp": 1570357814988
}

Notice the change in the state object above.

Assume that the lowest-level structure/hierarchy is not going to change; i.e., the state object can have at most the started and attended key-value pairs, but no more. As the first event shows, though, the state object may contain only started.

How can I make sure that I get a Pandas dataframe like the one below in such a scenario? Keep in mind that the actual JSON will have many such fields/maps with a dynamic structure like this.

[Image: the expected dataframe — columns organization, job_id, job_name, timestamp, state.started, state.attended, with missing state keys shown as NaN]


1 Answer


As @Chris A suggested, this can be solved using json_normalize. Try it like this:

import json
import pandas as pd

data = '''
[{
    "organization": "nation1",
    "job_id": 1,
    "job_name": "job1",
    "state": {
        "started": "no"
    },
    "timestamp": 1570357814930
},
{
    "organization": "nation1",
    "job_id": 1,
    "job_name": "job1",
    "state": {
        "started": "yes",
        "attended": "yes"
    },
    "timestamp": 1570357814988
}]
'''

pd.json_normalize(json.loads(data))

This gives you the dataframe below:

    organization  job_id  job_name      timestamp  state.started  state.attended
0        nation1       1      job1  1570357814930             no             NaN
1        nation1       1      job1  1570357814988            yes             yes

To add columns that are not present:

import json
import pandas as pd

data = '''
{
    "organization": "nation1",
    "job_id": 1,
    "job_name": "job1",
    "state": {
        "started": "no"
    },
    "timestamp": 1570357814930
}
'''
df = pd.json_normalize(json.loads(data))
expected_columns = {'state.started', 'state.attended'}
# Fill in any expected column this event did not carry
for column in expected_columns - set(df.columns):
    df[column] = 'null'
df
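Alternatively — a sketch assuming a pandas version that exposes pd.json_normalize — DataFrame.reindex adds every missing column in one call, filling them with NaN instead of the string 'null':

```python
import json
import pandas as pd

data = '''
{
    "organization": "nation1",
    "job_id": 1,
    "job_name": "job1",
    "state": {
        "started": "no"
    },
    "timestamp": 1570357814930
}
'''
# The full column list any event type may produce, defined once up front
expected_columns = ['organization', 'job_id', 'job_name',
                    'timestamp', 'state.started', 'state.attended']

df = pd.json_normalize(json.loads(data))
# reindex keeps existing columns and adds missing ones, filled with NaN
df = df.reindex(columns=expected_columns)
```

This also fixes the column order, which is convenient when appending rows to a warehouse table.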

2 Comments

This would work if the complete structure (all the columns) is available in at least one of the JSON messages within a single JSON file — for example, event 2 in my case contains all the columns that should come in. But since I am fetching these events from Kafka, I receive them independently: I get the first JSON and have to read it into pandas, then the second one arrives some time later. One option I could think of is to define all the columns that may come in, and then dump the incoming messages into that pandas DF?
In that case, you can add the columns that are not present in the current df with a default value of null. I'll update my answer with that change.
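For the streaming case described in the comment above, one option is to normalize each event against the same fixed column list and concatenate. This is only a sketch — the messages list stands in for whatever your Kafka consumer actually yields:

```python
import json
import pandas as pd

# All columns any event type may produce — define this once up front
EXPECTED = ['organization', 'job_id', 'job_name',
            'timestamp', 'state.started', 'state.attended']

# Stand-in for messages arriving one by one from the Kafka consumer
messages = [
    '{"organization": "nation1", "job_id": 1, "job_name": "job1",'
    ' "state": {"started": "no"}, "timestamp": 1570357814930}',
    '{"organization": "nation1", "job_id": 1, "job_name": "job1",'
    ' "state": {"started": "yes", "attended": "yes"},'
    ' "timestamp": 1570357814988}',
]

frames = []
for raw in messages:
    # Every event is forced into the same schema regardless of its shape;
    # keys absent from this event become NaN
    row = pd.json_normalize(json.loads(raw)).reindex(columns=EXPECTED)
    frames.append(row)

df = pd.concat(frames, ignore_index=True)
```

Because every per-event frame shares the same columns, the concatenated result matches the fixed schema no matter which event arrives first.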
