1

I have a json object

sample_json ={
"workspaces": [
    {
        "wsid": "1",
        "wsname": "firstworkspace",
        "report":[{"reportname":"r1ws1"},{"reportname":"r2ws1"},{"reportname":"r1ws1"}]
    },
    {
        "wsid": "2",
        "wsname": "secondworkspace",
        "report":[{"reportname":"r1ws2"},{"reportname":"r2ws2"},{"reportname":"r1ws2"}]
    }
]}

I want to flatten/normalize the json so that it looks like this :

enter image description here

I can flatten each level using the pandas function "json_normalize" e.g Level 1

import pandas as pd
df1 = pd.json_normalize(sample_json, record_path=['workspaces'])
df1

enter image description here

e.g Level 2

import pandas as pd
df_api = pd.json_normalize(sample_json, record_path=['workspaces','report'])
df_api

enter image description here

I have read quiet a few articles such as this one but I am struggling to understand how to include information from the two levels (workspaces and reports) in a single dataframe. Any advice or help would be appreciated. Copy to sample data in a google colab notebook is here

1
  • 1
    this has nothing to do with JSON. There is no JSON involved, there is a dictioanry Commented Nov 11, 2022 at 19:25

2 Answers 2

1

The package jertl can be used for this.

In [1]: import jertl

In [2]: import pandas as pd

In [3]: sample_json = ... #removed for brevity

In [4]: pattern = '''
   ...: {
   ...: "workspaces": [*_,
   ...:     {
   ...:         "wsid":   wsid,
   ...:         "wsname": wsname,
   ...:         "report": [*_, {"reportname": reportname}, *_]
   ...:     },
   ...:     *_]
   ...: }
   ...: '''

In [5]: df = pd.DataFrame(match_.bindings for match_ in jertl.match_all(pattern, sample_json))

In [6]: df
Out[6]:
  wsid           wsname reportname
0    1   firstworkspace      r1ws1
1    1   firstworkspace      r2ws1
2    1   firstworkspace      r1ws1
3    2  secondworkspace      r1ws2
4    2  secondworkspace      r2ws2
5    2  secondworkspace      r1ws2

pattern defines the structure to look for. It is intentionally JSON like. wsid, wsname and reportname are variables. The *_s are splatted anonymous variables which match any number of list items.

Shameless self promotion alert. I am the author of jertl.

Sign up to request clarification or add additional context in comments.

Comments

0

So, just normalize this yourself.

df = pd.DataFrame(
    [
        {"wsid": ws['wsid'], "wsname": ws['wsname'], **rep}
        for ws in sample_json["workspaces"] for rep in ws['report']
    ]
)

1 Comment

Thanks @juanpa.arrivillaga Can you explain the mechanics of what is going on here?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.