Pandas: reading multi-index JSON as pandas data frame

Question

I have a JSON file as follows:

{
    "ALPHA": [
        {
            "date": "2021-06-22",
            "constituents": {
                "BBB": 0,
                "EEE": 1,
                "BTB": 1,
                "YUY": 1
            }
        },
        {
            "date": "2021-09-07",
            "constituents": {
                "BBB": 0,
                "EEE": 0,
                "BTB": 0,
                "YUY": 0
            }
        }
    ],
    "BETA": [
        {
            "date": "2021-06-22",
            "constituents": {
                "BBB": 1,
                "EEE": 1,
                "BTB": 1,
                "YUY": 1
            }
        },
        {
            "date": "2021-09-07",
            "constituents": {
                "BBB": 1,
                "EEE": 1,
                "BTB": 1,
                "YUY": 1
            }
        }
    ],

    "THETA": [
        {
            "date": "2021-06-22",
            "constituents": {
                "BBB": 0,
                "EEE": 1,
                "BTB": 1,
                "YUY": 0
            }
        },
        {
            "date": "2021-08-20",
            "constituents": {
                "BBB": 0,
                "EEE": 1,
                "BTB": 1,
                "YUY": 0
            }
        },
        {
            "date": "2021-09-07",
            "constituents": {
                "BBB": 0,
                "EEE": 1,
                "BTB": 1,
                "YUY": 0
            }
        }
    ]
}

I want to read the above into a pandas data frame where the first index is the date, the second index is the first keys (i.e. "ALPHA", "BETA", "THETA"), the columns are the inner keys (i.e. "BBB" ,"EEE", "BTB" ,"YUY"), and the cell values are the values of these inner keys.

How can I read that into pandas from the JSON file?

SeaBean · Accepted Answer · 2021-09-16 19:33:01Z

4

You can use pd.Series to import the JSON to a Pandas series with ALPHA, BETA as index and elements as list. Then expand the list of JSON to individual JSON by .explode(). Expand the inner JSON to dataframe by .apply() + pd.Series.

Append date as index by .set_index() with append=True; swap date from second index to first index by .swaplevel().

Finally, take the column constituents and further expand the inner JSON to dataframe by .apply() + pd.Series, as follows:

(assume you have already loaded the JSON file into j)

df = (pd.Series(j)
        .explode()
        .apply(pd.Series)
        .set_index('date', append=True)
        .swaplevel()['constituents']
        .apply(pd.Series)
     )

Data input:

j = {'ALPHA': [{'date': '2021-06-22',
   'constituents': {'BBB': 0, 'EEE': 1, 'BTB': 1, 'YUY': 1}},
  {'date': '2021-09-07',
   'constituents': {'BBB': 0, 'EEE': 0, 'BTB': 0, 'YUY': 0}}],
 'BETA': [{'date': '2021-06-22',
   'constituents': {'BBB': 1, 'EEE': 1, 'BTB': 1, 'YUY': 1}},
  {'date': '2021-09-07',
   'constituents': {'BBB': 1, 'EEE': 1, 'BTB': 1, 'YUY': 1}}],
 'THETA': [{'date': '2021-06-22',
   'constituents': {'BBB': 0, 'EEE': 1, 'BTB': 1, 'YUY': 0}},
  {'date': '2021-08-20',
   'constituents': {'BBB': 0, 'EEE': 1, 'BTB': 1, 'YUY': 0}},
  {'date': '2021-09-07',
   'constituents': {'BBB': 0, 'EEE': 1, 'BTB': 1, 'YUY': 0}}]}

Output

print(df)


                  BBB  EEE  BTB  YUY
date                                
2021-06-22 ALPHA    0    1    1    1
2021-09-07 ALPHA    0    0    0    0
2021-06-22 BETA     1    1    1    1
2021-09-07 BETA     1    1    1    1
2021-06-22 THETA    0    1    1    0
2021-08-20 THETA    0    1    1    0
2021-09-07 THETA    0    1    1    0

edited Sep 16, 2021 at 19:33

answered Sep 16, 2021 at 19:02

SeaBean

23.4k3 gold badges16 silver badges28 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

Muhammad Hassan Over a year ago

this is just perfect

Ali Crash Over a year ago

perfect answer, very useful tips , thanks!

SeaBean Over a year ago

@Muhammadhassan Thanks!

SeaBean Over a year ago

@AliCrash Thanks!

sammywemmy · Accepted Answer · 2021-09-17 05:25:02Z

1

I feel you get better performance, and potentially easier manipulation, if you deal with python native data structures outside Pandas, before pulling the final form into Pandas:

Let's flatten the nested dictionary into a single dictionary, using Python's tools:

container = []
for key, value in j.items(): # j is the main dictionary
    for entry in value:
        content = {'date': entry['date'], 
                   'key': key, 
                   # expand the nested constituent data
                   # this gets us a single dictionary 
                   **entry['constituents']}
        container.append(content)

print(container)
[{'date': '2021-06-22',
  'key': 'ALPHA',
  'BBB': 0,
  'EEE': 1,
  'BTB': 1,
  'YUY': 1},
 {'date': '2021-09-07',
  'key': 'ALPHA',
  'BBB': 0,
  'EEE': 0,
  'BTB': 0,
  'YUY': 0},
 {'date': '2021-06-22', 'key': 'BETA', 'BBB': 1, 'EEE': 1, 'BTB': 1, 'YUY': 1},
 {'date': '2021-09-07', 'key': 'BETA', 'BBB': 1, 'EEE': 1, 'BTB': 1, 'YUY': 1},
 {'date': '2021-06-22',
  'key': 'THETA',
  'BBB': 0,
  'EEE': 1,
  'BTB': 1,
  'YUY': 0},
 {'date': '2021-08-20',
  'key': 'THETA',
  'BBB': 0,
  'EEE': 1,
  'BTB': 1,
  'YUY': 0},
 {'date': '2021-09-07',
  'key': 'THETA',
  'BBB': 0,
  'EEE': 1,
  'BTB': 1,
  'YUY': 0}]

Now, build the dataframe, and set the required columns as index:

pd.DataFrame(container).set_index(['date', 'key'])

                  BBB  EEE  BTB  YUY
date       key
2021-06-22 ALPHA    0    1    1    1
2021-09-07 ALPHA    0    0    0    0
2021-06-22 BETA     1    1    1    1
2021-09-07 BETA     1    1    1    1
2021-06-22 THETA    0    1    1    0
2021-08-20 THETA    0    1    1    0
2021-09-07 THETA    0    1    1    0

edited Sep 17, 2021 at 5:25

answered Sep 16, 2021 at 23:50

sammywemmy

28.9k4 gold badges21 silver badges35 bronze badges

3 Comments

jezrael Over a year ago

Maybe is posiibble add timings? I think this is very fast, I like this solution.

sammywemmy Over a year ago

yes, it is faster; the data is a small size though, so not sure if speed tests is justified. And I don't know how to generate nested dictionaries that match the OP's form

jezrael Over a year ago

Ya, maybe in data sample should be ouput also interesting (but necessary state is necessary also test in real for OP)

Collectives™ on Stack Overflow

Pandas: reading multi-index JSON as pandas data frame

2 Answers 2

4 Comments

3 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

4 Comments

3 Comments

Your Answer

Sign up or log in

Post as a guest

Related