JSON to pandas dataframe with nested lists

Question

I run the following code, and it returns the JSON below.

import requests

url = "xxxx"
# headers and payload are defined elsewhere (not shown in the question)
r = requests.request("GET", url, headers=headers, data=payload)
j = r.json()

recs = j['collection']

JSON

{'stationCode': 'NB001',
 'summaries': [{'period': {'year': 2017}, 'rainfall': 449},
               {'period': {'year': 2018}, 'rainfall': 352.4},
               {'period': {'year': 2019}, 'rainfall': 253.2},
               {'period': {'year': 2020}, 'rainfall': 283},
               {'period': {'year': 2021}, 'rainfall': 104.2}]},
{'stationCode': 'NA003',
 'summaries': [{'period': {'year': 2019}, 'rainfall': 58.2},
               {'period': {'year': 2020}, 'rainfall': 628.2},
               {'period': {'year': 2021}, 'rainfall': 120}]}

I need this output flattened into a table.

I tried the following. I could still build the table by adding several more lines, but I wondered whether there is a faster way:

from pandas import json_normalize

df = json_normalize(recs)
df

The data shared is a tuple of dicts. If you could share the proper JSON form, that would be better.
– sammywemmy, Commented Mar 30, 2021 at 5:31
I see what you mean. Unless I give you the URL to scrape. Does equating it to a variable make it a tuple?
– wwnde, Commented Mar 30, 2021 at 5:38
The comma actually makes it a tuple. It looks like you copied a small part of it (which is fine, though).
– sammywemmy, Commented Mar 30, 2021 at 5:39
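
To illustrate the point made in these comments: the tuple comes from Python syntax rather than from the API, because two dict literals separated by a comma form a tuple even without parentheses. A minimal sketch, using a trimmed-down version of the data above (the variable name `pasted` is only for illustration):

```python
# Two dict literals separated by a comma are parsed as a tuple,
# exactly as if they were wrapped in parentheses.
pasted = {'stationCode': 'NB001'}, {'stationCode': 'NA003'}

print(type(pasted).__name__)                   # tuple
print([rec['stationCode'] for rec in pasted])  # ['NB001', 'NA003']
```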

sammywemmy · Accepted Answer · 2021-03-30 06:07:55Z

2

You could apply json_normalize to each entry in the tuple (the data you shared is a tuple of dicts) and concatenate the results:

import pandas as pd
from pandas import json_normalize
In [333]: pd.concat([json_normalize(entry, 'summaries', 'stationCode') 
                     for entry in recs])
Out[333]: 
   rainfall  period.year stationCode
0     449.0         2017       NB001
1     352.4         2018       NB001
2     253.2         2019       NB001
3     283.0         2020       NB001
4     104.2         2021       NB001
0      58.2         2019       NA003
1     628.2         2020       NA003
2     120.0         2021       NA003
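
A possible simplification, offered as a sketch rather than as part of the original answer: json_normalize also accepts a list of dicts, so the per-entry loop and pd.concat can probably be dropped. The `recs` literal below is retyped from the question so that the block runs on its own, and the tuple is converted to a list explicitly as a precaution.

```python
import pandas as pd

# Sample data retyped from the question (a tuple of dicts).
recs = (
    {'stationCode': 'NB001',
     'summaries': [{'period': {'year': 2017}, 'rainfall': 449},
                   {'period': {'year': 2018}, 'rainfall': 352.4},
                   {'period': {'year': 2019}, 'rainfall': 253.2},
                   {'period': {'year': 2020}, 'rainfall': 283},
                   {'period': {'year': 2021}, 'rainfall': 104.2}]},
    {'stationCode': 'NA003',
     'summaries': [{'period': {'year': 2019}, 'rainfall': 58.2},
                   {'period': {'year': 2020}, 'rainfall': 628.2},
                   {'period': {'year': 2021}, 'rainfall': 120}]},
)

# One call over the whole collection: each item of 'summaries' becomes a row,
# and 'stationCode' is repeated alongside it as metadata.
df = pd.json_normalize(list(recs), record_path='summaries', meta=['stationCode'])
print(df)
```

Unlike the concatenated version, this yields a single continuous index instead of one that restarts for each station.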

Alternative solution: I like jmespath, as it can be quite helpful for gnarly nested JSON. jmespath is something of a language of its own, with loads of functions, but the short story is this: if you are accessing a key, use a dot; if the value is a list, use the [] symbol:

import jmespath
import pandas as pd
expression = jmespath.compile("""{stationcode:stationCode, 
                                  year: summaries[].period.year, 
                                  rainfall: summaries[].rainfall}""")

outcome = [pd.DataFrame(expression.search(entry)) for entry in recs]
pd.concat(outcome)

  stationcode  year  rainfall
0       NB001  2017     449.0
1       NB001  2018     352.4
2       NB001  2019     253.2
3       NB001  2020     283.0
4       NB001  2021     104.2
0       NA003  2019      58.2
1       NA003  2020     628.2
2       NA003  2021     120.0

Just another tool for your arsenal, if json_normalize does not quite cut it. For raw speed, working with the built-in dict directly is king.
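
For the raw-speed route mentioned above, here is a minimal sketch in plain Python: a nested comprehension that builds one flat dict per station and year. The helper name `flatten_rows` and the shortened `sample` are hypothetical, trimmed from the question's data.

```python
def flatten_rows(records):
    """Return one flat dict per (station, summary) pair."""
    return [
        {
            'stationCode': rec['stationCode'],
            'year': summary['period']['year'],
            'rainfall': summary['rainfall'],
        }
        for rec in records
        for summary in rec['summaries']
    ]

# Shortened sample in the same shape as the question's data.
sample = (
    {'stationCode': 'NB001',
     'summaries': [{'period': {'year': 2017}, 'rainfall': 449},
                   {'period': {'year': 2018}, 'rainfall': 352.4}]},
    {'stationCode': 'NA003',
     'summaries': [{'period': {'year': 2019}, 'rainfall': 58.2}]},
)

rows = flatten_rows(sample)
print(rows[0])  # {'stationCode': 'NB001', 'year': 2017, 'rainfall': 449}
```

The resulting list of dicts can be passed straight to pd.DataFrame(rows).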

edited Mar 30, 2021 at 6:07

answered Mar 30, 2021 at 5:30

sammywemmy


13 Comments

wwnde Over a year ago

Good one @sammywemmy

Mayank Porwal Over a year ago

Nice answer, but this is slower than the solution that parses recs directly.

sammywemmy Over a year ago

Didn't bother to check speed; besides, json_normalize is more of a convenience feature. If you don't mind adding the timings to your solution, that would be great and a good learning point.

wwnde Over a year ago

@sammywemmy, happy to post a question. I just don't want to put it out there because answers can be quite low quality. If I have the following, how do I get each entity as a column? You seem to have a fair few tools in your arsenal.

wwnde Over a year ago

[{"id": "1617844808_49195291","trip_update": {"trip": {"trip_id": "49195291","direction_id": 0,"route_id": "100001","start_date": "20210407","schedule_relationship": "SCHEDULED"},"stop_time_update": [{"stop_sequence": 101,"stop_id": "2390","arrival": {"delay": -492,"time": 1617844044 },"departure": {"delay": -492,"time": 1617844044},"schedule_relationship": "SCHEDULED"},{"stop_sequence": 110,"stop_id": "2400","arrival": {"delay": -492, "time": 1617844093},"departure": {"delay": -492,"time": 1617844093},"schedule_relationship": "SCHEDULED"}]}}]
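
One possible approach to this follow-up, sketched in plain Python under the assumption that "each entity as a column" means one row per stop_time_update, with nested keys joined by dots. The helper names `flatten` and `trip_rows` are hypothetical, and the sample `feed` is shortened to one stop and a few fields.

```python
def flatten(d, prefix=''):
    """Flatten nested dicts into a single dict with dot-joined keys."""
    out = {}
    for key, value in d.items():
        name = f'{prefix}{key}'
        if isinstance(value, dict):
            out.update(flatten(value, prefix=f'{name}.'))
        else:
            out[name] = value
    return out

def trip_rows(entities):
    """Yield one flat row per stop_time_update, carrying id and trip fields."""
    for entity in entities:
        update = entity['trip_update']
        # Fields shared by every stop of this trip; the 'trip.' prefix avoids
        # clashes with stop-level keys such as schedule_relationship.
        shared = {'id': entity['id'], **flatten(update['trip'], prefix='trip.')}
        for stop in update['stop_time_update']:
            yield {**shared, **flatten(stop)}

# Shortened sample in the same shape as the data in the comment.
feed = [{
    "id": "1617844808_49195291",
    "trip_update": {
        "trip": {"trip_id": "49195291", "route_id": "100001"},
        "stop_time_update": [
            {"stop_sequence": 101, "stop_id": "2390",
             "arrival": {"delay": -492, "time": 1617844044}},
        ],
    },
}]

rows = list(trip_rows(feed))
print(rows[0])
```

As with the rainfall data, the rows can then be handed to pd.DataFrame(rows) to get one column per flattened key.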

Mayank Porwal · Answer · 2021-03-30 05:06:10Z

1

Use:

In [956]: def f2():
     ...:     df = pd.DataFrame(recs)
     ...:     # one row per entry of each station's 'summaries' list
     ...:     df = df.explode('summaries')
     ...:     df['year'] = df.summaries.str.get('period').str.get('year')
     ...:     df['rainfall'] = df.summaries.str.get('rainfall')
     ...:     # drop by keyword; a positional axis argument fails on recent pandas
     ...:     return df.drop(columns='summaries')
     ...: 

In [908]: f2()
Out[908]: 
  stationCode  year  rainfall
0       NB001  2017     449.0
0       NB001  2018     352.4
0       NB001  2019     253.2
0       NB001  2020     283.0
0       NB001  2021     104.2
1       NA003  2019      58.2
1       NA003  2020     628.2
1       NA003  2021     120.0

OR:

Parse recs directly with plain Python loops, which should be more efficient:

In [952]: def f1():
     ...:     s = []
     ...:     y = []
     ...:     r = []
     ...:     # walk every station, not only recs[0]
     ...:     for rec in recs:
     ...:         for i in rec['summaries']:
     ...:             s.append(rec['stationCode'])
     ...:             y.append(i['period']['year'])
     ...:             r.append(i['rainfall'])
     ...:     return pd.DataFrame({'stationCode': s, 'year': y, 'rainfall': r})

Timings for both:

In [955]: %timeit f1()
385 µs ± 20.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [969]: %timeit f2()
3.48 ms ± 224 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

edited Mar 30, 2021 at 5:06

answered Mar 30, 2021 at 4:36

Mayank Porwal

5 Comments

wwnde Over a year ago

+1. However, that's what I meant: I could get it done with a couple of additional lines. Is there any straight-up method?

Mayank Porwal Over a year ago

No straight-up method as such. One other way would be to parse recs into a straightforward dict first, instead of parsing columns in the df.

Mayank Porwal Over a year ago

@wwnde Please check my updated answer. I have added timings as well.

wwnde Over a year ago

Thanks @Mayank Porwal. Wemmy's answer is all-encompassing; it unpacks nested JSON lists as well. Thanks for your help.

Mayank Porwal Over a year ago

Sure @wwnde. No problem
