
I run the following code, and it returns the JSON below.

import requests
url="xxxx"
r = requests.request("GET", url, headers=headers, data=payload)
j=r.json()

recs = j['collection']

Json

{'stationCode': 'NB001',
       'summaries': [{'period': {'year': 2017}, 'rainfall': 449},
        {'period': {'year': 2018}, 'rainfall': 352.4},
        {'period': {'year': 2019}, 'rainfall': 253.2},
        {'period': {'year': 2020}, 'rainfall': 283},
        {'period': {'year': 2021}, 'rainfall': 104.2}]},{'stationCode': 'NA003','summaries': [{'period': {'year': 2019}, 'rainfall': 58.2},{'period': {'year': 2020}, 'rainfall': 628.2},{'period': {'year': 2021}, 'rainfall': 120}]}

I need this output flattened into a table like the following:

[image: desired output table]

I tried the following and could still get the table by adding several more lines, but I wondered whether there is a faster way:

df = json_normalize(recs)
df
  • the data shared is a tuple of dicts. if you could share the proper json form, that would be better Commented Mar 30, 2021 at 5:31
  • I see what you mean. Unless I give you the url to scrape. Equating it to a variable makes it a tuple? Commented Mar 30, 2021 at 5:38
  • the comma actually makes it a tuple. looks like you copied a small part of it (which is fine though) Commented Mar 30, 2021 at 5:39

2 Answers


You could iterate the json_normalize for each entry in the tuple (the data you shared is a tuple of dicts):

import pandas as pd
from pandas import json_normalize
In [333]: pd.concat([json_normalize(entry, 'summaries', 'stationCode')
                     for entry in recs])
Out[333]: 
   rainfall  period.year stationCode
0     449.0         2017       NB001
1     352.4         2018       NB001
2     253.2         2019       NB001
3     283.0         2020       NB001
4     104.2         2021       NB001
0      58.2         2019       NA003
1     628.2         2020       NA003
2     120.0         2021       NA003
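
If recs is first turned into a list, a single json_normalize call may be enough, without the concat. A minimal sketch, assuming pandas 1.0 or later (where pd.json_normalize is available at the top level); the sample data is copied from the question, and the rename and column order are only illustrative choices:

```python
import pandas as pd

# Sample data copied from the question (a tuple of dicts).
recs = (
    {"stationCode": "NB001",
     "summaries": [{"period": {"year": 2017}, "rainfall": 449},
                   {"period": {"year": 2018}, "rainfall": 352.4},
                   {"period": {"year": 2019}, "rainfall": 253.2},
                   {"period": {"year": 2020}, "rainfall": 283},
                   {"period": {"year": 2021}, "rainfall": 104.2}]},
    {"stationCode": "NA003",
     "summaries": [{"period": {"year": 2019}, "rainfall": 58.2},
                   {"period": {"year": 2020}, "rainfall": 628.2},
                   {"period": {"year": 2021}, "rainfall": 120}]},
)

# record_path names the nested list to unpack; meta names the parent key
# to repeat on every unpacked row.
df = pd.json_normalize(list(recs), record_path="summaries", meta="stationCode")

# Illustrative tidy-up: shorter column name and a fixed column order.
df = df.rename(columns={"period.year": "year"})[["stationCode", "year", "rainfall"]]
print(df)
```

Passing the whole list at once also gives a continuous index, so no reset_index is needed afterwards.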

Alternative solution: I like jmespath, which can be quite helpful for gnarly nested JSON. It is a small query language of its own with plenty of functions, but the short version is: use a dot to access a key, and [] to step into a list:

import jmespath
expression = jmespath.compile("""{stationcode:stationCode, 
                                  year: summaries[].period.year, 
                                  rainfall: summaries[].rainfall}""")

outcome = [pd.DataFrame(expression.search(entry)) for entry in recs]
pd.concat(outcome)

  stationcode  year  rainfall
0       NB001  2017     449.0
1       NB001  2018     352.4
2       NB001  2019     253.2
3       NB001  2020     283.0
4       NB001  2021     104.2
0       NA003  2019      58.2
1       NA003  2020     628.2
2       NA003  2021     120.0

Just another tool for your arsenal if json_normalize does not quite cut it. For raw speed, working directly with the built-in dict is king.


13 Comments

Good one @sammywemmy
Nice answer. But this is slower than the solution when parsing recs.
I didn't bother to check speed; besides, json_normalize is more of a convenience feature. If you don't mind adding the timings to your solution, that would be great and a good learning point.
@sammywemmy, happy to post a question. I just don't want to put it out there because answers can be quite low quality. If I have the following, how do I get each entity as a column? You seem to have a fair few tools in your arsenal.
[{"id": "1617844808_49195291","trip_update": {"trip": {"trip_id": "49195291","direction_id": 0,"route_id": "100001","start_date": "20210407","schedule_relationship": "SCHEDULED"},"stop_time_update": [{"stop_sequence": 101,"stop_id": "2390","arrival": {"delay": -492,"time": 1617844044 },"departure": {"delay": -492,"time": 1617844044},"schedule_relationship": "SCHEDULED"},{"stop_sequence": 110,"stop_id": "2400","arrival": {"delay": -492, "time": 1617844093},"departure": {"delay": -492,"time": 1617844093},"schedule_relationship": "SCHEDULED"}]}}]

Use:

In [956]: def f2():
     ...:     df = pd.DataFrame(recs)
     ...:     df = df.explode('summaries')  # one row per summary dict
     ...:     df['year'] = df.summaries.str.get('period').str.get('year')
     ...:     df['rainfall'] = df.summaries.str.get('rainfall')
     ...:     df = df.drop(columns='summaries')  # drop the exploded helper column
     ...:     return df
     ...: 

In [908]: f2()
Out[908]: 
  stationCode  year  rainfall
0       NB001  2017     449.0
0       NB001  2018     352.4
0       NB001  2019     253.2
0       NB001  2020     283.0
0       NB001  2021     104.2
1       NA003  2019      58.2
1       NA003  2020     628.2
1       NA003  2021     120.0
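
The repeated index values (0, 0, 0, ...) appear because explode keeps the original row labels. A small sketch, assuming pandas 1.1 or later for the ignore_index argument, that renumbers the rows and pulls the nested values out with map instead of the .str accessor; the sample data is copied from the question:

```python
import pandas as pd

# Sample data copied from the question.
recs = [
    {"stationCode": "NB001",
     "summaries": [{"period": {"year": 2017}, "rainfall": 449},
                   {"period": {"year": 2018}, "rainfall": 352.4},
                   {"period": {"year": 2019}, "rainfall": 253.2},
                   {"period": {"year": 2020}, "rainfall": 283},
                   {"period": {"year": 2021}, "rainfall": 104.2}]},
    {"stationCode": "NA003",
     "summaries": [{"period": {"year": 2019}, "rainfall": 58.2},
                   {"period": {"year": 2020}, "rainfall": 628.2},
                   {"period": {"year": 2021}, "rainfall": 120}]},
]

# One row per summary dict; ignore_index=True renumbers the rows 0..n-1.
df = pd.DataFrame(recs).explode("summaries", ignore_index=True)

# Each cell of 'summaries' is now a plain dict, so ordinary lookups work.
df["year"] = df["summaries"].map(lambda d: d["period"]["year"])
df["rainfall"] = df["summaries"].map(lambda d: d["rainfall"])
df = df.drop(columns="summaries")
print(df)
```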

OR:

Parse the recs dict separately, which should be more efficient:

In [952]: def f1():
     ...:     s = []
     ...:     y = []
     ...:     r = []
     ...:     # note: only the first record, recs[0], is parsed here
     ...:     for k, v in recs[0].items():
     ...:         if k == 'stationCode':
     ...:             s.append(v)
     ...:         else:  # the 'summaries' list
     ...:             for i in v:
     ...:                 y.append(i['period']['year'])
     ...:                 r.append(i['rainfall'])
     ...:     s = s * len(y)  # repeat the station code once per summary row
     ...:     df = pd.DataFrame({'stationCode': s, 'year': y, 'rainfall': r})
     ...:     return df

Timings for both:

In [955]: %timeit f1()
385 µs ± 20.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [969]: %timeit f2()
3.48 ms ± 224 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
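
Note that f1 reads only recs[0], so the two timings do not cover the same amount of data. A minimal sketch of the same plain-dict idea applied to every record, using only the standard library; the sample data is copied from the question, and the name rows is just an illustrative choice:

```python
# Sample data copied from the question (a tuple of dicts).
recs = (
    {"stationCode": "NB001",
     "summaries": [{"period": {"year": 2017}, "rainfall": 449},
                   {"period": {"year": 2018}, "rainfall": 352.4},
                   {"period": {"year": 2019}, "rainfall": 253.2},
                   {"period": {"year": 2020}, "rainfall": 283},
                   {"period": {"year": 2021}, "rainfall": 104.2}]},
    {"stationCode": "NA003",
     "summaries": [{"period": {"year": 2019}, "rainfall": 58.2},
                   {"period": {"year": 2020}, "rainfall": 628.2},
                   {"period": {"year": 2021}, "rainfall": 120}]},
)

# Build one flat dict per (station, summary) pair.
rows = [
    {
        "stationCode": rec["stationCode"],
        "year": summary["period"]["year"],
        "rainfall": summary["rainfall"],
    }
    for rec in recs
    for summary in rec["summaries"]
]

for row in rows:
    print(row)
```

If pandas is available, pd.DataFrame(rows) should then build the table in one step.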

5 Comments

+1. However, that's what I meant: I could get it done with a couple of additional lines. Is there any more direct method?
No direct method as such. One other way could be parsing recs to get a straightforward dict instead of parsing columns in df.
@wwnde Please check my updated answer. I have added timings as well.
Thanks @Mayank Porwal. Wemmy's answer is all-encompassing; it unpacks nested JSON lists as well. Thanks for your help.
Sure @wwnde. No problem
