
I have a large JSON file in the form of a list of lists. It contains airport codes and their mappings to city, country, latitude, longitude, etc. Here is a sample of what it looks like:

[["Goroka", "Goroka", "Papua New Guinea", "GKA", "AYGA", "-6.081689", "145.391881", "5282", "10", "U", "Pacific/Port_Moresby"], ["Asaba Intl", "Asaba", "Nigeria", "ABB", "DNAS", "6.2033333", "6.6588889", "0", "1", "U", "Africa/Lagos"], ["Downtown Airpark", "Oklahoma", "United States", "DWN", "", "35.4491997", "-97.5330963", "3240", "-6", "U", "America/Chicago"], ["Mbeya", "Mbeya", "Tanzania", "MBI", "HTMB", "-8.9169998", "33.4669991", "4921", "3", "U", "Africa/Dar_es_Salaam"], ["Tazadit", "Zouerate", "Mauritania", "OUZ", "GQPZ", "22.7563992", "-12.4835997", "", "0", "U", "Africa/Nouakchott"], ["Wadi Al-Dawasir", "Wadi al-Dawasir", "Saudi Arabia", "WAE", "OEWD", "20.5042992", "45.1996002", "10007", "3", "U", "Asia/Riyadh"], ["Madang", "Madang", "Papua New Guinea", "MAG", "AYMD", "-5.207083", "145.7887", "20", "10", "U", "Pacific/Port_Moresby"], ["Mount Hagen", "Mount Hagen", "Papua New Guinea", "HGU", "AYMH", "-5.826789", "144.295861", "5388", "10", "U", "Pacific/Port_Moresby"], ["Nadzab", "Nadzab", "Papua New Guinea", "LAE", "AYNZ", "-6.569828", "146.726242", "239", "10", "U", "Pacific/Port_Moresby"], ["Port Moresby Jacksons Intl", "Port Moresby", "Papua New Guinea", "POM", "AYPY", "-9.443383", "147.22005", "146", "10", "U", "Pacific/Port_Moresby"]

Each list is of the form:

['name', 'city', 'country', 'iata', 'icao', 'lat', 'lon', 'alt', 'tz', 'dst', 'tzdb']

I am concerned with the 'iata' and 'country' values of each list.

Given a string variable holding a particular iata code, I want to read this JSON file, find the list where that iata code appears, and fetch the corresponding 'country' value from it.

This file has most of the airport codes in the world, so while it is not tens of GB, it still contains a lot of lists.

I currently read the JSON in Python like this:

import json

with open('airport_list.json', 'r') as airport_list:
    airport_data = json.load(airport_list)
    

The problem is that this loads the whole JSON into memory. I could iterate over the file line by line instead, but then how would I match a string variable holding an iata code to a particular list in the JSON?

Is there a better, more efficient way to do this?

  • Are you doing a lot of lookups on the same data? (Jun 27, 2020)
  • Have a look into pandas. (Jun 27, 2020)

3 Answers


In order to find the list in this JSON that contains a specific 'iata', you can iterate through the file as text in byte chunks, parsing each chunk to see if it has what you need.

Unfortunately, if the 'iata' occurs near the end of the file, you'll still have to read your way through the whole thing, although it won't all be in memory at once.

If this is a lookup that you need to do many times, it would probably be worth it to generate a dict with the iatas as keys and countries as values. Because dictionaries are hash tables, this sort of lookup is very efficient, and you'd significantly decrease the file size by keeping only the two elements iata and country.
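For example, a one-time preprocessing pass along these lines would build the lookup (a minimal sketch; the output filename is illustrative and the column order is taken from the question):

import json

# One-time preprocessing: load the big file once, keep only iata -> country.
with open('airport_list.json') as f:
    airports = json.load(f)

# Column order from the question: country is index 2, iata is index 3.
iata_to_country = {row[3]: row[2] for row in airports if row[3]}

with open('iata_country.json', 'w') as f:
    json.dump(iata_to_country, f)

# Every later lookup loads only the small mapping:
with open('iata_country.json') as f:
    lookup = json.load(f)

print(lookup.get('LAE'))  # 'Papua New Guinea'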

Nevertheless, if I haven't dissuaded you from this course, here are functions that parse this JSON as a text file in chunks and return the country for a given iata, assuming that iata codes are unique.

def read_in_chunks(file_object, chunk_size):
    """Yield successive fixed-size chunks of the file until EOF."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data


def parse_chunk(chunk, iata):
    """Search one chunk for the quoted iata and return the element before it.

    Returns the country if it is intact, "fragment" if the match was cut
    by the chunk boundary, and None if the iata is not in this chunk.
    """
    if iata not in chunk:
        return None
    pieces = [x.strip() for x in chunk.split(',')]
    quoted = f'"{iata}"'
    if quoted in pieces and pieces.index(quoted) > 0:
        # The country is the element immediately before the iata.
        prev = pieces[pieces.index(quoted) - 1]
        if prev.startswith('"'):
            return prev.replace('"', '')
    # Either the iata or its preceding element was split by the boundary.
    return "fragment"


def country_from_iata(iata, filename='example.json', chunk_size=64):
    offset = 0
    parsed = None

    # First pass: scan the file chunk by chunk, tracking the offset.
    with open(filename, 'rt') as f:
        for chunk in read_in_chunks(f, chunk_size):
            parsed = parse_chunk(chunk, iata)
            if parsed:
                break
            offset += chunk_size

    # If the match was split across a chunk boundary, re-read from half a
    # chunk earlier so the iata and its country land in the same chunk.
    if parsed == "fragment":
        with open(filename, 'rt') as f:
            f.seek(max(0, offset - chunk_size // 2))
            for chunk in read_in_chunks(f, chunk_size):
                parsed = parse_chunk(chunk, iata)
                if parsed and parsed != "fragment":
                    break

    return parsed


country_from_iata("LAE")  # 'Papua New Guinea'



If the objective is to avoid loading the whole file into memory, then it can be done in one of the following ways:

  1. Use ijson, which is "an iterative JSON parser with standard Python iterator interfaces" (see the ijson sketch below the list).

  2. Dump the JSON file into a document DB and then read from it. You could use TinyDB for that (see the TinyDB sketch below the list).

  3. Or you could read and process it in chunks, something like this:

from functools import partial

def custom_operation(text):
    """Split off complete records: find the last '],', parse everything up
    to it, and return the text after it as the residual for the next block."""
    cut = text.rfind('],')
    if cut == -1:
        return [], text
    complete, residual = text[:cut + 1], text[cut + 2:]
    matches = []  # TODO: parse `complete` for the records you need
    return matches, residual

def readfile(filename):
    with open(filename, 'r') as fh:
        # Read the file in 1 MB blocks instead of all at once.
        filepart = partial(fh.read, 1024 * 1024)
        iterator = iter(filepart, '')  # '' signals EOF in text mode

        residual = ''
        for block in iterator:
            # Prepend the leftover from the previous block so records
            # split across block boundaries are reassembled.
            matches, residual = custom_operation(residual + block)
            yield matches
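For option 1, a minimal sketch with ijson might look like this (assuming the file is one top-level JSON array, with the column order from the question):

import ijson

def country_from_iata(iata, filename='airport_list.json'):
    # ijson streams the top-level array one record at a time, so only a
    # single record is held in memory; 'item' is ijson's prefix for
    # elements of a top-level array.
    with open(filename, 'rb') as f:
        for record in ijson.items(f, 'item'):
            if record[3] == iata:  # ['name', 'city', 'country', 'iata', ...]
                return record[2]
    return None

print(country_from_iata('LAE'))  # 'Papua New Guinea'

For option 2, here is a sketch of the TinyDB route. The one-time dump still reads the source file once, and keeping only the two needed fields keeps the database small, which matters because TinyDB's default storage re-reads the whole database file on access:

import json
from tinydb import TinyDB, Query

# One-time dump: store only the fields needed for lookups.
with open('airport_list.json') as f:
    airports = json.load(f)

db = TinyDB('airports_db.json')
db.insert_multiple({'iata': row[3], 'country': row[2]} for row in airports)

# Later lookups query the database instead of the raw JSON.
Airport = Query()
print(db.search(Airport.iata == 'LAE'))
# [{'iata': 'LAE', 'country': 'Papua New Guinea'}]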

Hope that helps!



I would personally recommend the library pandas for this kind of task. It has a built-in function for reading JSON (read_json) and tends to be more efficient than the standard library JSON offerings. Moreover, you can customize it pretty heavily for your exact use case.

Here is a reference to the Pandas read_json function: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html.
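A minimal sketch of that route (the column names are assumed from the question's description; pd.read_json turns the top-level list of lists into rows):

import pandas as pd

# Column names assumed from the question's description of each list.
cols = ['name', 'city', 'country', 'iata', 'icao',
        'lat', 'lon', 'alt', 'tz', 'dst', 'tzdb']

df = pd.read_json('airport_list.json')
df.columns = cols

match = df.loc[df['iata'] == 'LAE', 'country']
print(match.iloc[0] if not match.empty else None)  # 'Papua New Guinea'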

Comments

  • But even pandas is in memory; it will store all the data in memory.
  • True, but it will do so in a much more compressed form.
