Optimize parsing file with JSON objects in pandas dataframe, where keys may be missing in some rows

Question

I'm looking to optimize the code below which takes ~5 seconds, which is too slow for a file of only 1000 lines.

I have a large file where each line contains valid JSON, with each JSON looking like the following (the actual data is much larger and nested, so I use this JSON snippet for illustration):

  {"location":{"town":"Rome","groupe":"Advanced",
    "school":{"SchoolGroupe":"TrowMet", "SchoolName":"VeronM"}},
    "id":"145",
    "Mother":{"MotherName":"Helen","MotherAge":"46"},"NGlobalNote":2,
    "Father":{"FatherName":"Peter","FatherAge":"51"},
    "Teacher":["MrCrock","MrDaniel"],"Field":"Marketing",
     "season":["summer","spring"]}

I need to parse this file in order to extract only some key-values from every JSON, to obtain the resulting dataframe:

Groupe      Id   MotherName   FatherName
Advanced    56   Laure         James
Middle      11   Ann           Nicolas
Advanced    6    Helen         Franc

But some keys I need in the dataframe, are missing in some JSON objects, so I should to verify if the key is present, and if not, fill the corresponding value with Null. I use with the following method:

df = pd.DataFrame(columns=['group', 'id', 'Father', 'Mother'])
with open (path/to/file) as f:
    for chunk in f:
        jfile = json.loads(chunk)

        if 'groupe' in jfile['location']:
            groupe = jfile['location']['groupe']
        else:
            groupe=np.nan

        if 'id' in jfile:
            id = jfile['id']
        else:
            id = np.nan

        if 'MotherName' in jfile['Mother']:
            MotherName = jfile['Mother']['MotherName']
        else:
            MotherName = np.nan

        if 'FatherName' in jfile['Father']:
            FatherName = jfile['Father']['FatherName']
        else: 
            FatherName = np.nan

        df = df.append({"groupe":group, "id":id, "MotherName":MotherName, "FatherName":FatherName},
            ignore_index=True)

I need to optimize the runtime over the whole 1000-row file to <= 2 seconds. In PERL the same parsing function takes < 1 second, but I need to implement it in Python.

You should use Python's dict.get(key, default) method which optionally takes a default value for when key is not found. That also makes your inner-loop code 4x more compact and legible. But you could probably use dict.update or defaultdict to reduce even further. — smci
– smci, Commented May 25, 2019 at 0:27

tdelaney · Accepted Answer · 2016-02-26 07:21:05Z

2

You'll get the best performance if you can build the dataframe in a single step during initialization. DataFrame.from_record takes a sequence of tuples which you can supply from a generator that reads one record at a time. You can parse the data faster with get, which will supply a default parameter when the item isn't found. I created an empty dict called dummy to pass for intermediate gets so that you know a chained get will work.

I created a 1000 record dataset and on my crappy laptop the time went from 18 seconds to .06 seconds. Thats pretty good.

import numpy as np
import pandas as pd
import json
import time

def extract_data(data):
    """ convert 1 json dict to records for import"""
    dummy = {}
    jfile = json.loads(data.strip())
    return (
        jfile.get('location', dummy).get('groupe', np.nan), 
        jfile.get('id', np.nan),
        jfile.get('Mother', dummy).get('MotherName', np.nan),
        jfile.get('Father', dummy).get('FatherName', np.nan))

start = time.time()
df = pd.DataFrame.from_records(map(extract_data, open('file.json')),
    columns=['group', 'id', 'Father', 'Mother'])
print('New algorithm', time.time()-start)

#
# The original way
#

start= time.time()
df=pd.DataFrame(columns=['group', 'id', 'Father', 'Mother'])
with open ('file.json') as f:
      for chunk in f:
           jfile=json.loads(chunk)
           if 'groupe' in jfile['location']:
               groupe=jfile['location']['groupe']
           else:
               groupe=np.nan
           if 'id' in jfile:
                id=jfile['id']
           else:
                id=np.nan
           if 'MotherName' in jfile['Mother']:
                MotherName=jfile['Mother']['MotherName']
           else:
                MotherName=np.nan
           if 'FatherName' in jfile['Father']:
                FatherName=jfile['Father']['FatherName']
           else: 
                FatherName=np.nan
           df = df.append({"groupe":groupe,"id":id,"MotherName":MotherName,"FatherName":FatherName},
            ignore_index=True)
print('original', time.time()-start)

answered Feb 26, 2016 at 7:21

tdelaney

78k6 gold badges91 silver badges129 bronze badges

Sign up to request clarification or add additional context in comments.

5 Comments

Amanda Over a year ago

I have AttributeError: 'list' object has no attribute 'get' with this method! Don't forget that I have a file with json oon each line, maybe this is a problem. So I need iterate over the lines to parse each json

Amanda Over a year ago

so the whole file is not json itself, but every line of this file is a valid json

Amanda Over a year ago

it works except of the case when instead of dictionary there is a nested json! how to use .get method in this case? @tdelaney

tdelaney Over a year ago

I'm not sure what "nested json" means. Is it a single json-encoded string? Perhaps you could decode it and replace the string with the decoded structure.

smci Over a year ago

Amanda please edit your question to add example data for those corner-cases. Parsing issues in JSON are a real pain to reproduce... ;-)

Alexander · Accepted Answer · 2016-02-26 06:39:58Z

1

The key part is not to append each row to the dataframe in the loop. You want to keep the collection in a list or dict container and then concatenate all of them at once. You can also simplify your if/else structure with a simple get that returns a default value (e.g. np.nan) if the item is not found in the dictionary.

with open (path/to/file) as f:
    d = {'group': [], 'id': [], 'Father': [], 'Mother': []}
    for chunk in f:
        jfile = json.loads(chunk)
        d['groupe'].append(jfile['location'].get('groupe', np.nan))
        d['id'].append(jfile.get('id', np.nan))
        d['MotherName'].append(jfile['Mother'].get('MotherName', np.nan))
        d['FatherName'].append(jfile['Father'].get('FatherName', np.nan))

    df = pd.DataFrame(d)

answered Feb 26, 2016 at 6:39

Alexander

111k32 gold badges212 silver badges208 bronze badges

2 Comments

Amanda Over a year ago

Your answer is good n but there is an error TypeError: list indices must be integers, not str while converting dictionary into pandas dataframe

Alexander Over a year ago

Sounds like there might be a problem with the data. Try to make a DataFrame from each column and see if you can isolate the issue.

Collectives™ on Stack Overflow

Optimize parsing file with JSON objects in pandas dataframe, where keys may be missing in some rows

2 Answers 2

5 Comments

2 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

5 Comments

2 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related