I'm looking to optimize the code below, which takes ~5 seconds. That is too slow for a file of only 1000 lines.
I have a large file where each line contains valid JSON, with each JSON looking like the following (the actual data is much larger and nested, so I use this JSON snippet for illustration):
{"location":{"town":"Rome","groupe":"Advanced",
"school":{"SchoolGroupe":"TrowMet", "SchoolName":"VeronM"}},
"id":"145",
"Mother":{"MotherName":"Helen","MotherAge":"46"},"NGlobalNote":2,
"Father":{"FatherName":"Peter","FatherAge":"51"},
"Teacher":["MrCrock","MrDaniel"],"Field":"Marketing",
"season":["summer","spring"]}
I need to parse this file in order to extract only some key-values from every JSON, to obtain the resulting dataframe:
Groupe    Id  MotherName  FatherName
Advanced  56  Laure       James
Middle    11  Ann         Nicolas
Advanced  6   Helen       Franc
But some of the keys I need in the dataframe are missing from some JSON objects, so I have to check whether each key is present and, if not, fill the corresponding value with null. I use the following method:
import json
import numpy as np
import pandas as pd

df = pd.DataFrame(columns=['groupe', 'id', 'MotherName', 'FatherName'])
with open('path/to/file') as f:
    for chunk in f:  # each line holds one JSON object
        jfile = json.loads(chunk)
        if 'groupe' in jfile['location']:
            groupe = jfile['location']['groupe']
        else:
            groupe = np.nan
        if 'id' in jfile:
            id = jfile['id']
        else:
            id = np.nan
        if 'MotherName' in jfile['Mother']:
            MotherName = jfile['Mother']['MotherName']
        else:
            MotherName = np.nan
        if 'FatherName' in jfile['Father']:
            FatherName = jfile['Father']['FatherName']
        else:
            FatherName = np.nan
        # one row is appended to the dataframe per line of the file
        df = df.append({"groupe": groupe, "id": id, "MotherName": MotherName, "FatherName": FatherName},
                       ignore_index=True)
I need to bring the runtime over the whole 1000-row file down to <= 2 seconds. In Perl the same parsing function takes < 1 second, but I need to implement it in Python.
Use the dict.get(key, default) method, which optionally takes a default value for when key is not found. That also makes your inner-loop code 4x more compact and legible. But you could probably use dict.update or defaultdict to reduce it even further.
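As a rough sketch of the dict.get idea, something like the following might work. The helper names extract_row and parse_lines are made up for illustration, and None stands in for the missing-value marker (an assumption; np.nan could be passed as the default instead). Collecting the rows in a plain list rather than appending to the dataframe on every iteration is an extra change beyond dict.get, and it is probably where most of the time goes.

```python
import json

WANTED = ("groupe", "id", "MotherName", "FatherName")


def extract_row(record):
    """Pull the four wanted fields out of one parsed JSON object."""
    # .get(key, {}) also tolerates a missing parent object, which the
    # membership tests in the question did not check for.
    return {
        "groupe": record.get("location", {}).get("groupe"),
        "id": record.get("id"),
        "MotherName": record.get("Mother", {}).get("MotherName"),
        "FatherName": record.get("Father", {}).get("FatherName"),
    }


def parse_lines(lines):
    """Parse one JSON object per non-empty line into a list of row dicts."""
    return [extract_row(json.loads(line)) for line in lines if line.strip()]


# Small in-memory stand-in for the real file (hypothetical sample data)
sample = [
    '{"location": {"town": "Rome", "groupe": "Advanced"}, "id": "145", '
    '"Mother": {"MotherName": "Helen"}, "Father": {"FatherName": "Peter"}}',
    '{"location": {"town": "Oslo"}, "Mother": {}, "Father": {"FatherName": "James"}}',
]
rows = parse_lines(sample)
for row in rows:
    print([row[key] for key in WANTED])
```

With a real file you would pass the open file object to parse_lines and then build the dataframe once at the end, e.g. pd.DataFrame(rows), instead of growing it row by row.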