Iterating over JSON more effectively

Question

Problem: I need to get data from a JSON file containing information about "contributors". Each contributor has an attribute jobs, which is a list of job-position strings. The program should print the five most popular jobs (in the whole dataset) and assign to each contributor an attribute top_job: whichever of that contributor's jobs is the most frequent in the whole dataset. I need to use as few extra libraries (other than json) as possible. I would greatly appreciate any suggestions on how to improve the program's efficiency. Thanks in advance!

Sample input:

[{'username': 'bartonmichelle',
  ...
  'jobs': ['Teacher, special educational needs',
   'Water engineer',
   'Intelligence analyst',
   'Automotive engineer',
   'Geoscientist'],
  'id': 173012},
 {'username': 'ahardin',
  ...
  'jobs': ['Water engineer',
   'Private music teacher',
   'Administrator',
   'Television camera operator'],
  'id': 113928}]

Sample output:

[{'username': 'bartonmichelle',
  ...
  'jobs': ['Teacher, special educational needs',
   'Water engineer',
   'Intelligence analyst',
   'Automotive engineer',
   'Geoscientist'],
  'id': 173012,
  'top_job': 'Water engineer'}, # top job added based on job's frequency
 {'username': 'ahardin',
  ...
  'jobs': ['Water engineer',
   'Private music teacher',
   'Administrator',
   'Television camera operator'],
  'id': 113928,
  'top_job': 'Water engineer'}] # top job added based on job's frequency

My approach:

import json
from collections import Counter

jobs = []
with open('contributors_sample.json','r',encoding="utf-8") as f:
  contributors_file = json.load(f)
  for contributor in contributors_file:
    jobs.extend(contributor['jobs'])

sorted_jobs = list(map(
    lambda sorted_arg: sorted_arg[0],
    sorted(
      Counter(jobs).items(),
      key=lambda tupleobj: tupleobj[1],
      reverse=True
    )
))

for contributor in contributors_file:
  contributors_jobs = contributor['jobs']
  top_job = contributors_jobs[0]
  for job in contributors_jobs[1:]:
    if sorted_jobs.index(job) < sorted_jobs.index(top_job):
      top_job = job
  contributor['top_job'] = top_job

contributors_file

Current execution time: 0.110848s

The data pipeline itself seems suspect. Why are you adding new elements to an existing serialised format? Where do the data come from, and where are they going?
– Reinderien, Commented Sep 24, 2022 at 14:24
Does the program actually print five most popular jobs (in the whole dataset)? Sure seems like it doesn't.
– Reinderien, Commented Sep 24, 2022 at 14:43

Reinderien · Accepted Answer · 2022-09-24 15:26:26Z

This is much more complicated than it needs to be. You extend a list, then traverse it to construct a counter, then traverse again to get a sorted sequence, then traverse again to get a list of keys only. All of this needs to go away, especially the lambda/map style, which is better expressed with comprehensions.
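To illustrate the point about comprehensions, here is a minimal sketch comparing the two styles. The job list is made-up sample data, and the names ranked, via_map and via_comprehension are hypothetical.

```python
from collections import Counter

# Made-up sample data, not taken from the question's file
jobs = ['Water engineer', 'Administrator', 'Water engineer', 'Geoscientist']

# (job, count) pairs ordered by descending count, as in the question
ranked = sorted(Counter(jobs).items(), key=lambda pair: pair[1], reverse=True)

# map/lambda style used in the question
via_map = list(map(lambda pair: pair[0], ranked))

# Equivalent comprehension: unpacking names both halves of each pair
via_comprehension = [job for job, count in ranked]

print(via_comprehension)
```

Both expressions build the same list; the comprehension just avoids the index juggling.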

Instead, work with your Counter as a first-class citizen. Don't juggle indices. And spend some quality time reading the Counter documentation.
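As a rough sketch of what treating the Counter as a first-class citizen can look like, the snippet below feeds it one list at a time and queries it directly. The job lists are made-up sample data.

```python
from collections import Counter

counts = Counter()

# update() accepts any iterable and increments the count of each element
counts.update(['Water engineer', 'Administrator'])
counts.update(['Water engineer', 'Geoscientist'])

# most_common(n) returns the n highest (element, count) pairs, so no manual sort is needed
print(counts.most_common(2))

# Looking up a job that was never seen returns 0 instead of raising KeyError
print(counts['Astronaut'])
```

Because lookups by key are direct, there is no need to keep a separate sorted list and search it with index().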

Suggested

import json
from collections import Counter
from pprint import pprint

with open('contributors_sample.json') as f:
    contributors_file = json.load(f)

jobs = Counter()
for contributor in contributors_file:
    jobs.update(contributor['jobs'])

print('Top jobs:', jobs.most_common(5))

for contributor in contributors_file:
    top_freq, contributor['top_job'] = max(
        (jobs[job], job)
        for job in contributor['jobs']
    )

pprint(contributors_file)
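To try the same logic without the JSON file, one option is to wrap it in a function and call it on in-memory data. This is only a sketch: add_top_jobs is a hypothetical helper name, and the records below are the two from the question's sample input with the elided fields simply left out.

```python
from collections import Counter

def add_top_jobs(contributors):
    """Hypothetical helper: count jobs over all contributors, then tag each one in place."""
    jobs = Counter()
    for contributor in contributors:
        jobs.update(contributor['jobs'])

    for contributor in contributors:
        # max over (count, name) pairs; raises ValueError if a jobs list is empty
        _, contributor['top_job'] = max(
            (jobs[job], job)
            for job in contributor['jobs']
        )
    return jobs

contributors = [
    {'username': 'bartonmichelle',
     'jobs': ['Teacher, special educational needs', 'Water engineer',
              'Intelligence analyst', 'Automotive engineer', 'Geoscientist'],
     'id': 173012},
    {'username': 'ahardin',
     'jobs': ['Water engineer', 'Private music teacher',
              'Administrator', 'Television camera operator'],
     'id': 113928},
]

jobs = add_top_jobs(contributors)
print('Top jobs:', jobs.most_common(5))
print([c['top_job'] for c in contributors])
```

On this sample, both contributors end up with 'Water engineer', matching the expected output in the question.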

As a more direct and obscure alternative, the top_job assignment can be written as

    contributor['top_job'] = max(
        contributor['jobs'], key=jobs.__getitem__,
    )
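One thing to be aware of: the two forms can disagree when several of a contributor's jobs share the highest count. The sketch below uses made-up counts to show the difference; by_tuple and by_key are hypothetical names.

```python
from collections import Counter

# Made-up counts in which two jobs tie for first place
jobs = Counter({'Administrator': 3, 'Water engineer': 3, 'Geoscientist': 1})
contributor_jobs = ['Administrator', 'Water engineer', 'Geoscientist']

# Tuple form: when counts are equal, the job strings themselves are compared
_, by_tuple = max((jobs[job], job) for job in contributor_jobs)

# key form: max() returns the first item that reaches the highest count
by_key = max(contributor_jobs, key=jobs.__getitem__)

print(by_tuple)  # Water engineer
print(by_key)    # Administrator
```

Which behaviour is preferable depends on how ties should be resolved, which the question does not specify.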
