
My script looks like this:

import json

with open('toy.json', 'rb') as inpt:
    lines = [json.loads(line) for line in inpt]

    for line in lines:
        records = [item['hash'] for item in lines]
    for item in records:
        print item

It reads in data where each line is valid JSON, but the file as a whole is not, because it's an aggregated dump from a web service.

The data looks, more or less, like this:

{"record":"value0","block":"0x79"} 
{"record":"value1","block":"0x80"} 

The code above works and lets me interact with the data as JSON, but it's so slow that it's essentially useless.

Is there a good way to speed up this process?

EDIT:

with open('toy.json', 'rb') as inpt:
    for line in inpt:
        print("identifier: "+json.loads(line)['identifier'])
        print("value:  "+json.loads(line)['value'])

EDIT II:

for line in inpt:
    resource = json.loads(line)
    print(resource['identifier']+", "+resource['value'])
  • Why do you construct records = [item['hash'] for item in lines] for each line? Commented Sep 16, 2017 at 17:50
  • so I can access the item by its JSON identifier and also so I can iterate over the whole file Commented Sep 16, 2017 at 17:51
  • but you never use line in the list comprehension, and in the list comprehension you already iterate over lines again. Commented Sep 16, 2017 at 17:53

2 Answers


You write:

for line in lines: 
    records = [item['hash'] for item in lines]

But this means that you construct that records list n times (with n the number of lines). This is wasted work, and it makes the time complexity O(n²).

You can speed this up with:

with open('toy.json', 'rb') as inpt:
    for item in [json.loads(line)['hash'] for line in inpt]:
        print item

Or you can reduce the memory burden by printing the hash each time you process a line:

with open('toy.json', 'rb') as inpt:
    for line in inpt:
        print json.loads(line)['hash']


excellent. thank you for these insights. Can't accept the answer till 9 minutes later, but I will as soon as I can.
this json.loads(line)['gas'] for line in inpt is totally awesome. Do you have some resource where I can learn about that?
in the edit I made some way of printing an identifier along with the value, do you think it's an ok approach?
@s.matthew.english: see list comprehension.
@s.matthew.english: yes, in that case it is fine.

If all you want to do is print and you are dealing with massive files you can split your file into n evenly sized chunks where n == number of cores in your CPU.
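The answer doesn't include code, but the idea can be sketched with the standard-library multiprocessing module: a worker pool parses lines in parallel while the main process prints the results. This is an illustrative sketch, not the answerer's implementation; the 'hash' field and the toy.json filename are taken from the question, and extract_hash is a helper name invented here. The sketch writes a tiny sample file first so it is self-contained.

```python
import json
from multiprocessing import Pool, cpu_count

def extract_hash(line):
    # Parse one JSON line and pull out the field of interest
    # ('hash' mirrors the question's code; adjust to your data).
    return json.loads(line)['hash']

if __name__ == '__main__':
    # Toy input: one JSON object per line, as in the question.
    with open('toy.json', 'w') as out:
        out.write('{"hash": "0x79"}\n{"hash": "0x80"}\n')

    # One worker per CPU core; imap streams results back in input
    # order, and chunksize batches lines so workers aren't handed
    # a single line at a time.
    with open('toy.json') as inpt, Pool(cpu_count()) as pool:
        for value in pool.imap(extract_hash, inpt, chunksize=1000):
            print(value)
```

Whether this beats the single-process loop depends on how expensive the per-line work is; for plain json.loads plus print, the inter-process overhead can easily outweigh the parallelism.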
