
My script looks like this:

import json

with open('toy.json', 'rb') as inpt:
    lines = [json.loads(line) for line in inpt]

    for line in lines:
        records = [item['hash'] for item in lines]
    for item in records:
        print item

It reads in data where each line is valid JSON, but the file as a whole is not, because it's an aggregated dump from a web service.

The data looks, more or less, like this:

{"record":"value0","block":"0x79"} 
{"record":"value1","block":"0x80"} 

The code above works and lets me interact with the data as JSON, but it's so slow that it's essentially useless.

Is there a good way to speed up this process?

EDIT:

with open('toy.json', 'rb') as inpt:
    for line in inpt:
        print("identifier: "+json.loads(line)['identifier'])
        print("value:  "+json.loads(line)['value'])

EDIT II:

for line in inpt:
    resource = json.loads(line)
    print(resource['identifier']+", "+resource['value'])
  • Why do you construct records = [item['hash'] for item in lines] for each line? Commented Sep 16, 2017 at 17:50
  • so I can access the item by its JSON identifier and also so I can iterate over the whole file Commented Sep 16, 2017 at 17:51
  • but you never use line in the list comprehension, and in the list comprehension you already iterate over lines again. Commented Sep 16, 2017 at 17:53

2 Answers


You write:

for line in lines: 
    records = [item['hash'] for item in lines]

But this means that you construct that records list n times (with n the number of lines). This is wasted work, and it makes the time complexity O(n²).

You can speed this up with:

with open('toy.json', 'rb') as inpt:
    for item in [json.loads(line)['hash'] for line in inpt]:
        print item

Or you can reduce the memory burden by printing the hash each time you process a line:

with open('toy.json', 'rb') as inpt:
    for line in inpt:
        print json.loads(line)['hash']


excellent. thank you for these insights. Can't accept the answer till 9 minutes later, but I will as soon as I can.
this json.loads(line)['gas'] for line in inpt is totally awesome. Do you have some resource where I can learn about that?
in the edit I made some way of printing an identifier along with the value, do you think it's an ok approach?
@s.matthew.english: see list comprehension.
@s.matthew.english: yes, in that case it is fine.

If all you want to do is print and you are dealing with massive files you can split your file into n evenly sized chunks where n == number of cores in your CPU.
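The answer doesn't include code, but the idea can be sketched with the standard-library multiprocessing module: a worker pool parses lines in parallel while the main process prints the results. This is an illustrative sketch, not the answerer's implementation; the 'hash' field and the toy.json filename are taken from the question, and extract_hash is a helper name invented here. The sketch writes a tiny sample file first so it is self-contained.

```python
import json
from multiprocessing import Pool, cpu_count

def extract_hash(line):
    # Parse one JSON line and pull out the field of interest
    # ('hash' mirrors the question's code; adjust to your data).
    return json.loads(line)['hash']

if __name__ == '__main__':
    # Toy input: one JSON object per line, as in the question.
    with open('toy.json', 'w') as out:
        out.write('{"hash": "0x79"}\n{"hash": "0x80"}\n')

    # One worker per CPU core; imap streams results back in input
    # order, and chunksize batches lines so workers aren't handed
    # a single line at a time.
    with open('toy.json') as inpt, Pool(cpu_count()) as pool:
        for value in pool.imap(extract_hash, inpt, chunksize=1000):
            print(value)
```

Whether this beats the single-process loop depends on how expensive the per-line work is; for plain json.loads plus print, the inter-process overhead can easily outweigh the parallelism.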
