
I have some files in an S3 bucket and I'm trying to read them in the fastest possible way. The files are gzip-compressed, and each one contains a single newline-delimited JSON file like this:

{"id":"test1", "created":"2020-01-01", "lastUpdated":"2020-01-01T00:00:00.000Z"}
{"id":"test2", "created":"2020-01-01", "lastUpdated":"2020-01-01T00:00:00.000Z"}

What I want to do is load the JSON file, read every single object, and process it. After some research, this is the only code that worked for me:

import json
import gzip
import boto3
from io import BytesIO

s3 = boto3.resource('s3')
bucket = s3.Bucket("my-bucket")

for obj in bucket.objects.filter(Prefix='my-prefix').all():
    # read the whole compressed object into memory
    buffer = BytesIO(obj.get()['Body'].read())
    gzipfile = gzip.GzipFile(fileobj=buffer)
    # decompress and parse one JSON object per line
    for line in gzipfile:
        json_object = json.loads(line)
        # some stuff with the json_object

Does anyone know a better way to read the JSON objects?

Thanks for the help.

2 Answers


After some research, I found the smart-open library very useful and simple to use.

from smart_open import open
import json
import boto3

s3_client = boto3.client("s3")
source_uri = 's3://my-bucket/my-path'
for json_line in open(source_uri, transport_params={"client": s3_client}):
    my_json = json.loads(json_line)

It reads the object as a stream, so you don't need to hold the entire file in memory. It also recognizes the file extension, so you don't need to handle the gzip decompression yourself.
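For completeness, here is a sketch of how this could be combined with the prefix listing from the question; the bucket and prefix names are placeholders:

import json
import boto3
from smart_open import open

s3_client = boto3.client("s3")
bucket = boto3.resource("s3").Bucket("my-bucket")  # placeholder name

for obj in bucket.objects.filter(Prefix="my-prefix"):
    # smart-open infers gzip from the .gz extension and streams line by line
    uri = f"s3://{obj.bucket_name}/{obj.key}"
    for json_line in open(uri, transport_params={"client": s3_client}):
        my_json = json.loads(json_line)
        # process my_json here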


1 Comment

And if your ultimate goal is to create a pandas dataframe, just add one more line: my_df = pd.DataFrame(my_json)
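One caveat on that one-liner: for flat objects of scalar values like the ones in the question, pd.DataFrame(my_json) raises "If using all scalar values, you must pass an index", and building one DataFrame per line is slow anyway. A sketch of an alternative, assuming pandas is installed and reusing source_uri and s3_client from the answer above, is to collect the parsed lines and build the DataFrame once:

import json
import pandas as pd
from smart_open import open

records = []
for json_line in open(source_uri, transport_params={"client": s3_client}):
    records.append(json.loads(json_line))

# one row per JSON object
my_df = pd.DataFrame(records)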

After you have the buffer, try the following:

decompressed = gzip.decompress(buffer)
json_lines = json.loads(decompressed)
for json_obj in json_lines:
    # Do stuff

3 Comments

I tried your solution, but it returned "TypeError: a bytes-like object is required, not '_io.BytesIO'". So I tried removing the BytesIO, but I still get the error "json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 76100)", presumably because the file contains multiple JSON objects.
Actually, I reviewed your question one more time and realized I was wrongly assuming the zipped file is a single JSON document. It is not; it is a file containing multiple JSON objects. Either use your existing working code (you have to json.loads each object/line separately), or modify the files to be valid JSON, e.g. [{...},{...},{...}] instead of {...} {...} {...}, and then you will be able to load everything at once.
Then the way you described above is the fastest: although each line is valid JSON, the lines together, without a containing array, are not a valid JSON document, so they cannot be loaded all at once with a standard JSON parser and loading them individually will have to do.
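To make that concrete, here is a small self-contained illustration (the sample bytes are made up):

import json

# made-up sample mirroring the file layout from the question
data = b'{"id": "test1"}\n{"id": "test2"}\n'

# json.loads(data) would raise json.JSONDecodeError: Extra data: line 2 column 1

# line-by-line parsing works because each line is valid JSON on its own
objects = [json.loads(line) for line in data.splitlines() if line.strip()]

# wrapping the lines in a JSON array makes the whole payload one valid document
as_array = b"[" + b",".join(data.splitlines()) + b"]"
all_at_once = json.loads(as_array)

assert objects == all_at_once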
