
I have an AWS Kinesis Python producer program that sends data to my stream, but my JSON file is 5 MB. I would like to compress the data using GZIP or another suitable method. My producer code is like this:

import boto3
import json
import csv
from datetime import datetime
import calendar
import time
import random

# putting data to Kinesis
my_stream_name = 'ApacItTeamTstOrderStream'
kinesis_client = boto3.client('kinesis', region_name='us-east-1')

with open('output.json', 'r') as file:
    for line in file:
        put_response = kinesis_client.put_record(
            StreamName=my_stream_name,
            Data=line,
            PartitionKey=str(random.randrange(3000)))
        print(put_response)

My requirement is:

I need to compress this data, push the compressed data to Kinesis, and then decompress it when we consume it.

Since I am very new to this, can someone guide me or suggest what I should add to the existing code?

  • Look at the zlib and gzip modules. Commented Jul 14, 2020 at 3:07
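For illustration, a minimal sketch of what those standard-library modules give you (the payload here is made up):

import gzip

payload = b'{"order_id": 1, "status": "NEW"}'   # any bytes payload

compressed = gzip.compress(payload)       # bytes -> gzip-compressed bytes
restored = gzip.decompress(compressed)    # and back again

assert restored == payload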

1 Answer


There are two ways in which you can compress the data:

1. Enable GZIP/Snappy compression on the Firehose stream - this can be done from the console itself

Firehose buffers the data and, once the buffering threshold is reached, compresses the whole batch together to create the .gz object.
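If you would rather script this than click through the console, the same setting can be supplied when the delivery stream is created. A rough boto3 sketch, where the stream name, role ARN and bucket ARN are placeholders and the rest of the destination configuration depends on your setup:

import boto3

firehose = boto3.client('firehose', region_name='us-east-1')

firehose.create_delivery_stream(
    DeliveryStreamName='ApacItTeamTstOrderDeliveryStream',   # hypothetical name
    DeliveryStreamType='DirectPut',
    ExtendedS3DestinationConfiguration={
        'RoleARN': 'arn:aws:iam::123456789012:role/firehose-delivery-role',
        'BucketARN': 'arn:aws:s3:::my-firehose-bucket',
        'CompressionFormat': 'GZIP',   # or 'Snappy'; Firehose compresses each buffered batch
    },
)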

Pros :

  • Minimal effort required on the producer side - just change the setting in the console.
  • Minimal effort required on the consumer side - Firehose creates .gz objects in S3 and sets the metadata on the objects to reflect the compression type. Hence, if you read the data via the AWS SDK itself, the SDK will do the decompression for you.

Cons :

  • Since Firehose charges on the size of data ingested, you will not save on Firehose cost; you will only save on S3 cost (due to the smaller object size).

2. Compression in the producer code - you need to write the code yourself

I implemented this in Java a few days back. We were ingesting over 100 petabytes of data into Firehose (from where it gets written to S3), and this was a massive cost for us.

So we decided to do the compression on the producer side. This results in compressed data flowing to Kinesis Firehose, which is written to S3 as-is. Note that since Firehose is not the one compressing it, it has no idea what the data is; as a result, the objects created in S3 don't carry the ".gz" extension or compression metadata, and the consumers are none the wiser as to what data is in the objects. We therefore wrote a wrapper on top of the AWS Java SDK for S3 which reads the object and decompresses it (a Python sketch of the same idea follows the pros and cons below).
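In Python, the producer-side compression could look roughly like this, adapting the loop from the question and using the standard-library gzip module (the stream name and partition-key scheme are taken from the question; compressing each line individually is just one option, you could also batch lines together before compressing):

import gzip
import random
import boto3

my_stream_name = 'ApacItTeamTstOrderStream'
kinesis_client = boto3.client('kinesis', region_name='us-east-1')

with open('output.json', 'rb') as file:           # read bytes so gzip can handle them directly
    for line in file:
        compressed = gzip.compress(line)          # raw JSON line -> gzip-compressed bytes
        put_response = kinesis_client.put_record(
            StreamName=my_stream_name,
            Data=compressed,                      # Kinesis accepts binary payloads
            PartitionKey=str(random.randrange(3000)))
        print(put_response)

Keep in mind that a single Kinesis record is limited to 1 MB, so compressing the whole 5 MB file into one record only works if the compressed result stays under that limit; compressing line by line (or in small batches) avoids the problem.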

Pros :

  • Our compression factor was close to 90%, which directly resulted in a 90% saving on Firehose cost, plus the additional S3 savings as in approach 1.

Cons :

  • Not exactly a con, but more development effort is required: creating a wrapper on top of the AWS SDK, testing effort, etc.
  • Compression and decompression are CPU intensive. On average, the two together increased our CPU utilization by 22%.

6 Comments

Thanks a lot, Nishit... your answer is well explained. I have been instructed to use compression in the producer code, and I am trying to use GZIP or other methods. Since I write my code in Python, I will check if any libraries are available; this is the first time I need to use a compression method, so I am totally unaware of the code involved.
Python does support compression and decompression.
Are you sure about that? I could see many were using the gzip module in Python to compress or decompress.
Yes, you can use that module to do it. What's the issue there?
Do you have any sample code for pushing a gzipped file to Kinesis?
