
I have an AWS Kinesis Python producer program that sends data to my stream, but my JSON file is 5 MB. I would like to compress the data using GZIP or another suitable method. My producer code is like this:

import boto3
import json
import csv
from datetime import datetime
import calendar
import time
import random

# putting data to Kinesis
my_stream_name = 'ApacItTeamTstOrderStream'
kinesis_client = boto3.client('kinesis', region_name='us-east-1')

with open('output.json', 'r') as file:
    for line in file:
        put_response = kinesis_client.put_record(
            StreamName=my_stream_name,
            Data=line,
            PartitionKey=str(random.randrange(3000)))
        print(put_response)

My requirement is:

I need to compress this data, push the compressed data to Kinesis, and then decompress it when we consume it.

Since I am very new to this, can someone guide me or suggest what I should add to the existing code?

  • Look at the zlib and gzip modules. Commented Jul 14, 2020 at 3:07
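For illustration, a minimal sketch of what those standard-library modules give you (the payload here is made up):

import gzip

payload = b'{"order_id": 1, "status": "NEW"}'   # any bytes payload

compressed = gzip.compress(payload)       # bytes -> gzip-compressed bytes
restored = gzip.decompress(compressed)    # and back again

assert restored == payload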

1 Answer


There are two ways in which you can compress the data:

1. Enable GZIP/Snappy compression on the Firehose stream - this can be done from the console itself

Firehose buffers the data and, once the buffering threshold is reached, compresses the whole batch together to create the .gz object.
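If you would rather script this than click through the console, the same setting can be supplied when the delivery stream is created. A rough boto3 sketch, where the stream name, role ARN and bucket ARN are placeholders and the rest of the destination configuration depends on your setup:

import boto3

firehose = boto3.client('firehose', region_name='us-east-1')

firehose.create_delivery_stream(
    DeliveryStreamName='ApacItTeamTstOrderDeliveryStream',   # hypothetical name
    DeliveryStreamType='DirectPut',
    ExtendedS3DestinationConfiguration={
        'RoleARN': 'arn:aws:iam::123456789012:role/firehose-delivery-role',
        'BucketARN': 'arn:aws:s3:::my-firehose-bucket',
        'CompressionFormat': 'GZIP',   # or 'Snappy'; Firehose compresses each buffered batch
    },
)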

Pros :

  • Minimal effort required on the producer side - just change the setting in the console.
  • Minimal effort required on the consumer side - Firehose creates .gz objects in S3 and sets the metadata on the objects to reflect the compression type. Hence, if you read the data via the AWS SDK itself, the SDK will do the decompression for you.

Cons :

  • Since Firehose charges on the size of data ingested, you will not save on Firehose cost; you will only save on S3 cost (due to the smaller object size).

2. Compression in the producer code - you need to write the code yourself

I implemented this in Java a few days back. We were ingesting over 100 petabytes of data into Firehose (from where it gets written to S3), and this was a massive cost for us.

So we decided to do the compression on the producer side. This results in compressed data flowing to Kinesis Firehose, which is written to S3 as-is. Note that since Firehose is not the one compressing it, it has no idea what the data is; as a result, the objects created in S3 don't carry the ".gz" extension or compression metadata, and the consumers are none the wiser as to what data is in the objects. We therefore wrote a wrapper on top of the AWS Java SDK for S3 which reads the object and decompresses it (a Python sketch of the same idea follows the pros and cons below).
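In Python, the producer-side compression could look roughly like this, adapting the loop from the question and using the standard-library gzip module (the stream name and partition-key scheme are taken from the question; compressing each line individually is just one option, you could also batch lines together before compressing):

import gzip
import random
import boto3

my_stream_name = 'ApacItTeamTstOrderStream'
kinesis_client = boto3.client('kinesis', region_name='us-east-1')

with open('output.json', 'rb') as file:           # read bytes so gzip can handle them directly
    for line in file:
        compressed = gzip.compress(line)          # raw JSON line -> gzip-compressed bytes
        put_response = kinesis_client.put_record(
            StreamName=my_stream_name,
            Data=compressed,                      # Kinesis accepts binary payloads
            PartitionKey=str(random.randrange(3000)))
        print(put_response)

Keep in mind that a single Kinesis record is limited to 1 MB, so compressing the whole 5 MB file into one record only works if the compressed result stays under that limit; compressing line by line (or in small batches) avoids the problem.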

Pros :

  • Our compression factor was close to 90%, which directly resulted in a 90% saving on Firehose cost, plus the additional S3 savings as in approach 1.

Cons :

  • Not exactly a con, but more development effort is required: creating a wrapper on top of the AWS SDK, testing effort, etc.
  • Compression and decompression are CPU intensive. On average, the two together increased our CPU utilization by 22%.

6 Comments

Thanks a lot, Nishit... your answer is well explained. I have been instructed to use compression in the producer code, and I am trying to use GZIP or other methods. Since I write my code in Python, I will check if any libraries are available; this is the first time I need to use a compression method, so I am totally unaware of the code involved.
Python does support compression and decompression.
Are you sure about that? I could see many were using the gzip module in Python to compress or decompress.
Yes, you can use that module to do it. What's the issue there?
Do you have any sample code for pushing a gzipped file to Kinesis?
