
Can I compress data from Azure Blob to gzip as I download it? I would like to avoid having all data in memory if possible.

I tried two different approaches, the compress_chunk and compress_blob functions below. I am not sure whether the entire blob ends up in memory before compression, or whether there is some way to compress it as it is read in.

import gzip
import io

def compress_chunk(data):
    # Rewind the source stream, then feed it to the gzip writer 4 MiB at a time
    data.seek(0)
    compressed_body = io.BytesIO()
    with gzip.open(compressed_body, mode='wb') as compressor:
        while True:
            chunk = data.read(1024 * 1024 * 4)
            if not chunk:
                break
            compressor.write(chunk)
    compressed_body.seek(0)
    return compressed_body

def compress_blob(data):
    # Compresses the whole payload in one call; this requires the full blob in memory
    return gzip.compress(data.getvalue())

def process_download(container_name, blob):
    with io.BytesIO() as input_io:
        # Downloads the entire blob into the in-memory stream before compressing
        blob_service.get_blob_to_stream(container_name=container_name, blob_name=blob.name, stream=input_io)
        compressed_body = compress_chunk(data=input_io)
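A sketch (not from the original post) showing that chunk-by-chunk compression is possible independently of the SDK: `zlib.compressobj` with `wbits=31` emits gzip-format output incrementally from any iterable of byte chunks, such as the pieces a streaming download yields. Only one chunk is resident in memory at a time.

```python
import gzip
import zlib

def gzip_chunks(chunks):
    """Compress an iterable of byte chunks incrementally.

    wbits=31 tells zlib to produce the gzip container format, so the
    concatenated output is a valid .gz stream."""
    compressor = zlib.compressobj(wbits=31)
    for chunk in chunks:
        compressed = compressor.compress(chunk)
        if compressed:
            yield compressed
    # Emit any buffered data plus the gzip trailer
    yield compressor.flush()

# Local stand-in for chunks arriving from a blob download
chunks = [b"hello " * 10000, b"world " * 10000]
compressed = b"".join(gzip_chunks(chunks))
assert gzip.decompress(compressed) == b"hello " * 10000 + b"world " * 10000
```

The output pieces can be written to disk or re-uploaded as they are produced, so peak memory stays proportional to the chunk size rather than the blob size.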
  • I don't suppose you ever figured out how to do this without reading everything into memory or a local storage? Commented Apr 17, 2023 at 12:32
  • I don't believe so. This might have started my journey with converting everything to Parquet files instead and I just let pyarrow handle the compression stuff Commented Apr 17, 2023 at 14:24
  • We're on the route with Parquet files but because of legacy support for now had to do this. I eventually came across a library called smart_open that helped me solve my problem. Definitely worth a shot to anyone who sees this comment. Using smart_open with gzip to compress blobs. Commented Apr 20, 2023 at 10:23

1 Answer


I think you already know how to compress data, so the following is just to clarify a few points.

I am not sure if the entire blob was in memory though before compression.

When we need to download blob data for processing, we use the official method to download it as a stream. It is not written to disk, but it does occupy memory allocated to the program.

Azure does not provide a method to compress the data server-side before it is downloaded; see the BlobClient method list:

https://learn.microsoft.com/en-us/python/api/azure-storage-blob/azure.storage.blob.blobclient?view=azure-python#methods

Therefore, when we want to process the data, we must first download it, and even a streamed download consumes memory as it is read.
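That said, memory can stay bounded in practice: chunks pulled from the download stream can be written straight into a gzip file on disk as they arrive. A minimal sketch, where a list of byte strings stands in for whatever the SDK's streaming download produces (the Azure calls themselves are omitted):

```python
import gzip
import os
import tempfile

def stream_to_gzip_file(chunks, dest_path):
    # Memory use is bounded by the chunk size, not the total blob size:
    # each chunk is compressed and flushed to disk before the next is read.
    with gzip.open(dest_path, "wb") as gz:
        for chunk in chunks:
            gz.write(chunk)

# Local stand-in for a streamed blob download
payload_chunks = [b"x" * 65536 for _ in range(8)]
dest = os.path.join(tempfile.mkdtemp(), "blob.gz")
stream_to_gzip_file(payload_chunks, dest)

# Round-trip check: the file decompresses back to the original payload
with gzip.open(dest, "rb") as f:
    assert f.read() == b"x" * 65536 * 8
```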


1 Comment

And there is no way to compress "chunks" as they are downloaded? As I understand it, data from a stream is downloaded in chunks of bytes; I wonder whether each chunk could be compressed to minimize the memory footprint?
