
Can I compress data from Azure Blob to gzip as I download it? I would like to avoid having all data in memory if possible.

I tried two different approaches, the compress_chunk and compress_blob functions below. I am not sure whether the entire blob ends up in memory before compression, or whether there is some way to compress it as it is read in.

import gzip
import io

def compress_chunk(data):
    # Rewind the source stream, then feed it to the gzip writer 4 MiB at a time
    data.seek(0)
    compressed_body = io.BytesIO()
    with gzip.open(compressed_body, mode='wb') as compressor:
        while True:
            chunk = data.read(1024 * 1024 * 4)
            if not chunk:
                break
            compressor.write(chunk)
    compressed_body.seek(0)
    return compressed_body

def compress_blob(data):
    # Compresses the whole payload in one call; this requires the full blob in memory
    return gzip.compress(data.getvalue())

def process_download(container_name, blob):
    with io.BytesIO() as input_io:
        # Downloads the entire blob into the in-memory stream before compressing
        blob_service.get_blob_to_stream(container_name=container_name, blob_name=blob.name, stream=input_io)
        compressed_body = compress_chunk(data=input_io)
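A sketch (not from the original post) showing that chunk-by-chunk compression is possible independently of the SDK: `zlib.compressobj` with `wbits=31` emits gzip-format output incrementally from any iterable of byte chunks, such as the pieces a streaming download yields. Only one chunk is resident in memory at a time.

```python
import gzip
import zlib

def gzip_chunks(chunks):
    """Compress an iterable of byte chunks incrementally.

    wbits=31 tells zlib to produce the gzip container format, so the
    concatenated output is a valid .gz stream."""
    compressor = zlib.compressobj(wbits=31)
    for chunk in chunks:
        compressed = compressor.compress(chunk)
        if compressed:
            yield compressed
    # Emit any buffered data plus the gzip trailer
    yield compressor.flush()

# Local stand-in for chunks arriving from a blob download
chunks = [b"hello " * 10000, b"world " * 10000]
compressed = b"".join(gzip_chunks(chunks))
assert gzip.decompress(compressed) == b"hello " * 10000 + b"world " * 10000
```

The output pieces can be written to disk or re-uploaded as they are produced, so peak memory stays proportional to the chunk size rather than the blob size.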
  • I don't suppose you ever figured out how to do this without reading everything into memory or a local storage? Commented Apr 17, 2023 at 12:32
  • I don't believe so. This might have started my journey with converting everything to Parquet files instead and I just let pyarrow handle the compression stuff Commented Apr 17, 2023 at 14:24
  • We're on the route with Parquet files but because of legacy support for now had to do this. I eventually came across a library called smart_open that helped me solve my problem. Definitely worth a shot to anyone who sees this comment. Using smart_open with gzip to compress blobs. Commented Apr 20, 2023 at 10:23

1 Answer


I think you already know how to compress data, so the following is just to clarify a few points.

I am not sure if the entire blob was in memory though before compression.

When we need to download blob data for processing, we use the official method to download it as a stream. It is not written to disk, but it does occupy memory allocated to the program.

Azure does not provide a method to compress the data server-side before it is downloaded; see the BlobClient method list:

https://learn.microsoft.com/en-us/python/api/azure-storage-blob/azure.storage.blob.blobclient?view=azure-python#methods

Therefore, when we want to process the data, we must first download it, and even a streamed download consumes memory as it is read.
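That said, memory can stay bounded in practice: chunks pulled from the download stream can be written straight into a gzip file on disk as they arrive. A minimal sketch, where a list of byte strings stands in for whatever the SDK's streaming download produces (the Azure calls themselves are omitted):

```python
import gzip
import os
import tempfile

def stream_to_gzip_file(chunks, dest_path):
    # Memory use is bounded by the chunk size, not the total blob size:
    # each chunk is compressed and flushed to disk before the next is read.
    with gzip.open(dest_path, "wb") as gz:
        for chunk in chunks:
            gz.write(chunk)

# Local stand-in for a streamed blob download
payload_chunks = [b"x" * 65536 for _ in range(8)]
dest = os.path.join(tempfile.mkdtemp(), "blob.gz")
stream_to_gzip_file(payload_chunks, dest)

# Round-trip check: the file decompresses back to the original payload
with gzip.open(dest, "rb") as f:
    assert f.read() == b"x" * 65536 * 8
```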


1 Comment

And there is no way to compress "chunks" as they are downloaded? As I understand it, data from a stream is downloaded in chunks of bytes; I wonder whether each chunk could be compressed to minimize the memory footprint?
