
I'm using Python 3.6 with the Azure Blob Storage SDK for Python, version 1.5.0, and I would like to merge multiple Azure blobs into a single local file. I managed to do that, but when I try to append a blob that exceeds the machine's memory, the operation fails. What is the best way to write the blob content to a file in chunks? This is my code, which does not work for blobs bigger than the machine's memory:

blob_files_names = blob_service.list_blob_names(container_name=blob_container_name, prefix=prefix)
with open(trg_path, 'wb') as file:
    for blob_file_name in blob_files_names:
        blob = blob_service.get_blob_to_bytes(container_name=blob_container_name, blob_name=blob_file_name)
        file.write(blob.content)

2 Answers


Finally I managed to do it by using the function get_blob_to_path with the append open mode. It works because this function writes each blob's content to the end of the file in chunks of size MAX_CHUNK_GET_SIZE:

blob_files_names = blob_service.list_blob_names(container_name=blob_container_name, prefix=prefix)
for blob_file_name in blob_files_names:
    blob_service.get_blob_to_path(container_name=blob_container_name, blob_name=blob_file_name,
                                  file_path=trg_path, max_connections=1, open_mode='ab')
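For clarity, the append pattern used above can be sketched with a generic helper that copies any readable stream to a file in fixed-size chunks; the chunk size and names here are illustrative, not taken from the SDK:

```python
from io import BytesIO

CHUNK_SIZE = 4 * 1024 * 1024  # illustrative; the SDK's own constant is MAX_CHUNK_GET_SIZE

def append_stream(src, dst_path, chunk_size=CHUNK_SIZE):
    """Append the contents of a readable stream to dst_path in chunks,
    so only one chunk is held in memory at a time."""
    with open(dst_path, 'ab') as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # empty read means the stream is exhausted
                break
            dst.write(chunk)
```

Opening the target with 'ab' for every blob is what lets successive downloads accumulate into one file instead of overwriting it.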

Per my experience, it sounds like some of your blobs exceed the memory size of your local machine, because the function get_blob_to_bytes you used reads the whole blob content into memory before writing it out.

So please use the function get_blob_to_stream instead of get_blob_to_bytes.

Here is my sample code; my virtual environment is based on Python 3.7 with the Azure Storage SDK installed via pip install azure-storage-blob==1.5.0.

from azure.storage.blob.baseblobservice import BaseBlobService

account_name = '<your account name>'
account_key = '<your account key>'

blob_service = BaseBlobService(account_name, account_key)

blob_container_name = '<your container name>'
prefix = '<your blob prefix>'
blob_files_names = blob_service.list_blob_names(container_name=blob_container_name, prefix=prefix)

# from io import BytesIO
from io import FileIO

trg_path = '<your target file path>'
# with open(trg_path, 'wb') as file:
with FileIO(trg_path, 'wb') as file:
    for blob_file_name in blob_files_names:
        #blob = blob_service.get_blob_to_bytes(container_name=blob_container_name, blob_name=blob_file_name)
        print(blob_file_name)
        # stream = BytesIO()
        blob_service.get_blob_to_stream(container_name=blob_container_name, blob_name=blob_file_name, stream=file)
        # file.write(stream.getbuffer())

Note: the getbuffer() method of the BytesIO stream above does not create a copy of the values in the BytesIO buffer, and hence does not consume large amounts of memory.
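The no-copy behaviour of getbuffer() is easy to verify: it returns a memoryview over the BytesIO object's internal buffer, so mutating the view mutates the stream's contents directly:

```python
from io import BytesIO

buf = BytesIO(b'hello')
view = buf.getbuffer()   # memoryview over the internal buffer, no copy made
view[0:1] = b'H'         # writing through the view changes the stream itself
view.release()           # release the view before the BytesIO can be resized
print(buf.getvalue())    # b'Hello'
```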


Thanks Peter, I tried it and it doesn't work. The whole blob content is written to memory before being written to disk. I probably need to write it in chunks.
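If the stream variant still buffers too much, one fallback is to download the blob in explicit byte ranges; my understanding (an assumption, not verified here) is that in azure-storage-blob 1.5.0 the get_blob_to_* functions accept start_range/end_range parameters for this. The range arithmetic itself is generic and can be sketched separately; range_reader below is a hypothetical callable standing in for something like blob_service.get_blob_to_stream(..., start_range=start, end_range=end, stream=out):

```python
def iter_ranges(total_size, chunk_size):
    """Yield inclusive (start, end) byte ranges covering total_size bytes."""
    for start in range(0, total_size, chunk_size):
        yield start, min(start + chunk_size, total_size) - 1

def download_in_ranges(range_reader, total_size, out_file, chunk_size=4 * 1024 * 1024):
    """Copy a remote object to out_file one range at a time.

    range_reader(start, end, stream) must write bytes [start, end] of the
    source into stream; with the Azure SDK it could wrap a ranged
    get_blob_to_stream call -- an assumption about the 1.5.0 API.
    """
    for start, end in iter_ranges(total_size, chunk_size):
        range_reader(start, end, out_file)
```

Each iteration holds at most one chunk in flight, so memory use is bounded by chunk_size regardless of blob size.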
