
I'm using Python 3.6 with the Azure Blob Storage SDK for Python, version 1.5.0, and I would like to merge multiple Azure blobs into a single local file. I managed to do that, but when I try to append a blob that exceeds the machine's memory, the operation fails. What is the best way to write the blob content to a file in chunks? This is my code, which does not work for blobs bigger than the machine's memory:

blob_files_names = blob_service.list_blob_names(container_name=blob_container_name, prefix=prefix)
with open(trg_path, 'wb') as file:
    for blob_file_name in blob_files_names:
        blob = blob_service.get_blob_to_bytes(container_name=blob_container_name, blob_name=blob_file_name)
        file.write(blob.content)

2 Answers


Finally I managed to do it by using the function get_blob_to_path with the append open mode. It works because this function writes each blob's content to the end of the file in chunks of size MAX_CHUNK_GET_SIZE:

blob_files_names = blob_service.list_blob_names(container_name=blob_container_name, prefix=prefix)
for blob_file_name in blob_files_names:
    blob_service.get_blob_to_path(container_name=blob_container_name, blob_name=blob_file_name,
                                  file_path=trg_path, max_connections=1, open_mode='ab')
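For clarity, the append pattern used above can be sketched with a generic helper that copies any readable stream to a file in fixed-size chunks; the chunk size and names here are illustrative, not taken from the SDK:

```python
from io import BytesIO

CHUNK_SIZE = 4 * 1024 * 1024  # illustrative; the SDK's own constant is MAX_CHUNK_GET_SIZE

def append_stream(src, dst_path, chunk_size=CHUNK_SIZE):
    """Append the contents of a readable stream to dst_path in chunks,
    so only one chunk is held in memory at a time."""
    with open(dst_path, 'ab') as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # empty read means the stream is exhausted
                break
            dst.write(chunk)
```

Opening the target with 'ab' for every blob is what lets successive downloads accumulate into one file instead of overwriting it.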

Per my experience, it sounds like some of your blobs exceed the memory size of your local machine, because the function get_blob_to_bytes you used reads the whole blob content into memory before writing it out.

So please use the function get_blob_to_stream instead of get_blob_to_bytes.

Here is my sample code; my virtual environment is based on Python 3.7 with the Azure Storage SDK installed via pip install azure-storage-blob==1.5.0.

from azure.storage.blob.baseblobservice import BaseBlobService

account_name = '<your account name>'
account_key = '<your account key>'

blob_service = BaseBlobService(account_name, account_key)

blob_container_name = '<your container name>'
prefix = '<your blob prefix>'
blob_files_names = blob_service.list_blob_names(container_name=blob_container_name, prefix=prefix)

# from io import BytesIO
from io import FileIO

trg_path = '<your target file path>'
# with open(trg_path, 'wb') as file:
with FileIO(trg_path, 'wb') as file:
    for blob_file_name in blob_files_names:
        #blob = blob_service.get_blob_to_bytes(container_name=blob_container_name, blob_name=blob_file_name)
        print(blob_file_name)
        # stream = BytesIO()
        blob_service.get_blob_to_stream(container_name=blob_container_name, blob_name=blob_file_name, stream=file)
        # file.write(stream.getbuffer())

Note: the getbuffer() method of the BytesIO stream above does not create a copy of the values in the BytesIO buffer, and hence does not consume large amounts of memory.
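The no-copy behaviour of getbuffer() is easy to verify: it returns a memoryview over the BytesIO object's internal buffer, so mutating the view mutates the stream's contents directly:

```python
from io import BytesIO

buf = BytesIO(b'hello')
view = buf.getbuffer()   # memoryview over the internal buffer, no copy made
view[0:1] = b'H'         # writing through the view changes the stream itself
view.release()           # release the view before the BytesIO can be resized
print(buf.getvalue())    # b'Hello'
```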


Thanks Peter, I tried it and it doesn't work. The whole blob content is written to memory before being written to disk. I probably need to write it in chunks.
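If the stream variant still buffers too much, one fallback is to download the blob in explicit byte ranges; my understanding (an assumption, not verified here) is that in azure-storage-blob 1.5.0 the get_blob_to_* functions accept start_range/end_range parameters for this. The range arithmetic itself is generic and can be sketched separately; range_reader below is a hypothetical callable standing in for something like blob_service.get_blob_to_stream(..., start_range=start, end_range=end, stream=out):

```python
def iter_ranges(total_size, chunk_size):
    """Yield inclusive (start, end) byte ranges covering total_size bytes."""
    for start in range(0, total_size, chunk_size):
        yield start, min(start + chunk_size, total_size) - 1

def download_in_ranges(range_reader, total_size, out_file, chunk_size=4 * 1024 * 1024):
    """Copy a remote object to out_file one range at a time.

    range_reader(start, end, stream) must write bytes [start, end] of the
    source into stream; with the Azure SDK it could wrap a ranged
    get_blob_to_stream call -- an assumption about the 1.5.0 API.
    """
    for start, end in iter_ranges(total_size, chunk_size):
        range_reader(start, end, out_file)
```

Each iteration holds at most one chunk in flight, so memory use is bounded by chunk_size regardless of blob size.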
