5

I need to archive multiple files that exist on S3 and then upload the archive back to S3. I am trying to use Lambda and Python. As some of the files are more than 500 MB, downloading them into '/tmp' is not an option. Is there any way to stream the files one by one and put them into an archive?

3
  • Yes, there is. What did you search for, and what did you find? What did you try, and how did it fail? If 500MB is too much for your /tmp, increasing the space there seems like the easiest way forward; if you don't have a lot of disk, what are the chances you have enough memory to keep the file in RAM entirely? Commented Jun 21, 2021 at 10:00
  • Lambda could prove expensive for this task IMO. Commented Sep 10, 2021 at 21:45
  • Since this Question was written, AWS Lambda has added the ability to request larger /tmp/ storage. Commented Jan 11, 2023 at 22:46
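
(Side note on that last comment: larger ephemeral storage can be requested per function. A minimal sketch with boto3, where the function name is a placeholder and Size is in MB, up to 10240:)

import boto3

lambda_client = boto3.client('lambda')

# Hypothetical function name; Size is the /tmp allocation in MB (512-10240)
lambda_client.update_function_configuration(
    FunctionName='my-archive-function',
    EphemeralStorage={'Size': 2048}
)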

4 Answers

7

Do not write to disk, stream to and from S3

Stream the zip file from the source bucket, read its contents on the fly with Python, and write them back to another S3 bucket.

This method does not use up disk space and therefore is not limited by size.

The basic steps are:

  • Read the zip file from S3 using the Boto3 S3 resource Object into a BytesIO buffer object
  • Open the object using the zipfile module
  • Iterate over each file in the zip file using the namelist method
  • Write the file back to another bucket in S3 using the resource meta.client.upload_fileobj method

The code (Python 3.6, using Boto3):

import zipfile
from io import BytesIO

import boto3

s3_resource = boto3.resource('s3')

# zip_key is the key of the source zip object, assumed defined elsewhere
zip_obj = s3_resource.Object(bucket_name="bucket_name_here", key=zip_key)
buffer = BytesIO(zip_obj.get()["Body"].read())

# Open the in-memory zip and copy each member to the destination bucket
z = zipfile.ZipFile(buffer)
for filename in z.namelist():
    file_info = z.getinfo(filename)
    s3_resource.meta.client.upload_fileobj(
        z.open(filename),
        Bucket=bucket,  # destination bucket name, assumed defined elsewhere
        Key=f'{filename}'
    )

Note: The AWS Lambda execution time limit is a maximum of 15 minutes, so whether your huge files can be processed within that window is something you can only find out by testing.
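
If the 15-minute cap is a concern, one pattern (a hedged sketch, not part of the answer above) is to check the remaining invocation time via the Lambda context object and stop cleanly before the hard limit; process_file and the event shape here are hypothetical:

SAFETY_MARGIN_MS = 60_000  # stop when less than a minute remains

def lambda_handler(event, context):
    processed = []
    for key in event.get('keys', []):
        # get_remaining_time_in_millis() is provided by the Lambda runtime context
        if context.get_remaining_time_in_millis() < SAFETY_MARGIN_MS:
            break  # not enough time left; a follow-up invocation can finish the rest
        process_file(key)  # hypothetical per-file work
        processed.append(key)
    return {'processed': processed}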


1 Comment

I think the OP asked for the other way around: they need to zip files, not unzip them.
6

AWS Lambda code: create a zip archive from the files with a given extension under bucket/filePath.


import zipfile
from io import BytesIO

import boto3

s3 = boto3.resource('s3')


def createZipFileStream(bucketName, bucketFilePath, jobKey, fileExt, createUrl=False):
    """Zip all objects under bucketFilePath with the given extension into one archive in S3."""
    response = {}
    bucket = s3.Bucket(bucketName)
    filesCollection = bucket.objects.filter(Prefix=bucketFilePath).all()
    archive = BytesIO()

    # Build the zip in memory, adding each matching object as a member
    with zipfile.ZipFile(archive, 'w', zipfile.ZIP_DEFLATED) as zip_archive:
        for file in filesCollection:
            if file.key.endswith('.' + fileExt):
                with zip_archive.open(file.key, 'w') as file1:
                    file1.write(file.get()['Body'].read())

    # Rewind the buffer and upload the finished archive back to S3
    archive.seek(0)
    s3.Object(bucketName, bucketFilePath + '/' + jobKey + '.zip').upload_fileobj(archive)
    archive.close()

    response['fileUrl'] = None

    # Optionally return a pre-signed URL for downloading the archive
    if createUrl is True:
        s3Client = boto3.client('s3')
        response['fileUrl'] = s3Client.generate_presigned_url(
            'get_object',
            Params={'Bucket': bucketName,
                    'Key': bucketFilePath + '/' + jobKey + '.zip'},
            ExpiresIn=3600)

    return response
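
If it helps, a minimal handler sketch showing how this function might be invoked from a Lambda event (the event field names are assumptions, not part of the answer):

def lambda_handler(event, context):
    # Event fields are illustrative only
    return createZipFileStream(
        bucketName=event['bucket'],
        bucketFilePath=event['prefix'],
        jobKey=event['jobKey'],
        fileExt=event.get('ext', 'txt'),
        createUrl=True
    )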
    

1 Comment

Just saw your answer. I have implemented it in the same way and it works fine for my case
0

The /tmp/ directory is limited to 512MB for AWS Lambda functions.

If you search StackOverflow, you'll see some code from people who have created Zip files on-the-fly without saving files to disk. It becomes pretty complicated.

An alternative would be to attach an EFS filesystem to the Lambda function. It takes a bit of effort to set up, but the cost is practically zero if you delete the files after use, and you'll have plenty of disk space, so your code will be more reliable and easier to maintain.
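
For what it's worth, a minimal sketch of attaching an existing EFS access point to a Lambda function with boto3 (the function name and access-point ARN are placeholders, and the function must already be in a VPC that can reach the file system):

import boto3

lambda_client = boto3.client('lambda')

# Hypothetical function name and access-point ARN
lambda_client.update_function_configuration(
    FunctionName='my-archive-function',
    FileSystemConfigs=[{
        'Arn': 'arn:aws:elasticfilesystem:us-east-1:123456789012:access-point/fsap-EXAMPLE',
        'LocalMountPath': '/mnt/archive'
    }]
)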

1 Comment

thanks for the response. is there a reason why the answer from @Anilkumar S.K is too complicated or insufficient? stackoverflow.com/a/68069842/144088
0
For me, the code below worked in a Glue job: it takes a single .txt file from AWS S3, zips it, and uploads the archive back to S3.
import boto3
import zipfile
from io import BytesIO
import logging

logger = logging.getLogger()

s3_client = boto3.client('s3')
s3_resource = boto3.resource('s3')


# Zip a single S3 object in memory and upload the archive back to S3
def _createZipFileStream(bucketName: str, bucketFilePath: str, bucketfileobject: str, zipKey: str) -> None:
    try:
        obj = s3_resource.Object(bucket_name=bucketName, key=bucketfileobject)
        archive = BytesIO()

        # Write the object's contents into the in-memory zip under the name zipKey
        with zipfile.ZipFile(archive, 'w', zipfile.ZIP_DEFLATED) as zip_archive:
            with zip_archive.open(zipKey, 'w') as file1:
                file1.write(obj.get()['Body'].read())

        archive.seek(0)

        s3_client.upload_fileobj(archive, bucketName, bucketFilePath + '/' + zipKey + '.zip')
        archive.close()

        # If you would like to delete the .txt from S3 after it has been zipped:
        _delete_object(bucket=bucketName, key=bucketfileobject)

    except Exception as e:
        logger.error(f"Failed to zip the txt file for {bucketName}/{bucketfileobject}: {e}")


# Delete an object from AWS S3
def _delete_object(bucket: str, key: str) -> None:
    try:
        logger.info(f"Deleting: {bucket}/{key}")
        s3_client.delete_object(
            Bucket=bucket,
            Key=key
        )
    except Exception as e:
        logger.error(f"Failed to delete {bucket}/{key}: {e}")


# Example call
_createZipFileStream(
    bucketName="My_AWS_S3_bucket_name",
    bucketFilePath="My_txt_object_prefix",
    bucketfileobject="My_txt_Object_prefix + txt_file_name",
    zipKey="My_zip_file_prefix")

1 Comment

Please consider adding some explanation to the source code explaining how it solves the problem.
