
I am trying to move files older than an hour from one S3 bucket to another S3 bucket using a Python boto3 AWS Lambda function, covering the following cases:

  1. Both buckets can be in the same account but different regions.
  2. Both buckets can be in different accounts and different regions.
  3. Both buckets can be in different accounts but the same region.

I got some help to move the files using the Python code below, posted by @John Rotenstein:

import boto3
from datetime import datetime, timedelta

SOURCE_BUCKET = 'bucket-a'
DESTINATION_BUCKET = 'bucket-b'

s3_client = boto3.client('s3')

# Create a reusable Paginator
paginator = s3_client.get_paginator('list_objects_v2')

# Create a PageIterator from the Paginator
page_iterator = paginator.paginate(Bucket=SOURCE_BUCKET)

# Loop through each object, looking for ones older than a given time period
for page in page_iterator:
    for object in page['Contents']:
        if object['LastModified'] < datetime.now().astimezone() - timedelta(hours=1):   # <-- Change time period here
            print(f"Moving {object['Key']}")

            # Copy object
            s3_client.copy_object(
                Bucket=DESTINATION_BUCKET,
                Key=object['Key'],
                CopySource={'Bucket':SOURCE_BUCKET, 'Key':object['Key']}
            )

            # Delete original object
            s3_client.delete_object(Bucket=SOURCE_BUCKET, Key=object['Key'])

How can this be modified to cater to the above requirements?

2 Answers


An alternate approach would be to use Amazon S3 Replication, which can replicate bucket contents:

  • Within the same region, or between regions
  • Within the same AWS Account, or between different Accounts

Replication is frequently used when organizations need another copy of their data in a different region, or simply for backup purposes. For example, critical company information can be replicated to another AWS Account that is not accessible to normal users. This way, if some data was deleted, there is another copy of it elsewhere.

Replication requires versioning to be activated on both the source and destination buckets. If you require encryption, use standard Amazon S3 encryption options. The data will also be encrypted during transit.

You configure a source bucket and a destination bucket, then specify which objects to replicate by providing a prefix or a tag. Objects will only be replicated once Replication is activated. Existing objects will not be copied. Deletion is intentionally not replicated to avoid malicious actions. See: What Does Amazon S3 Replicate?
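For illustration, here is a minimal boto3 sketch of that configuration. The bucket names, the prefix, and the IAM role ARN are placeholders, not values from the question; replication needs an IAM role that S3 can assume to read from the source bucket and write to the destination bucket, and for cross-account setups the destination bucket also needs a bucket policy allowing that role.

import boto3

s3 = boto3.client('s3')

# Versioning must be enabled on the source bucket (and on the destination
# bucket as well, using a client/credentials for its account and region)
s3.put_bucket_versioning(
    Bucket='bucket-a',
    VersioningConfiguration={'Status': 'Enabled'}
)

# Replicate everything under the 'incoming/' prefix from bucket-a to bucket-b
s3.put_bucket_replication(
    Bucket='bucket-a',
    ReplicationConfiguration={
        'Role': 'arn:aws:iam::111111111111:role/s3-replication-role',  # placeholder role ARN
        'Rules': [{
            'ID': 'replicate-incoming',
            'Status': 'Enabled',
            'Priority': 1,
            'Filter': {'Prefix': 'incoming/'},
            'DeleteMarkerReplication': {'Status': 'Disabled'},  # deletions are not replicated
            'Destination': {'Bucket': 'arn:aws:s3:::bucket-b'},
        }]
    }
)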

There is no "additional" cost for S3 Replication itself, but you will still be charged for Data Transfer when replicating objects between regions, and for API requests (which are tiny charges), plus storage of course.


5 Comments

Sounds perfect. Can I increase the replication time to 1 hour from 15 min?
The "replication time" is automatic. In situations where you need additional control over replication time, you can use the Replication Time Control feature. See: S3 Replication Update: Replication SLA, Metrics, and Events | AWS News Blog
I could not find whether I can control the replication frequency. I raised a support request too; they said it is not possible. Could you please help me here?
Replication is automatic and continuous. It is not a frequency, it's more like "it's in the queue, it'll take a few minutes".

Moving between regions

This is a non-issue. You can just copy the object between buckets and Amazon S3 will figure it out.
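As a rough sketch (bucket names and regions here are made up for illustration), the copy is the same copy_object call as in the question; pointing the client at the destination bucket's region is enough, and S3 fetches the source object from its own region. Note that copy_object only handles objects up to 5 GB in a single call; for larger objects the managed transfer method s3_client.copy() performs a multipart copy instead.

import boto3

# Client in the destination bucket's region (region names are placeholders)
s3_client = boto3.client('s3', region_name='eu-west-1')

s3_client.copy_object(
    Bucket='bucket-b',                                        # destination bucket in eu-west-1
    Key='example.txt',
    CopySource={'Bucket': 'bucket-a', 'Key': 'example.txt'}   # source bucket in, say, us-east-1
)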

Moving between accounts

This is a bit harder because the code will use a single set of credentials, which must have ListBucket and GetObject access on the source bucket, plus PutObject rights on the destination bucket.

Also, if credentials from the Source account are being used, then the copy must be performed with ACL='bucket-owner-full-control', otherwise the Destination account won't have access rights to the object. This is not required when the copy is being performed with credentials from the Destination account.

Let's say that the Lambda code is running in Account-A and is copying an object to Account-B. An IAM Role (Role-A) is assigned to the Lambda function. It's pretty easy to give Role-A access to the buckets in Account-A. However, the Lambda function will need permissions to PutObject in the bucket (Bucket-B) in Account-B. Therefore, you'll need to add a bucket policy to Bucket-B that allows Role-A to PutObject into the bucket. This way, Role-A has permission to read from Bucket-A and write to Bucket-B.
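As an illustrative sketch (the account ID, role name, and bucket name are placeholders, not values from the question), the Bucket-B policy could look roughly like this. It must be applied with credentials from Account-B, since only the bucket owner can set its bucket policy.

import json
import boto3

bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "AllowRoleAFromAccountA",
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::111111111111:role/Role-A"},  # placeholder Account-A ID and role name
        "Action": ["s3:PutObject", "s3:PutObjectAcl"],  # PutObjectAcl is commonly granted too when the writer sets an ACL
        "Resource": "arn:aws:s3:::bucket-b/*"
    }]
}

s3_b = boto3.client('s3')  # a client authenticated as Account-B
s3_b.put_bucket_policy(Bucket='bucket-b', Policy=json.dumps(bucket_policy))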

So, putting it all together:

  • Create an IAM Role (Role-A) for the Lambda function
  • Give the role Read/Write access as necessary for buckets in the same account
  • For buckets in other accounts, add a Bucket Policy that grants the necessary access permissions to the IAM Role (Role-A)
  • In the copy_object() command, include ACL='bucket-owner-full-control' (this is the only coding change needed; see the sketch after this list)
  • Don't worry about doing anything special for cross-region; it should just work automatically
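A sketch of that one change, applied to the copy step from the question's code (same variable names as above):

            s3_client.copy_object(
                Bucket=DESTINATION_BUCKET,
                Key=object['Key'],
                CopySource={'Bucket': SOURCE_BUCKET, 'Key': object['Key']},
                ACL='bucket-owner-full-control'   # only needed when copying with Source-account credentials
            )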

14 Comments

Also, I was going through the post stackoverflow.com/questions/43577746/aws-lambda-task-timed-out/…, where you mentioned the timeout value can be 15 min max. But the bucket objects in my case are more than 5 GB, so what would be a better solution here, AWS Fargate?
Hey @John Rotenstein, could you take a look at this query?
If the copy operation takes longer than 15 minutes, then Lambda is not appropriate. Copying between regions would also make the operation take longer. To recommend a method, I would need to know more information: How often do files arrive (or how many per hour or day)? Do you need them to be copied quickly, or can they be copied once per day? How does the program determine where to copy the files? (If it is based on directory, then S3 Replication could do it for you automatically.)
Hey @John Rotenstein, the S3 bucket will have folders containing files, and the contents of these folders should be copied to folders in buckets in another region. So what do you suggest? Would replication fit here, and what are the pros and cons? Files arrive roughly every minute; let's assume 1 file per minute, or up to 1000 files per hour, each around 300 MB. What would be the best solution in that case? Also, with S3 Replication, can I delete the content that has been copied to the other buckets? And would requiring encryption, versioning and copying every 15 minutes incur extra cost?
No idea! You'll have to add some debug statements in that area to see what it is doing. It might be happening when page does not contain an element called Contents. That might happen at the end (which is why things are being moved), but it still shouldn't fail at that point. You could put if 'Contents' in page: before that line to avoid the situation.
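For reference, one way that suggested guard could look in the question's loop (reusing the names from the code in the question):

for page in page_iterator:
    if 'Contents' not in page:   # an empty result page (e.g. an empty bucket) has no 'Contents' key
        continue
    for object in page['Contents']:
        ...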