My use case: thousands of very small files are uploaded to S3 every minute, and every incoming object must be processed and stored in a separate bucket using Lambda. Using s3-object-create as a trigger would cause a large number of Lambda invocations, and concurrency would need to be managed. Instead, I am trying to batch process the newly created objects every 5-10 minutes. S3 provides Batch Operations, but its reports are generated only daily or weekly. Is there a service that can help me?
-
You could ask AWS Support to increase the Lambda concurrency limit if the default of 1000 is not enough. Have you considered that? Alternatively, send the S3 events to an SQS queue and then process the files from the queue at a slower rate. – Marcin, Feb 5, 2021 at 6:51
-
Yes, increasing concurrency is a good option, but it adds cost to the operation. Each file requires at most 2 seconds of processing, so I want to increase the timeout and get the most out of each Lambda invocation. The SQS queue seems more promising. – Tamil Selvan, Feb 5, 2021 at 7:00
1 Answer
According to the AWS documentation, S3 can publish "new object created" events to the following destinations:
- Amazon SNS
- Amazon SQS
- AWS Lambda
In your case I would:
- Create an SQS queue.
- Configure the S3 bucket to publish its new-object events to that queue.
- Reconfigure your existing Lambda to use the SQS queue as its event source.
- Configure batching for the incoming SQS events.
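Once these steps are in place, each Lambda invocation receives a batch of SQS messages, and each message body carries an S3 event notification as JSON text. A minimal sketch of unpacking such a batch is shown here; `extract_s3_objects` and `handler` are hypothetical names, and the actual per-object processing is left out.

```python
import json
from urllib.parse import unquote_plus


def extract_s3_objects(sqs_event):
    """Return (bucket, key) pairs for every S3 object referenced in an SQS batch."""
    objects = []
    for sqs_record in sqs_event.get("Records", []):
        # Each SQS message body is the JSON text of one S3 event notification.
        s3_event = json.loads(sqs_record["body"])
        # S3 may also send an s3:TestEvent message with no "Records" key; skip it.
        for s3_record in s3_event.get("Records", []):
            bucket = s3_record["s3"]["bucket"]["name"]
            # Object keys arrive URL-encoded (for example, spaces become "+").
            key = unquote_plus(s3_record["s3"]["object"]["key"])
            objects.append((bucket, key))
    return objects


def handler(event, context):
    # Hypothetical entry point: the real per-object processing would go here.
    objects = extract_s3_objects(event)
    return {"received": len(objects)}
```

Keeping the parsing in its own function makes it easy to test locally with a hand-built event before deploying.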
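Currently, the maximum batch size for an SQS-Lambda event source mapping is 10,000 messages for a standard queue (10 for a FIFO queue). But since your Lambda needs around 2 seconds to process a single event, you should start with something much smaller; otherwise the Lambda will time out before it can process all of the events. With sequential processing and the 15-minute maximum Lambda timeout, that caps a batch at roughly 450 events.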
Thanks to this, uploading X items to S3 will produce about X / Y Lambda invocations instead of X, where Y is the configured batch size. For 1000 S3 items and a batch size of 100, that is only around 10 Lambda invocations. Batches are not guaranteed to be full, so treat this figure as a lower bound.
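The arithmetic can be checked with a few lines of Python. This is only a rough estimate: it assumes every batch is full and that objects within a batch are processed one after another. The 2-second figure comes from the question, and `estimate_batching` is a hypothetical helper name.

```python
import math


def estimate_batching(num_objects, batch_size, seconds_per_object=2.0, timeout_seconds=900):
    """Estimate invocation count and sequential run time, assuming full batches."""
    invocations = math.ceil(num_objects / batch_size)
    # Worst case: every object in a batch is processed one after another.
    sequential_seconds = batch_size * seconds_per_object
    fits_timeout = sequential_seconds <= timeout_seconds
    return invocations, sequential_seconds, fits_timeout


print(estimate_batching(1000, 100))  # (10, 200.0, True)
```

With a batch size of 1000, the same estimate gives 2000 seconds per invocation, which is well past the 900-second Lambda maximum.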
The AWS documentation mentioned above explains how to publish S3 events to SQS. I won't explain it here, as it's more of an implementation detail.
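For the Lambda side of the setup, the batching settings could be expressed roughly as follows. This is a sketch, not a complete deployment: `build_sqs_mapping_params` is a hypothetical helper, and the queue ARN and function name are placeholders. The keys are the parameter names accepted by boto3's `create_event_source_mapping`. A window of 300 seconds is the longest Lambda allows, which lines up with the 5-minute end of the interval asked about in the question.

```python
def build_sqs_mapping_params(queue_arn, function_name, batch_size=100, window_seconds=300):
    """Build keyword arguments for boto3's Lambda create_event_source_mapping call."""
    if batch_size > 10 and window_seconds < 1:
        # Lambda requires a batching window of at least 1 second for batch sizes above 10.
        raise ValueError("batch sizes above 10 need a batching window of at least 1 second")
    return {
        "EventSourceArn": queue_arn,
        "FunctionName": function_name,
        "BatchSize": batch_size,
        "MaximumBatchingWindowInSeconds": window_seconds,
    }


# Placeholder ARN and function name, for illustration only.
params = build_sqs_mapping_params(
    "arn:aws:sqs:us-east-1:123456789012:incoming-objects",
    "process-objects",
)

# Applying it requires boto3 and AWS credentials, so the call is left commented out:
# import boto3
# boto3.client("lambda").create_event_source_mapping(**params)
```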
Execution time
However, you might run into a problem where the processing is too slow, because the Lambda will probably be processing the events one by one in a loop.
The workaround is to process the events asynchronously. The implementation depends on which runtime you use for Lambda; in Node.js it would be very easy to achieve.
If you want to speed up processing in other ways, reduce the batch size and increase the Lambda memory setting, so that a single execution processes fewer events and has access to more CPU.
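As an illustration in Python rather than Node.js, one way to avoid the one-by-one loop is a thread pool, which suits I/O-bound work such as S3 downloads and uploads. This is a sketch that swaps threads in for async/await; `process_object` is a hypothetical placeholder for the real work, and the worker count is an arbitrary starting point.

```python
from concurrent.futures import ThreadPoolExecutor


def process_object(bucket, key):
    # Hypothetical placeholder for the real work (download, transform, upload).
    return f"{bucket}/{key}"


def process_batch(objects, max_workers=16):
    """Process (bucket, key) pairs concurrently and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order and re-raises the first worker exception.
        return list(pool.map(lambda pair: process_object(*pair), objects))
```

If the 2 seconds per object are mostly spent waiting on the network, 16 workers would bring a batch of 100 down from about 200 seconds to somewhere near 14, though the real gain depends on how much of that time is CPU-bound.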