The use case is that thousands of very small files are uploaded to S3 every minute, and all the incoming objects need to be processed and stored in a separate bucket using Lambda. But using s3-object-create as a trigger will cause a very large number of Lambda invocations, and concurrency needs to be taken care of. I am trying to batch process the newly created objects every 5-10 minutes. S3 provides batch operations, but its reports are only generated daily or weekly. Is there a service available that can help me?

2 Comments
  • You could ask AWS support to increase the Lambda concurrency limit if the default 1000 is not enough. Have you considered that? Alternatively, set an SQS queue as the target for the S3 events, and then process files from SQS at a slower speed. Commented Feb 5, 2021 at 6:51
  • Yes, raising concurrency is a good option, but that adds additional cost to the operations. Each file requires at most 2 seconds of processing, so I want to increase the timeout and make the most of each Lambda invocation. The SQS queue seems more promising. Commented Feb 5, 2021 at 7:00

1 Answer


According to the AWS documentation, S3 can publish "New object created" events to the following destinations:

  • Amazon SNS
  • Amazon SQS
  • AWS Lambda

In your case I would (see the sketch just after these steps):

  1. Create an SQS queue.
  2. Configure the S3 bucket to publish new-object events to the queue.
  3. Reconfigure your existing Lambda to subscribe to the queue.
  4. Configure batching for the incoming SQS events.
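For illustration, here is a minimal AWS CDK (TypeScript) sketch of that wiring; the construct IDs, handler path, and the batch size / batching window / memory values are placeholders and starting points, not the poster's actual setup:

```typescript
import * as cdk from 'aws-cdk-lib';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as s3n from 'aws-cdk-lib/aws-s3-notifications';
import * as sqs from 'aws-cdk-lib/aws-sqs';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import { SqsEventSource } from 'aws-cdk-lib/aws-lambda-event-sources';
import { Construct } from 'constructs';

export class BatchProcessingStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // 1. Queue that buffers the "new object" notifications.
    //    AWS recommends a visibility timeout of at least 6x the Lambda timeout.
    const queue = new sqs.Queue(this, 'NewObjectsQueue', {
      visibilityTimeout: cdk.Duration.minutes(15),
    });

    // 2. Source bucket publishes object-created events to the queue.
    const bucket = new s3.Bucket(this, 'IncomingBucket');
    bucket.addEventNotification(
      s3.EventType.OBJECT_CREATED,
      new s3n.SqsDestination(queue),
    );

    // 3. The processing Lambda (handler path is a placeholder).
    const processor = new lambda.Function(this, 'Processor', {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: 'index.handler',
      code: lambda.Code.fromAsset('lambda'),
      timeout: cdk.Duration.minutes(2),
      memorySize: 1024,
    });

    // 4. Batch the queue messages: up to 100 events or 1 minute, whichever
    //    comes first (batch sizes above 10 require a batching window).
    processor.addEventSource(new SqsEventSource(queue, {
      batchSize: 100,
      maxBatchingWindow: cdk.Duration.minutes(1),
    }));
  }
}
```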

Currently, the maximum batch size for an SQS-Lambda subscription is 1000 events. But since your Lambda needs around 2 seconds to process a single event, you should start with something smaller; otherwise the Lambda will time out because it won't be able to process all of the events in the batch.

Thanks to this, uploading X items to S3 will produce roughly X / Y Lambda invocations, where Y is the configured batch size. For 1000 S3 items and a batch size of 100, it will only trigger around 10 concurrent Lambda executions.

The AWS document mentioned above explains how to publish S3 events to SQS, so I won't repeat it here, as it's mostly implementation detail.

Execution time

However, you might run into a problem where the processing is too slow, because the Lambda will probably be processing events one by one in a loop.

The workaround would be to use asynchronous processing; the implementation depends on which runtime you use for Lambda, but for Node.js it is very easy to achieve.
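As a rough illustration of that idea (not the poster's actual code), a Node.js/TypeScript handler could fan out the records of a batch with Promise.allSettled and report partial failures back to SQS. The destination bucket env var, the copy-as-processing step, and the reportBatchItemFailures setting on the event source mapping are all assumptions here:

```typescript
import { S3Client, CopyObjectCommand } from '@aws-sdk/client-s3';
import type { SQSEvent, SQSBatchResponse } from 'aws-lambda';

const s3 = new S3Client({});
const DEST_BUCKET = process.env.DEST_BUCKET!; // hypothetical env var

export const handler = async (event: SQSEvent): Promise<SQSBatchResponse> => {
  // Process every SQS record in the batch concurrently instead of one by one.
  const results = await Promise.allSettled(
    event.Records.map(async (record) => {
      // Each SQS message body is the S3 event notification JSON.
      const s3Event = JSON.parse(record.body);
      for (const rec of s3Event.Records ?? []) {
        const bucket = rec.s3.bucket.name;
        const key = decodeURIComponent(rec.s3.object.key.replace(/\+/g, ' '));
        // Placeholder "processing": copy the object to the destination bucket.
        await s3.send(new CopyObjectCommand({
          Bucket: DEST_BUCKET,
          CopySource: `${bucket}/${key}`,
          Key: key,
        }));
      }
      return record.messageId;
    }),
  );

  // Report only the failed messages so SQS retries them, not the whole batch
  // (requires reportBatchItemFailures on the event source mapping).
  const batchItemFailures = results
    .map((r, i) => (r.status === 'rejected'
      ? { itemIdentifier: event.Records[i].messageId }
      : null))
    .filter((x): x is { itemIdentifier: string } => x !== null);

  return { batchItemFailures };
};
```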

Also, if you want to speed up the processing in other ways, simply reduce the maximum batch size and increase the Lambda memory configuration, so a single execution processes a smaller number of events and has access to more CPU.


8 Comments

Do you know of a way to do this that also guarantees a complete sweep of the SQS if there are only a few events there? SQS doesn't reliably return a full set of events; it just randomly picks events from different SQS nodes and is rarely a complete sweep. I've been messing with WaitTimeSeconds and MaxNumberOfMessages trying to achieve this, with no success.
@medley56 I would try a FIFO SQS queue, so it won't be randomly picking events from different nodes, if that's the issue. Also, I wonder if maybe you are having an issue with Lambda returning a fail status? If the Lambda execution fails, the events won't be deleted from SQS and they will still be there to retrieve. Another idea is to play around with the batch window by reducing it; then if batching exceeds, for example, 30 seconds, it must forward all of the events left in SQS to Lambda.
Yeah, it's on my list to try a FIFO queue. Is there a way to actually sweep clean a FIFO queue reliably?
Shouldn't it eventually return all of the events from SQS anyway? I mean, normally SQS guarantees at-least-once delivery, so if you need strong consistency, you need to go for FIFO anyway. In general I never had to worry about getting ALL of the messages from SQS; it was just working fine, but I never had any strong consistency requirements.
I've experimented a bit with looping over SQS polling requests (ugly, I know) and eventually yes, I get all the events but it usually takes 2-4 SQS requests to retrieve 5 events.
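For reference, the polling loop described in the last comment might look roughly like the sketch below; the queue URL, exit condition, and processing step are placeholders. Long polling (WaitTimeSeconds > 0) queries all SQS servers, whereas short polling samples only a subset, which is one reason a few receives can come back incomplete:

```typescript
import {
  SQSClient,
  ReceiveMessageCommand,
  DeleteMessageBatchCommand,
} from '@aws-sdk/client-sqs';

const sqs = new SQSClient({});
const QUEUE_URL = process.env.QUEUE_URL!; // placeholder

// Keep long-polling until several consecutive receives come back empty,
// which is a rough (not guaranteed) sign the queue has been swept.
export async function drainQueue(maxEmptyPolls = 3): Promise<void> {
  let emptyPolls = 0;
  while (emptyPolls < maxEmptyPolls) {
    const { Messages } = await sqs.send(new ReceiveMessageCommand({
      QueueUrl: QUEUE_URL,
      MaxNumberOfMessages: 10, // hard SQS limit per receive
      WaitTimeSeconds: 20,     // long polling
    }));

    if (!Messages || Messages.length === 0) {
      emptyPolls += 1;
      continue;
    }
    emptyPolls = 0;

    // ... process Messages here ...

    // Delete processed messages so they are not redelivered.
    await sqs.send(new DeleteMessageBatchCommand({
      QueueUrl: QUEUE_URL,
      Entries: Messages.map((m) => ({
        Id: m.MessageId!,
        ReceiptHandle: m.ReceiptHandle!,
      })),
    }));
  }
}
```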
