Serverless computing and event-driven functions are what it’s all about at the moment. But what happens when an event triggers your function and the process then encounters an error? How do you recover, given that the event has passed and may never happen again? This is a common question when working with AWS’s serverless, event-driven Lambda functions.
Fortunately, AWS lets you define Dead Letter Queues (DLQs) for this very scenario. You can designate either an SQS queue or an SNS topic as a function’s DLQ. When an asynchronous invocation of your Lambda function fails and Lambda’s automatic retries are exhausted, the incoming event message is pushed onto that resource. It arrives with some additional context: the request ID, error code and error message. If it’s SNS, its fan-out nature means you can send out alerts, trigger other services (maybe even a retry of the same function, although watch out for infinite loops), or any combination of these. If it’s SQS, you can persist the message and process it later with another service.
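As a minimal sketch, here’s how a DLQ target could be attached to an existing function from Python with boto3. The `DeadLetterConfig={"TargetArn": ...}` parameter is part of Lambda’s API. The helper function, the function name `my-function` and the queue ARN are hypothetical placeholders.

```python
def dead_letter_config(target_arn: str) -> dict:
    """Build the DeadLetterConfig parameter for Lambda's CreateFunction /
    UpdateFunctionConfiguration calls, rejecting anything that is not an
    SQS queue or SNS topic ARN (the only DLQ targets Lambda accepts)."""
    # ARNs look like arn:partition:service:region:account-id:resource
    parts = target_arn.split(":")
    if len(parts) < 6 or parts[0] != "arn" or parts[2] not in ("sqs", "sns"):
        raise ValueError(
            f"DLQ target must be an SQS queue or SNS topic ARN, got {target_arn!r}"
        )
    return {"TargetArn": target_arn}


# Example usage with boto3 (needs AWS credentials, so it is left commented out).
# "my-function" and the ARN below are hypothetical placeholders:
# import boto3
# boto3.client("lambda").update_function_configuration(
#     FunctionName="my-function",
#     DeadLetterConfig=dead_letter_config(
#         "arn:aws:sqs:eu-west-1:123456789012:my-function-dlq"
#     ),
# )
```

The function’s execution role also needs permission to write to the target: `sqs:SendMessage` for a queue, `sns:Publish` for a topic. Without it, Lambda can’t deliver the failed event.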
So let’s look at both options in a little more detail.
Using SQS as a Dead Letter Queue (DLQ) gives you a durable store for failed events. The queue can be monitored, so the necessary services or individuals can be alerted, and failures can be picked up for resolution at your convenience. This lets you process failures in bulk, wait a defined period before re-triggering the original event, or take some other steps towards resolution.
Because you don’t reprocess the event straight away, you get a little more flexibility around when and how you deal with Lambda failures.
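For the bulk, delayed processing described above, a consumer might pull a batch from the DLQ and only act on messages that have waited long enough. This sketch rests on a few assumptions:

* The queue is the function’s DLQ directly, with no SNS in between.
* Messages were fetched with `sqs.receive_message(QueueUrl=..., MaxNumberOfMessages=10, AttributeNames=["SentTimestamp"], MessageAttributeNames=["All"])`.
* The `FailedEvent` type and the helper names are hypothetical.

```python
import json
import time
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FailedEvent:
    receipt_handle: str           # needed later to delete the message from the queue
    event: dict                   # the original event payload Lambda failed to process
    error_message: Optional[str]  # Lambda's ErrorMessage attribute, if present
    sent_at: float                # when the message reached the DLQ, seconds since epoch


def parse_dlq_message(message: dict) -> FailedEvent:
    """Convert one entry of receive_message()["Messages"] into a FailedEvent."""
    attrs = message.get("MessageAttributes", {})
    error = attrs.get("ErrorMessage", {}).get("StringValue")
    # SentTimestamp is epoch milliseconds as a string; a missing value counts as 0,
    # which makes the message immediately eligible for retry.
    sent_ms = int(message.get("Attributes", {}).get("SentTimestamp", "0"))
    return FailedEvent(
        receipt_handle=message["ReceiptHandle"],
        event=json.loads(message["Body"]),
        error_message=error,
        sent_at=sent_ms / 1000,
    )


def ready_for_retry(
    messages: List[dict], min_age_seconds: float, now: Optional[float] = None
) -> List[FailedEvent]:
    """Keep only failures that have sat in the DLQ for at least min_age_seconds."""
    now = time.time() if now is None else now
    parsed = [parse_dlq_message(m) for m in messages]
    return [f for f in parsed if now - f.sent_at >= min_age_seconds]
```

Messages that aren’t ready are simply not deleted. They reappear once their visibility timeout expires and get checked again on a later run.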
SNS (Simple Notification Service) is a key part of AWS’s event-driven offering, letting you process events almost instantaneously and fan out to multiple subscribers. It’s a great way to integrate applications in a microservices architecture. You can also use an SNS topic as a Dead Letter Queue (DLQ). The benefit is that you can act on a failure immediately: re-processing the message, alerting an individual or process, storing the event message somewhere for follow-up, or any combination of these.
The key to the SNS approach is its flexibility in sending messages to multiple subscribers. It allows you to take some action immediately while also passing the message to other, more suitable systems where it can be picked up and processed.
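To illustrate the immediate-action side, one subscriber on the DLQ topic could be a small Lambda function that turns each failure into an alert. This sketch assumes the topic is the failing function’s DLQ. Each SNS record’s `Message` then holds the original event, and its `MessageAttributes` carry Lambda’s `RequestID`, `ErrorCode` and `ErrorMessage`. The handler only prints the alerts; where they actually get sent (chat, email, paging) is left open, and the function names are hypothetical.

```python
from typing import List


def summarise_failure(record: dict) -> str:
    """Build a one-line alert from a single SNS record in a Lambda event."""
    sns = record["Sns"]
    attrs = sns.get("MessageAttributes", {})

    def value(name: str) -> str:
        # SNS message attributes arrive as {"Type": ..., "Value": ...}
        return attrs.get(name, {}).get("Value", "unknown")

    return (
        f"Lambda failure (request {value('RequestID')}, code {value('ErrorCode')}): "
        f"{value('ErrorMessage')} | event={sns['Message']}"
    )


def handler(event: dict, context) -> List[str]:
    """Hypothetical alerting subscriber: one alert line per failed event."""
    alerts = [summarise_failure(r) for r in event.get("Records", [])]
    for alert in alerts:
        print(alert)  # replace with a real notification channel
    return alerts
```

Other subscribers on the same topic (a queue, a retry function) receive the same message independently, which is what makes the fan-out useful here.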
A pattern that works rather well, and offers the best of both worlds, is to combine SNS and SQS as in the diagram above. Define an SNS topic as the DLQ and subscribe an SQS queue to that topic. You then have a durable store in the SQS queue while still being able to take instant action. The only caveat is that if you re-attempt processing the message and this time it succeeds, you need some way to tell the SQS side so that the message can be removed from the queue.
Not perfect by any stretch, but it gives a little of the benefit of both.
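One possible way to handle that caveat is sketched below; it’s an assumption rather than a standard recipe. Whichever subscriber retries the event immediately records the Lambda `RequestID` once it succeeds. The SQS consumer then checks that record before doing anything. Unless raw message delivery is enabled, the SQS subscription wraps each message in an SNS JSON envelope, so the consumer first unwraps it. The `already_handled`, `process` and `delete` callables are hypothetical hooks, passed in so the logic can be exercised without AWS.

```python
import json
from typing import Callable, Dict, Tuple


def unwrap_sns_envelope(sqs_body: str) -> Tuple[dict, Dict[str, str]]:
    """Extract the original Lambda event and its attributes from an SQS body
    delivered by an SNS subscription (raw message delivery disabled)."""
    envelope = json.loads(sqs_body)
    event = json.loads(envelope["Message"])
    attrs = {k: v.get("Value", "") for k, v in envelope.get("MessageAttributes", {}).items()}
    return event, attrs


def drain_message(
    message: dict,
    already_handled: Callable[[str], bool],
    process: Callable[[dict], None],
    delete: Callable[[str], None],
) -> str:
    """Handle one queued failure and report what happened to it."""
    event, attrs = unwrap_sns_envelope(message["Body"])
    request_id = attrs.get("RequestID", "")
    if request_id and already_handled(request_id):
        # The immediate retry already succeeded: just clear the durable copy.
        delete(message["ReceiptHandle"])
        return "skipped"
    try:
        process(event)
    except Exception:
        # Leave it in the queue; it becomes visible again after the visibility timeout.
        return "retained"
    delete(message["ReceiptHandle"])
    return "reprocessed"
```

In a real consumer, `delete` might be `lambda handle: sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=handle)`. `already_handled` could be backed by any store both sides can reach, such as a DynamoDB table keyed by request ID.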
There are a huge number of different patterns (and anti-patterns) out there for implementing SQS and SNS, as well as for Lambda and event-driven architectures in general. The approaches above are just basic representations that work well in certain scenarios. I’d be really interested to hear from others who have worked with serverless/event-driven systems on AWS. What are your opinions, and what patterns have you found to be a good way of managing DLQs?
Please leave your comments and thoughts below!