Today I learned that if a function fails to process a message from an Amazon Simple Queue Service (SQS) queue, the default behaviour is to retry the same message over and over again. The retries come at quite a pace, and each attempt to reprocess the message racks up cost.
One way to handle this is to shorten the message expiry (the retention period, which defaults to 4 days). Another is to set up a dead-letter queue. This takes messages that have failed processing a configurable number of times and moves them to a separate queue, removing them from the normal data flow. It is possible to set up alerts or monitoring on the dead-letter queue.
I had originally configured my BuddyBot functions to return success even if they were unable to process a message, in the hope that I could avoid handling errors properly. This turned out to be a mistake: for a brief period yesterday I deployed macOS binaries instead of Linux binaries. Every attempt to process a message failed, and a retry storm built up fairly rapidly. What I should have done was configure the queue to handle failure scenarios in a way appropriate for the context.
I made the following changes:
- messages now time out after 10 minutes
- failed messages will be sent to the dead-letter queue after 2 failed attempts
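
As a rough illustration, the two changes above could be expressed as SQS queue attributes. This is only a sketch, not my actual deployment config: it assumes "time out" means the message retention period (`MessageRetentionPeriod`, in seconds) rather than the visibility timeout, and the helper name `build_queue_attributes` and the ARN are made up for the example. `RedrivePolicy`, `deadLetterTargetArn` and `maxReceiveCount` are the attribute names SQS itself uses.

```python
import json


def build_queue_attributes(dead_letter_queue_arn, retention_seconds=600,
                           max_receive_count=2):
    """Build the Attributes mapping for an SQS SetQueueAttributes call.

    Hypothetical helper: the defaults mirror the two changes listed above,
    assuming the 10 minute timeout refers to message retention.
    """
    # SQS accepts a retention period between 60 seconds and 14 days.
    if not 60 <= retention_seconds <= 1_209_600:
        raise ValueError("retention_seconds must be between 60 and 1209600")
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")

    # The redrive policy is passed to SQS as a JSON-encoded string.
    redrive_policy = {
        "deadLetterTargetArn": dead_letter_queue_arn,
        "maxReceiveCount": max_receive_count,
    }
    return {
        "MessageRetentionPeriod": str(retention_seconds),
        "RedrivePolicy": json.dumps(redrive_policy),
    }


if __name__ == "__main__":
    # Placeholder ARN for illustration only.
    attrs = build_queue_attributes(
        "arn:aws:sqs:eu-west-1:123456789012:example-dead-letter-queue"
    )
    print(json.dumps(attrs, indent=2))
    # With boto3 installed and credentials configured, the mapping could be
    # applied like this (queue_url is a placeholder, not run here):
    #   boto3.client("sqs").set_queue_attributes(QueueUrl=queue_url, Attributes=attrs)
```

Keeping the attribute building separate from the AWS call means the values can be checked without touching a real queue.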
I’ve pushed the changes through to production, so hopefully I won’t be seeing any more infinite retries. An interesting reminder of the essentials of using queues in a system.