10 points by yakirbitan 2 days ago | 6 comments
  • ciaovietnam 2 days ago
    I'm not familiar with RabbitMQ but this is how I built a queue for LiteGUI using Redis.

    First, every job gets stored in an activity table with retry and result columns.

    If the job can run immediately without queueing, no queue entry is needed: the retry column stays at 0 and the result column is filled in.

    If the job fails for whatever reason, a queue entry containing the activity ID is added so the queue worker can process it later. When the worker processes the job, the retry column is incremented regardless of the outcome. If the worker succeeds, the activity is updated with the result.

    If the worker fails to process it and the retry number is less than 5, another queue entry is added (the current one has already been removed from the queue). The delay before that entry is processed depends on the retry number; the intervals work out to roughly 1m, 5m, 25m, 2.1h, 10.4h, and 2.2 days using the following formula:

    $interval = 300 * pow(5, ($retry - 1)); // delay in seconds

    This approach also helps in case of queue failure: you can rebuild the queue from the activity table using entries with no result (or status) and a retry number less than 5.
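
    A minimal PHP sketch of that flow, assuming phpredis and PDO; the helper runJob, the table/column names, and the jobs:delayed sorted set are illustrative, not a fixed design:

      // Process one job, update its activity row, and requeue on failure.
      function processJob(Redis $redis, PDO $db, int $activityId): void
      {
          $stmt = $db->prepare('SELECT retry FROM activity WHERE id = ?');
          $stmt->execute([$activityId]);
          $retry = (int) $stmt->fetchColumn() + 1; // incremented regardless of outcome

          try {
              $result = runJob($activityId); // hypothetical: the actual work
              $db->prepare('UPDATE activity SET retry = ?, result = ? WHERE id = ?')
                 ->execute([$retry, $result, $activityId]);
          } catch (Exception $e) {
              $db->prepare('UPDATE activity SET retry = ? WHERE id = ?')
                 ->execute([$retry, $activityId]);
              if ($retry < 5) {
                  $interval = 300 * pow(5, $retry - 1); // the backoff formula above
                  // Delayed queue: sorted set scored by "process after" timestamp.
                  $redis->zAdd('jobs:delayed', time() + $interval, (string) $activityId);
              }
          }
      }

      // A poller moves due entries back onto the live queue:
      foreach ($redis->zRangeByScore('jobs:delayed', 0, time()) as $id) {
          $redis->lPush('jobs:ready', $id);
          $redis->zRem('jobs:delayed', $id);
      }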

    To be honest, I don't work with queues regularly, but I had to implement one anyway. I'm sharing this approach so we can all improve it.

  • 4ndrewl 2 days ago
    Proposed solution 1 seems sensible; using acknowledgements and a DLQ is a fairly common pattern. You might also want to monitor the size of the DLQ and, if it reaches a certain limit, stop processing altogether. You can also alert based on the size of the queue.
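
    As a sketch with php-amqplib (queue name, threshold, and notifyOps are illustrative), a passive queue_declare returns the current message count, which can drive both the alert and the circuit-breaker:

      use PhpAmqpLib\Connection\AMQPStreamConnection;

      $connection = new AMQPStreamConnection('localhost', 5672, 'guest', 'guest');
      $channel = $connection->channel();

      // Passive declare doesn't create the queue; it returns (name, messages, consumers).
      [, $dlqDepth] = $channel->queue_declare('jobs.dlq', true);

      if ($dlqDepth > 1000) {                  // illustrative threshold
          notifyOps("DLQ depth is $dlqDepth"); // hypothetical alerting hook
          // ...and stop or pause the main-queue consumers here.
      }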

    Don't forget a mechanism to redrive messages back from the DLQ, and consider whether ordering is important (it might be if you're using a FIFO queue, but I'm unsure whether Rabbit supports that).
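
    A redrive can be as simple as draining the DLQ back onto the work queue (sketch, reusing the channel from above; note the redriven messages land behind anything already queued):

      // Pull each message off the DLQ and republish it to the main queue.
      while ($msg = $channel->basic_get('jobs.dlq')) {
          $channel->basic_publish($msg, '', 'jobs');   // default exchange routes by queue name
          $channel->basic_ack($msg->getDeliveryTag()); // drop from DLQ only after republish
      }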

  • kbouck 2 days ago
    Proposed Solution 1 is preferable in that it accounts for DB outages, slowness, and worker crashes, and you describe additional safety mechanisms to prevent the queue from becoming blocked by poison/invalid messages.

    Proposed Solution 2, without ACKs, would be vulnerable to message loss if a worker crashed before successfully processing a message.
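
    For reference, the manual-ack pattern in php-amqplib looks roughly like this (assuming an open channel; handleJob is hypothetical). With no_ack=false, the broker redelivers anything unacked if the worker dies:

      $callback = function ($msg) use ($channel) {
          handleJob($msg->getBody());                  // hypothetical processing
          $channel->basic_ack($msg->getDeliveryTag()); // ack only after success
      };
      // Args: queue, consumer_tag, no_local, no_ack, exclusive, nowait, callback
      $channel->basic_consume('jobs', '', false, false, false, false, $callback);
      while ($channel->is_consuming()) {
          $channel->wait();
      }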

    • gregw2 2 days ago
      This, I think. I also didn't see how your solution 2 recovers from worker crashes, although I was sympathetic to distributing the retry logic to the workers.
  • gregw2 2 days ago
    Not your original question, but you should add some random jitter to your exponential-backoff-like delay intervals.
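
    For example, applying "full jitter" to the formula from the first comment (a sketch, not the only way to jitter):

      $base = 300 * pow(5, $retry - 1);
      $interval = random_int(0, (int) $base); // full jitter: uniform over [0, base]
      // Or keep most of the delay and randomize only a slice of it:
      $interval = (int) ($base * 0.8) + random_int(0, (int) ($base * 0.4));
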
  • akarl818 2 days ago
    In solution 1, the consumer should cancel its consumption until the database is back up.
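
    A sketch of that with php-amqplib (assuming an open channel; dbIsUp is a hypothetical health check): nack the message back onto the queue, cancel the consumer, and re-subscribe once the database responds:

      try {
          handleJob($msg); // hypothetical work that hits the database
          $channel->basic_ack($msg->getDeliveryTag());
      } catch (PDOException $e) {
          $channel->basic_nack($msg->getDeliveryTag(), false, true); // requeue
          $channel->basic_cancel($consumerTag); // the tag passed to basic_consume
          while (!dbIsUp()) {                   // hypothetical health check
              sleep(5);
          }
          // re-issue basic_consume here once the database is reachable
      }
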
  • quintes 2 days ago
    Solution 1's DLQ lets you retry, but what's the reason for the database failure in the first place?