Ask HN: How do you monitor and retry failed webhooks in production?

2 points by GoatPerfect 4 hours ago | 3 comments

blundergoat 4 hours ago
We treat webhooks as at-least-once delivery over an unreliable transport and design for duplicates and out-of-order events.
A few rules that have saved us:
- Persist before responding. Never process inline. Write payload to DB, return 200 fast.
- Idempotency key required. Either use the provider's event ID or hash the payload.
- Async worker processes from queue. Exponential backoff + max attempts.
- Dead letter queue + dashboard. Humans need visibility.
- Alert on backlog growth, not single failures. One failure is noise. A growing retry queue is signal.
- Relying on provider retries alone has bitten us more than once.
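A minimal sketch of the persist-first and idempotency rules above, assuming SQLite as the store. The table name `webhook_inbox`, the helper names, and the backoff parameters (base 2 s, cap 1 hour, full jitter) are illustrative assumptions, not anything the list prescribes.

```python
import hashlib
import random
import sqlite3


def idempotency_key(event_id, raw_body: bytes) -> str:
    """Prefer the provider's event ID; otherwise hash the raw payload."""
    if event_id:
        return f"id:{event_id}"
    return "sha256:" + hashlib.sha256(raw_body).hexdigest()


def open_inbox(path=":memory:"):
    """Create the (hypothetical) inbox table keyed by idempotency key."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS webhook_inbox (
               idem_key TEXT PRIMARY KEY,
               body     BLOB NOT NULL,
               status   TEXT NOT NULL DEFAULT 'pending',
               attempts INTEGER NOT NULL DEFAULT 0
           )"""
    )
    return conn


def persist(conn, event_id, raw_body: bytes) -> bool:
    """Store the payload; return False if this delivery is a duplicate.

    The HTTP handler would call this and return 200 either way,
    leaving the actual processing to an async worker.
    """
    key = idempotency_key(event_id, raw_body)
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO webhook_inbox (idem_key, body) VALUES (?, ?)",
            (key, raw_body),
        )
    return cur.rowcount == 1


def backoff_seconds(attempt: int, base: float = 2.0, cap: float = 3600.0) -> float:
    """Exponential backoff with full jitter; attempt counts from 1."""
    return random.uniform(0, min(cap, base * 2 ** (attempt - 1)))


def backlog_size(conn) -> int:
    """Number of pending events; alert on its growth, not on single failures."""
    row = conn.execute(
        "SELECT COUNT(*) FROM webhook_inbox WHERE status = 'pending'"
    ).fetchone()
    return row[0]
```

The primary key on `idem_key` is what makes duplicates harmless here: a redelivered event is silently ignored at insert time, so the worker never sees it twice.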
- GoatPerfect 3 hours ago
  Thank you so much for the tips! I was feeling nervous about relying on provider retries as well. I especially like the idea of alerting on backlog growth. There's nothing I hate more than a bunch of emails and notifications!
  - chickensong 3 hours ago
    This was a nice goat exchange
JacobArthurs 3 hours ago
We receive the webhook, return 200 immediately, and push the payload to a message queue for processing. That way you own the retry logic, can inspect stuck messages, and DLQ alerts handle repeated failures automatically.
Idempotency becomes your responsibility, though, since messages can be delivered more than once.
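A rough sketch of what owning the retry logic can look like on the consumer side. The in-memory `deque`, the `WebhookWorker` class, and the limit of 3 attempts are stand-ins I am assuming for illustration; a real setup would use the broker's own redelivery delay and DLQ features.

```python
from collections import deque

MAX_ATTEMPTS = 3  # assumed limit; tune for your workload


class WebhookWorker:
    """Toy consumer: dedupes by key, retries failures, dead-letters the rest."""

    def __init__(self, handler, max_attempts=MAX_ATTEMPTS):
        self.handler = handler
        self.max_attempts = max_attempts
        self.queue = deque()      # stand-in for a real message queue
        self.dead_letters = []    # stand-in for a DLQ
        self.processed = set()    # idempotency keys already handled

    def enqueue(self, key, payload):
        self.queue.append({"key": key, "payload": payload, "attempts": 0})

    def drain(self):
        """Process until the queue is empty."""
        while self.queue:
            msg = self.queue.popleft()
            if msg["key"] in self.processed:
                continue  # duplicate delivery, already handled
            try:
                self.handler(msg["payload"])
            except Exception as exc:
                msg["attempts"] += 1
                msg["last_error"] = repr(exc)
                if msg["attempts"] >= self.max_attempts:
                    self.dead_letters.append(msg)  # keep it inspectable
                else:
                    # A real system would requeue with a backoff delay.
                    self.queue.append(msg)
            else:
                self.processed.add(msg["key"])
```

In production the `processed` set would need to live in durable storage shared by all workers, otherwise a restart or a second consumer reintroduces duplicates.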
toomuchtodo 4 hours ago
Have you checked out https://svix.com? No affiliation, I just like the product. Might also check out https://www.standardwebhooks.com/
- GoatPerfect 3 hours ago
  I just checked them out! Looks like it would make handling failures a breeze!