A few rules that have saved us:
- Persist before responding. Never process inline. Write payload to DB, return 200 fast.
- Idempotency key required. Use either the provider's event ID or a hash of the payload.
- Async worker processes from the queue, with exponential backoff + max attempts.
- Dead letter queue + dashboard. Humans need visibility.
- Alert on backlog growth, not single failures. One failure is noise. A growing retry queue is signal.
- Don't rely on provider retries alone. That has bitten us more than once.
Idempotency becomes your responsibility, though, since messages can be delivered more than once.
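
For anyone who wants to see the shape of this, here is a rough sketch of the rules above in Python, using only the standard library. Everything named here is an assumption for illustration, not our production code: the `webhook_events` table, the `receive` and `process_once` functions, the `MAX_ATTEMPTS` and `BASE_DELAY_SECONDS` values, and the use of an in-memory SQLite database as a stand-in for both the DB and the queue.

```python
import hashlib
import json
import sqlite3
from typing import Callable, Optional

MAX_ATTEMPTS = 5          # assumed cap; tune for your provider
BASE_DELAY_SECONDS = 2.0  # assumed base delay for exponential backoff


def open_store() -> sqlite3.Connection:
    """In-memory stand-in for the real database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE webhook_events (
               idempotency_key TEXT PRIMARY KEY,
               payload         TEXT NOT NULL,
               status          TEXT NOT NULL DEFAULT 'pending',
               attempts        INTEGER NOT NULL DEFAULT 0
           )"""
    )
    return conn


def idempotency_key(payload: bytes, event_id: Optional[str] = None) -> str:
    """Use the provider's event ID if present, else hash the raw payload."""
    return event_id if event_id else hashlib.sha256(payload).hexdigest()


def receive(conn: sqlite3.Connection, payload: bytes,
            event_id: Optional[str] = None) -> int:
    """Persist the raw payload and return 200. No processing happens here."""
    key = idempotency_key(payload, event_id)
    # The primary key plus INSERT OR IGNORE makes a redelivery a no-op.
    conn.execute(
        "INSERT OR IGNORE INTO webhook_events (idempotency_key, payload) "
        "VALUES (?, ?)",
        (key, payload.decode("utf-8")),
    )
    conn.commit()
    return 200


def backoff_delay(attempt: int) -> float:
    """Seconds to wait before retrying after the given 1-based attempt."""
    return BASE_DELAY_SECONDS * 2 ** (attempt - 1)


def process_once(conn: sqlite3.Connection, key: str,
                 handler: Callable[[dict], None]) -> str:
    """Run one attempt for one stored event and return its new status."""
    payload, attempts, status = conn.execute(
        "SELECT payload, attempts, status FROM webhook_events "
        "WHERE idempotency_key = ?",
        (key,),
    ).fetchone()
    if status in ("done", "dead"):
        return status  # already finished or dead-lettered; do nothing
    attempts += 1
    try:
        handler(json.loads(payload))
        status = "done"
    except Exception:
        # Stay pending for another retry until the attempt cap is hit,
        # then mark as dead so a human can look at it.
        status = "dead" if attempts >= MAX_ATTEMPTS else "pending"
    conn.execute(
        "UPDATE webhook_events SET status = ?, attempts = ? "
        "WHERE idempotency_key = ?",
        (status, attempts, key),
    )
    conn.commit()
    return status


def backlog_size(conn: sqlite3.Connection) -> int:
    """Count of pending events; alert on growth of this, not on one failure."""
    return conn.execute(
        "SELECT COUNT(*) FROM webhook_events WHERE status = 'pending'"
    ).fetchone()[0]
```

In a real setup the worker loop would sleep for `backoff_delay(attempts)` between retries (or schedule the next attempt at that time), and rows with status `dead` are what the dead letter dashboard would list.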