Requests would partially stream, providers would throttle or fail mid-stream, and retry logic ended up scattered across background jobs, webhooks, and request handlers.
I built ModelRiver as a thin API layer that sits between an app and AI providers and centralizes streaming, retries, failover, and request-level debugging in one place.
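To make "centralizes streaming, retries, failover" concrete, here's a rough sketch of the kind of loop that logic collapses into. Everything below (the provider shape, the retry policy, the helper names) is illustrative, not ModelRiver's actual implementation:

```typescript
// Illustrative sketch only: provider shape, endpoints, and retry policy are
// made up to show the general pattern, not ModelRiver's real code.

interface Provider {
  name: string;
  url: string;
  apiKey: string;
}

// Try providers in order. Transient errors before any output are retried with
// backoff; a 429 skips straight to the next provider. Once chunks have reached
// the caller, the error is surfaced instead of silently restarting the stream.
async function* streamWithFailover(
  providers: Provider[],
  payload: unknown,
  maxRetriesPerProvider = 2,
): AsyncGenerator<string> {
  let lastError: unknown;

  for (const provider of providers) {
    for (let attempt = 0; attempt <= maxRetriesPerProvider; attempt++) {
      let emitted = false;
      try {
        const res = await fetch(provider.url, {
          method: "POST",
          headers: {
            "content-type": "application/json",
            authorization: `Bearer ${provider.apiKey}`,
          },
          body: JSON.stringify(payload),
        });

        if (res.status === 429) break; // throttled: fail over to next provider
        if (!res.ok || !res.body) {
          throw new Error(`${provider.name} returned HTTP ${res.status}`);
        }

        const reader = res.body.getReader();
        const decoder = new TextDecoder();
        for (;;) {
          const { done, value } = await reader.read();
          if (done) return; // stream finished cleanly
          emitted = true;
          yield decoder.decode(value, { stream: true });
        }
      } catch (err) {
        if (emitted) throw err; // partial output already sent to the caller
        lastError = err;
        await new Promise((r) => setTimeout(r, 250 * 2 ** attempt));
      }
    }
  }
  throw new Error(`all providers failed, last error: ${String(lastError)}`);
}
```

The point is that this loop lives in one place instead of being re-implemented in every background job, webhook, and request handler.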
It’s early and opinionated, and there are tradeoffs. Happy to answer technical questions or hear how others are handling streaming reliability in production AI apps.
One pattern I've found useful: having a read-only view of what's actually hitting the wire before any retry logic kicks in. When you can see the raw request/response as it happens, you can tell whether the issue is your payload, the provider throttling, or something in between.
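For illustration, the shape of that idea in a few lines of TypeScript (this is not toran.sh, and the names are made up; it's just the "record the wire before anything else touches it" pattern):

```typescript
// Minimal sketch of capturing the raw request/response before retries run.
// All names here are illustrative.

interface WireRecord {
  url: string;
  requestBody: string;
  status?: number;
  responseBody?: string;
  error?: string;
  startedAt: string;
}

// Wraps fetch so the raw outbound request and the raw response are captured
// exactly as sent/received, before any retry or parsing layer touches them.
async function observedFetch(
  url: string,
  init: RequestInit,
  sink: (rec: WireRecord) => void,
): Promise<Response> {
  const rec: WireRecord = {
    url,
    requestBody: typeof init.body === "string" ? init.body : "<non-string body>",
    startedAt: new Date().toISOString(),
  };
  try {
    const res = await fetch(url, init);
    rec.status = res.status;
    // Read a clone in the background so the caller can still consume the
    // original response as a stream; record whatever arrived, even if the
    // stream dies midway.
    res
      .clone()
      .text()
      .then(
        (body) => sink({ ...rec, responseBody: body }),
        (err) => sink({ ...rec, error: String(err) }),
      );
    return res;
  } catch (err) {
    sink({ ...rec, error: String(err) });
    throw err;
  }
}
```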
We built toran.sh for this: it's a transparent proxy that shows exactly what goes out and comes back in real time. It sits at a different layer than what you're doing (you handle the orchestration, we just show the traffic), but the two complement each other.
Curious how you handle visibility into what's actually being sent during partial stream failures?
ModelRiver already has this covered via request logs. Every request log captures the full lifecycle: the exact payload sent to the provider, streaming chunks as they arrive, partial responses, errors, retries, and the final outcome. Even if a stream fails midway, you can still inspect what was sent and what came back before the failure.
So you can tell whether the issue is payload shape, provider throttling, or a mid-stream failure before any retry or failover logic kicks in. That wire-level visibility is core to how we approach debugging async AI requests.
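Roughly, the shape of what a log entry captures (simplified sketch; the field names here are illustrative, not the exact schema):

```typescript
// Simplified illustration of the lifecycle a request log covers.
// Field names are illustrative, not ModelRiver's exact schema.

interface RequestLog {
  id: string;
  provider: string;                        // which upstream handled (or failed) the call
  sentPayload: unknown;                    // exact body sent to the provider
  chunks: { at: string; data: string }[];  // streaming chunks as they arrived
  partialResponse?: string;                // whatever was assembled before a failure
  error?: { at: string; message: string };
  retries: { attempt: number; reason: string; provider: string }[];
  outcome: "completed" | "failed" | "failed_over";
}
```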