Agent checkpointing is far from production-grade resiliency(www.restate.dev)

3 pointsby gvdongen4 hours ago2 comments

gvdongen4 hours ago
As agents run longer and spend more money, many agent frameworks are adding resiliency features like checkpoint recovery and pause-resume approvals.
But to get your agent to production, checkpointing is not enough. There is quite a big gap left for you to handle: failure detection, automatic retries, high availability, scale-out, idempotency, concurrency, session coordination, versioning, ...
I wrote a blog post on what’s left to solve, and how to solve it.
TL;DR Instead of tying resiliency together with your agent framework, agents should be built on top of a highly-available orchestration layer that owns the end-to-end execution, guarantees it completes, and handles all of the points above.
Optionally, agent frameworks can be used on top of this to help abstracting away the agent loop.
Is this also how you see it and productionize your agents?
yubudong3 hours ago
[flagged]