But checkpointing alone is not enough to get your agent to production. It leaves a big gap for you to handle yourself: failure detection, automatic retries, high availability, scale-out, idempotency, concurrency, session coordination, versioning, ...
I wrote a blog post on what’s left to solve, and how to solve it.
TL;DR: Instead of coupling resiliency to your agent framework, build agents on top of a highly available orchestration layer that owns the end-to-end execution, guarantees it runs to completion, and handles all of the points above.
Optionally, an agent framework can be used on top of this to help abstract away the agent loop.
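To make that split concrete, here is a minimal Python sketch of what "the orchestration layer owns the execution" could look like. Everything in it is hypothetical and for illustration only: `Journal`, `Orchestrator`, `step`, and `run_agent` are made-up names, the in-memory dict stands in for a durable, replicated store, and the retry loop has no backoff. It is not the API of any particular product, just the idea: every step is journaled, failed steps are retried, and a restarted run replays recorded results instead of calling the LLM or tool again.

```python
import json
from typing import Any, Callable, Dict, Optional


class Journal:
    """Hypothetical durable log. A dict stands in for a real persistent store."""

    def __init__(self) -> None:
        self._entries: Dict[str, str] = {}

    def get(self, key: str) -> Optional[dict]:
        raw = self._entries.get(key)
        return None if raw is None else json.loads(raw)

    def put(self, key: str, value: dict) -> None:
        self._entries[key] = json.dumps(value)


class Orchestrator:
    """Owns execution: journals each step, retries failures, replays on restart."""

    def __init__(self, journal: Journal, max_attempts: int = 3) -> None:
        self.journal = journal
        self.max_attempts = max_attempts

    def step(self, run_id: str, name: str, fn: Callable[[], Any]) -> Any:
        key = f"{run_id}/{name}"
        recorded = self.journal.get(key)
        if recorded is not None:
            # Step already completed in an earlier attempt: replay, do not re-run.
            return recorded["result"]
        last_error: Optional[Exception] = None
        for _ in range(self.max_attempts):
            try:
                result = fn()
            except Exception as exc:  # a real system would classify errors and back off
                last_error = exc
                continue
            self.journal.put(key, {"result": result})
            return result
        raise RuntimeError(
            f"step {name!r} failed after {self.max_attempts} attempts"
        ) from last_error


def run_agent(orc: Orchestrator, run_id: str, call_llm, call_tool) -> str:
    """The agent logic itself stays plain code; resiliency lives in `orc.step`."""
    plan = orc.step(run_id, "plan", call_llm)
    return orc.step(run_id, "tool", lambda: call_tool(plan))


# Demo with stand-ins for the LLM and a tool that fails once.
calls = {"llm": 0, "tool": 0}


def fake_llm() -> str:
    calls["llm"] += 1
    return "look up order 42"


def flaky_tool(plan: str) -> str:
    calls["tool"] += 1
    if calls["tool"] == 1:
        raise TimeoutError("simulated transient failure")
    return f"done: {plan}"


journal = Journal()
first = run_agent(Orchestrator(journal), "run-1", fake_llm, flaky_tool)
# Simulate a crash and restart: a new orchestrator instance, the same journal.
second = run_agent(Orchestrator(journal), "run-1", fake_llm, flaky_tool)
```

In this sketch the second run returns the same result without touching the LLM or the tool again, because both steps are already in the journal. An agent framework could replace `run_agent` without changing where the retries and replay live.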
Is this also how you see it, and how you productionize your agents?