Checkpointing conversation state + sandbox filesystem mid-run so agents can resume on failure will be key for operating at scale. And a unified dashboard across all running agents is the goal once the core scheduling and execution layer is solid.
I appreciate the feedback!