2 points by stsffap 7 hours ago | 2 comments

stsffap 7 hours ago
Hey HN, I work on Restate.
We kept seeing the same problem with AI agents in production: you can observe them (traces, logs, dashboards), but when something goes wrong, your options are basically to restart the process and lose all progress, or to wait and hope.
We built Restate as a durable execution engine, and it turns out the primitives it provides (journaling every step, giving each execution a stable ID) map really well onto the control problem for agents.
This post walks through concrete scenarios: cancelling hundreds of agents stuck retrying a dead endpoint, pausing agents during DB maintenance instead of letting them burn through retries, and restarting a failed three-hour workflow from the exact step that failed (without redoing the expensive work before it).
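To make the "restart from the exact step that failed" idea concrete, here is a minimal sketch of step journaling in plain Python. It is not Restate's API: the `Journal` class, the step names, and the JSON-file storage are all hypothetical, picked only to show how recording each step's result under a stable execution ID lets a retry skip work that already completed.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable


class Journal:
    """Hypothetical step journal keyed by a stable execution ID (not Restate's API)."""

    def __init__(self, directory: Path, execution_id: str) -> None:
        self.path = directory / f"{execution_id}.json"
        # Load previously recorded step results, if this execution ran before.
        self.entries: dict[str, Any] = (
            json.loads(self.path.read_text()) if self.path.exists() else {}
        )

    def run(self, step_name: str, fn: Callable[[], Any]) -> Any:
        # Replay: a step that already completed returns its recorded result.
        if step_name in self.entries:
            return self.entries[step_name]
        result = fn()  # may raise; a failed step records nothing
        self.entries[step_name] = result  # assumes results are JSON-serializable
        self.path.write_text(json.dumps(self.entries))
        return result


def workflow(journal: Journal, calls: list, endpoint_up: bool) -> str:
    def research() -> str:
        calls.append("research")  # stands in for the expensive early work
        return "notes"

    def publish() -> str:
        calls.append("publish")
        if not endpoint_up:
            raise RuntimeError("endpoint down")
        return "published"

    journal.run("research", research)
    return journal.run("publish", publish)


def demo() -> list:
    calls: list = []
    with tempfile.TemporaryDirectory() as tmp:
        try:
            workflow(Journal(Path(tmp), "exec-1"), calls, endpoint_up=False)
        except RuntimeError:
            pass  # first attempt fails at the "publish" step
        # Retry under the same execution ID: "research" is replayed, not re-run.
        workflow(Journal(Path(tmp), "exec-1"), calls, endpoint_up=True)
    return calls
```

Calling `demo()` runs the workflow twice under the same ID; the second attempt only re-executes the step that failed, because the earlier step's result comes back from the journal.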
Curious what control problems others have hit with agents in production. Happy to answer questions.
- verdverm 7 hours ago
  Any framework worth its muster is already handling this. I use ADK.
  - stsffap 5 hours ago
    ADK is a great framework for building agents: it is runtime agnostic, so you can choose which properties are important to you. You can also run your ADK agents on Restate (https://google.github.io/adk-docs/integrations/restate) if you want to turn them into durable agents that can reliably communicate with other agents.
BlueHotDog2 7 hours ago
this is cool! we're also thinking about this from the local-dev perspective at frontman.sh! love the animations on the blog