1 point by skhatter 9 hours ago | 1 comment
  • skhatter 9 hours ago
    I've been experimenting with AI agents and multi-step workflows recently and ran into a problem that reminded me a lot of early distributed systems.

    Once agents start calling tools, APIs, and other agents in a chain, debugging failures becomes surprisingly hard. A single task can involve multiple steps—LLM calls, tool invocations, retries—and when something breaks it's often difficult to understand exactly what happened or where the failure originated.

    In traditional distributed systems we eventually built things like tracing, circuit breakers, retry policies, SLOs, and other reliability primitives to operate systems safely in production.
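For illustration only (my sketch, not anything from the thread): two of those primitives translate fairly directly to agent tool calls. The `CircuitBreaker` and `retry` names below are hypothetical, standing in for whatever a real framework would provide:

```python
import time


class CircuitOpenError(Exception):
    pass


class CircuitBreaker:
    """Trips after `threshold` consecutive failures; rejects calls while open."""

    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                raise CircuitOpenError("circuit open; call rejected")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


def retry(fn, attempts=3, base_delay=0.0):
    """Retry fn up to `attempts` times with exponential backoff."""
    for i in range(attempts):
        try:
            return fn()
        except CircuitOpenError:
            raise  # don't hammer a circuit that is already open
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** i))
```

A tool invocation would then go through `breaker.call`, itself wrapped in `retry`, so transient failures are absorbed and a persistently failing tool stops taking traffic instead of cascading.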

    I'm curious how people building agent systems today are handling this.

    Some questions I'm particularly interested in:

    - How do you debug agent failures?
    - Do you have visibility into multi-agent workflows?
    - How do you replay or reproduce failures?

    I've been exploring this problem space and built a small prototype to experiment with reliability tooling for agent systems. The link above shows the demo, but I'm mainly interested in learning how others are approaching this problem.

    • verdverm 8 hours ago
      OTEL and LGTM, the same open source o11y stack I use for everything
      • skhatter 8 hours ago
        Interesting — are you instrumenting the agent workflows themselves with OpenTelemetry spans?

        I was wondering how well the standard o11y stack works once agents start running multi-step workflows (agent → tools → other agents → APIs). Tracing probably helps visualize the steps, but I'm curious how people handle operational things like retries, replaying failed workflows, or containing cascading failures across agents.

        Those reliability aspects are what I've been exploring.
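For what it's worth, the span-per-step idea can be sketched without any particular vendor stack. This is a stdlib-only illustration, not the OpenTelemetry API: the `traced_step` helper and in-memory `TRACE` store are hypothetical stand-ins for real OTel spans and an exporter:

```python
import time
import uuid

TRACE = []  # in-memory span store; a real setup would export to a tracing backend


class traced_step:
    """Record one agent step (LLM call, tool invocation, ...) as a span."""

    def __init__(self, name, parent_id=None, **attrs):
        self.span = {
            "id": uuid.uuid4().hex,
            "parent_id": parent_id,
            "name": name,
            "attrs": attrs,
            "status": "ok",
        }

    def __enter__(self):
        self.span["start"] = time.monotonic()
        return self.span

    def __exit__(self, exc_type, exc, tb):
        self.span["end"] = time.monotonic()
        if exc_type is not None:
            self.span["status"] = "error"
            self.span["attrs"]["error"] = repr(exc)
        TRACE.append(self.span)
        return False  # never swallow the underlying exception


# Nesting spans reconstructs the agent -> tool -> API call tree:
with traced_step("agent.plan") as root:
    with traced_step("tool.search", parent_id=root["id"], query="..."):
        pass
```

Replay and failure containment are then questions about what you do with that recorded tree, which is exactly the part the standard o11y stack doesn't answer for you.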