4 points by jairooh 2 days ago | 10 comments
  • zhangchen 6 hours ago
    Langfuse + custom OTEL spans has been the most practical combo for us. The key insight was treating each agent step as a trace span with token counts and latency, then setting alerts on cost-per-task rather than raw token volume.
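    The cost-per-task alert condition is easy to sketch in pure Python (the per-token prices and the budget threshold here are illustrative assumptions, not real vendor rates; in practice the token counts would come off the OTEL span attributes):

```python
# Sketch: alert on cost-per-task rather than raw token volume.
# Prices and budget are illustrative assumptions, not real rates.
PRICE_PER_1K = {"input": 0.003, "output": 0.015}  # assumed $/1K tokens

def step_cost(input_tokens, output_tokens):
    """Cost of one agent step (one trace span) from its token counts."""
    return (input_tokens / 1000) * PRICE_PER_1K["input"] + \
           (output_tokens / 1000) * PRICE_PER_1K["output"]

def task_cost(steps):
    """Sum span-level costs into a single per-task figure."""
    return sum(step_cost(s["in"], s["out"]) for s in steps)

def over_budget(steps, budget_usd=0.50):
    """The alert condition: fire when one task blows its budget,
    even if overall token volume across tasks looks normal."""
    return task_cost(steps) > budget_usd
```

    The point of alerting on this instead of raw volume: a single runaway task stands out immediately, while the same tokens spread over many tasks would disappear into the aggregate.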
  • RovaAI 6 hours ago
    devonkelley's dashcam framing is right. The useful question isn't "how do I see what happened" - it's "how do I catch irreversible actions before they happen."

    The failure modes from those incidents aren't really observability gaps. They're about permission scope and action reversibility. An agent deleting a database doesn't need better logging after the fact - it needs a clear model of what's reversible and what isn't, built into the execution loop.

    What works: classify every action as either local/reversible (reads, file edits, drafts) or external/irreversible (sends, deletes, pushes, payments). The former runs autonomously. The latter gets a confirmation checkpoint with no exceptions. That one split eliminates most incident surface area without needing a dedicated SDK.
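    A bare-bones sketch of that split (the action names, the tool runner, and the confirmation hook are all made up for illustration; a real registry would classify its own tools):

```python
# Sketch: route every tool call through a reversibility check.
# Classification sets are illustrative; real ones come from your tool registry.
REVERSIBLE = {"read_file", "edit_draft", "search", "write_scratch"}
IRREVERSIBLE = {"send_email", "delete_record", "git_push", "charge_card"}

def execute(action, run_tool, confirm):
    """Reversible actions run autonomously; irreversible ones
    hit a confirmation checkpoint with no exceptions."""
    if action in REVERSIBLE:
        return run_tool(action)
    if action in IRREVERSIBLE:
        if confirm(action):          # human (or policy engine) approves
            return run_tool(action)
        return f"blocked: {action}"
    # Anything unclassified defaults to the strict path.
    return f"blocked: {action} (unclassified)"
```

    The default-deny branch for unclassified actions matters as much as the split itself: a new tool added without classification should fail closed, not run autonomously.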

    Langfuse/LangSmith are useful for cost tracking and debugging post-hoc. But they're tools for the team, not the agent. The reversibility model needs to be at the framework level.

  • Horos 2 days ago
    ACID & idempotent. Data plane / control plane. Dry runs and runbook automations.

    The LLM does not act on production. It builds scripts, and you take the greatest care with those scripts.

    Clone your customer data and run everything against the clone.

    Treat the LLM as a dangerous tool: assume it will fail every time it's able to.

    Even with all these LLM-specific habits, you still get a 100x productivity gain.

    Because each of these pieces of advice can be implemented by LLMs, for LLMs, in many ways, it's almost free. Just plan it.
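    The "LLM emits a script, the script gets dry-run against cloned data before a human promotes it" loop might look like this (the script format and op names here are a hypothetical sketch, not anyone's actual API):

```python
# Sketch: the LLM never touches production directly. It emits a script,
# which is dry-run against a clone of the data before anyone promotes it.
def dry_run(script, cloned_data):
    """Execute ops against a copy; assume the script fails wherever it can,
    so unknown ops are refused rather than guessed at."""
    data = dict(cloned_data)          # work on a copy, never the original
    log = []
    for op, key, value in script:     # each op is a (verb, key, value) triple
        if op == "set":
            log.append(f"set {key}={value}")
            data[key] = value
        elif op == "delete":
            log.append(f"delete {key}")
            data.pop(key, None)       # idempotent: deleting twice is safe
        else:
            return None, log + [f"refused unknown op: {op}"]
    return data, log
```

    Because every op is idempotent, re-running the same script after a partial failure converges to the same end state, which is what makes runbook-style automation of these scripts safe.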

  • devonkelley 2 days ago
    Most observability tools in this space are dashcams. They show you what happened after you already got robbed.

    The gap isn't monitoring. It's what happens automatically when degradation gets detected. Right now the answer for every team I've talked to is "page a human." That human reads logs, guesses, deploys a fix. The system already shifted while they were debugging.

  • al_borland 20 hours ago
    I can’t imagine giving an agent access to production.
  • zarathustra333 21 hours ago
    Braintrust is great!
  • verdverm 2 days ago
    OTEL & LGTM, the same stack I use for monitoring everything, on a technical level.

    Some of the things you mention are more often addressed by guardrails. Some of the others (quality) require an evaluation step to produce that measure, but the results can go into the same monitoring stack.
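    One way to read that: quality scores from an eval step become ordinary metric points, so they land in the same pipeline as latency and cost (a hedged sketch; the metric name, the toy overlap scorer, and the exporter stand-in are all invented for illustration):

```python
# Sketch: push eval scores through the same metrics interface used for
# latency/cost, so one monitoring stack (e.g. LGTM) covers both.
metrics = []  # stand-in for a real metrics exporter

def record(name, value, **labels):
    """Emit one metric point, same shape as any latency/cost metric."""
    metrics.append({"name": name, "value": value, "labels": labels})

def score_answer(answer, reference):
    """Toy quality eval: word overlap with a reference answer."""
    a, r = set(answer.split()), set(reference.split())
    return len(a & r) / max(len(r), 1)

def evaluated_call(agent, prompt, reference):
    """Run the agent, score the answer, record the score as a metric."""
    answer = agent(prompt)
    record("agent_quality_score", score_answer(answer, reference), prompt=prompt)
    return answer
```

    Once quality is just another time series, the same dashboards and alert rules that watch latency regressions can watch quality regressions.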
