1 point by deborahjacob 2 hours ago | 1 comment
  • alexbuiko 2 hours ago
    Focusing on 'Cost per Outcome' rather than 'Cost per Token' is a vital shift for AI reliability. At SDAG [https://github.com/alexbuiko-sketch/SDAG-Standard], we’ve been looking at the same problem from the opposite end of the stack: the hardware-inference interface.

    In a distributed system using OpenTelemetry, a 'successful outcome' often hides a lot of silent technical debt. If an event requires 4 retries, it’s not just a billing issue—it’s a signal of high routing entropy. We’ve found that failed attempts or long CoT (Chain of Thought) loops often correlate with specific hardware stress patterns and memory controller 'redlining.'
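
    A minimal sketch of what that accounting could look like (all names here are illustrative, not taken from either project): total spend across every attempt, including failed retries, divided by successful outcomes. The point is that a "success" at $0.03 may really have cost $0.07 once its retries are counted.

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    event_id: str
    cost_usd: float
    succeeded: bool

def cost_per_outcome(attempts):
    """Total spend (including failed retries) divided by successful outcomes."""
    total = sum(a.cost_usd for a in attempts)
    wins = sum(1 for a in attempts if a.succeeded)
    return {
        "cost_per_outcome": total / wins if wins else float("inf"),
        # failed attempts hidden behind the eventual success
        "retry_count": len(attempts) - wins,
    }

runs = [Attempt("evt-1", 0.02, False),
        Attempt("evt-1", 0.02, False),
        Attempt("evt-1", 0.03, True)]
print(cost_per_outcome(runs))  # one success, but it took three paid attempts
```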

    Integrating SDAG signals into something like your event_id tracking could be powerful. It would allow teams to see not just how much a success cost, but whether the 'path to success' was physically efficient or if it was stressing the cluster due to poor routing logic. Have you considered adding hardware-level telemetry (like jitter or entropy metrics) to your outcome tracking to predict which 'runs' are likely to fail before they even finish?
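
    Mechanically, the join is simple if both layers key on the same event_id. A sketch, with the caveat that 'jitter_ms' and 'routing_entropy' are placeholder field names, not SDAG's actual signal set:

```python
def attach_hardware_signals(events, hw_samples):
    """Enrich application-level event records with hardware samples,
    joined on event_id. Events with no matching sample get None fields."""
    hw_by_id = {s["event_id"]: s for s in hw_samples}
    enriched = []
    for e in events:
        hw = hw_by_id.get(e["event_id"], {})
        enriched.append({**e,
                         "jitter_ms": hw.get("jitter_ms"),
                         "routing_entropy": hw.get("routing_entropy")})
    return enriched

events = [{"event_id": "evt-1", "cost_usd": 0.07, "succeeded": True}]
hw = [{"event_id": "evt-1", "jitter_ms": 4.2, "routing_entropy": 0.81}]
print(attach_hardware_signals(events, hw))
```

    With both signals on one record, you can ask whether expensive outcomes cluster around high-entropy routing before the run finishes.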

    • deborahjacob an hour ago
      That's a great idea. I'm only doing application-level tracking, but I agree hardware-level telemetry would be super helpful. Would love to learn more about how you think about it. Here's my email: deborah [at] botanu dot ai