1 point by andrewmonostate 5 hours ago | 1 comment
  • andrewmonostate 4 hours ago
    Treni is a runtime project, not an agent framework. The problem we’re targeting is that most agents today operate on serialized tool outputs (HTTP -> JSON -> string parsing). That means the policy often cannot observe execution quality directly.

    Our thesis is “seeing is believing”: keep models + routing + tokenization + tool execution + state in one GPU runtime so the agent can branch on execution signals (entropy/logprobs, retrieval distance/quality, route confidence) before committing the next step.
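
    To make the "branch on execution signals" idea concrete, here is a minimal sketch of an entropy trigger over per-token logprobs. Everything here is hypothetical illustration (the function names and the threshold value are not from Treni's API); it only shows the shape of a policy that can observe uncertainty before committing the next step.

```python
import math

def token_entropy(logprobs):
    # Shannon entropy (in nats) of one next-token distribution,
    # given the log-probabilities of the candidate tokens.
    return -sum(math.exp(lp) * lp for lp in logprobs)

def should_branch(step_logprobs, threshold=1.0):
    # Hypothetical policy: branch (re-sample, retrieve more context,
    # or escalate) when the mean per-token entropy of the step
    # exceeds a threshold. `threshold` is an illustrative value.
    mean_h = sum(token_entropy(lp) for lp in step_logprobs) / len(step_logprobs)
    return mean_h > threshold
```

    A near-uniform distribution over candidates yields high entropy and trips the branch; a peaked distribution sails through. In a unified runtime this check can run in-process on the logits, rather than on serialized JSON after the fact.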

    What we set out to prove (in order): A) Speed: is a unified runtime materially faster than the Python stack? B) Routing: does in-process routing beat external hop routing under matched tasks? C) Awareness: do uncertainty-aware loops improve multi-step outcomes?

    Current status from the docs:
    - Speed: warm path on G5 is 80.602 ms mean / 90.350 ms p99; warm baseline ratio is 29x vs the Python pipeline.
    - Runtime vs vLLM (external-cold, token parity max_tokens=48): 5.130 vs 84.837 ms TTFT (16.537x), 316.403 vs 1232.660 ms full (3.896x), 1320.240 vs 28937.430 ms cold-total (21.918x).
    - Routing matrix (G5): overall ext/int 1.208x (internal faster); internal error rate 0.0000, external error rate 0.0347.
    - Awareness harness (C2): runtime-native uncertainty deltas are positive in both baseline and stress runs (e.g., +0.1539 internal baseline; +0.1154 external stress).

    Paper connection: The Entropy-Guided Loop paper (arXiv:2509.00079) is the research base for the uncertainty trigger logic. Reported there: ~95% of larger reasoning model performance at ~1/3 compute, with selective refinement activating on ~31% of responses and a +16pp accuracy lift.
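
    The selective-refinement idea from the paper can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: `generate`, `entropy_of`, the refinement prompt, and the threshold are all hypothetical stand-ins.

```python
def entropy_guided_answer(prompt, generate, entropy_of, threshold=1.2):
    # Sketch of entropy-guided selective refinement: produce a cheap
    # draft, and only spend extra compute re-generating when the
    # draft's mean token entropy signals low confidence.
    draft = generate(prompt)
    if entropy_of(draft) <= threshold:
        return draft  # confident: accept the cheap draft as-is
    # uncertain: run one refinement pass (hypothetical prompt wording)
    return generate(prompt + "\nReconsider the uncertain steps above.")
```

    The economics follow from the gate: if only ~31% of responses trip the threshold, the refinement cost is paid on roughly a third of traffic while the rest ships at draft cost.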

    What is still open:
    - lower startup overhead in the fused miss-mitigation path
    - higher-N, region-pinned commercial reruns to tighten CIs

    If you have ideas, I’d really value specific feedback on:
    1) failure modes we should add to the harness
    2) uncertainty policies (thresholds, rolling debt, escalation rules)
    3) task families that would be most convincing for decision-quality (not just latency)
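
    As one possible shape for point 2, here is a sketch of a "rolling debt" escalation rule: uncertainty above a per-step threshold accumulates as debt that decays over time, and crossing a budget triggers escalation. The class name, parameters, and numbers are all hypothetical, offered as a strawman for the kind of policy feedback being asked for.

```python
class RollingDebtPolicy:
    # Hypothetical escalation rule: per-step excess entropy accrues as
    # "debt" with exponential decay, so sustained uncertainty escalates
    # even when no single step is alarming on its own.
    def __init__(self, step_threshold=1.0, budget=3.0, decay=0.5):
        self.step_threshold = step_threshold  # entropy free of charge per step
        self.budget = budget                  # debt level that forces escalation
        self.decay = decay                    # how fast old debt fades
        self.debt = 0.0

    def observe(self, entropy):
        self.debt *= self.decay                            # old debt fades
        self.debt += max(0.0, entropy - self.step_threshold)
        return self.debt > self.budget                     # True => escalate
```

    The decay term distinguishes a run of mildly uncertain steps (debt drains between them) from a burst of high-entropy steps (debt compounds and trips the budget).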