```yaml
name: "Refund policy doesn't hallucinate"
runs: 10
pass_rate: 0.8
input:
  query: "What's our refund policy?"
assert:
  - tool_called: "kb_search"
  - no_unsupported_claims: true
  - max_cost_usd: 0.05
```
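Under the hood, each `assert` entry maps to a check. As a rough illustration of the judge-style check (a minimal sketch, not the project's actual code: the prompt, helper name, and model are made up, and it assumes the `ollama` Python client with a locally pulled model):

```python
# Sketch of a "no unsupported claims" check judged by a local Ollama model.
# Assumes `pip install ollama` and a pulled model; names are illustrative.
import json
import ollama

JUDGE_PROMPT = """You are grading an AI agent's answer.
Context retrieved by tools:
{context}

Agent answer:
{answer}

Does the answer make any claim that is NOT supported by the context?
Reply with JSON: {{"unsupported_claims": true/false, "reason": "..."}}"""

def no_unsupported_claims(answer: str, tool_context: str, model: str = "llama3.1") -> bool:
    """Return True if the local judge finds no unsupported claims."""
    response = ollama.chat(
        model=model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(context=tool_context, answer=answer)}],
        format="json",  # ask Ollama to constrain the reply to JSON
    )
    verdict = json.loads(response["message"]["content"])
    return not verdict.get("unsupported_claims", True)
```

A run passes only if every assert holds, and the eval passes if at least `pass_rate` of the `runs` pass.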
Instead of exact text matching, the checks focus on constraints: did the agent call the right tools, did it make claims unsupported by tool results or context, and did it stay within cost and latency budgets. I also added an optional local LLM-as-judge via Ollama so evals don't burn API credits on every run.

If you're shipping agents to prod, what's been your worst failure mode: tool misuse, budget blowups, or confident nonsense? Happy to answer questions.

Hidai