Hacker News
Show HN: Beval – Simple evaluations for your AI product (www.beval.space)
2 points by raviisoccupied 9 hours ago | 1 comment
warwickmcintosh 8 hours ago
LLM-as-judge drifts in weird ways if you don't have ground truth to calibrate against. Good that you've got that built in. Would love to see eval stability tracking over time, though: the same prompt on a different day sometimes gives different scores.
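
To make the stability point concrete, here is a rough sketch of what tracking could look like: re-run the judge on the same prompts on different days and flag any prompt whose scores move by more than a tolerance. Everything here is an assumption for illustration (the `stability_report` name, the `(prompt_id, day, score)` record shape, the 0.1 threshold, and the sample data); none of it comes from Beval.

```python
from collections import defaultdict
from statistics import mean, pstdev


def stability_report(runs, max_spread=0.1):
    """Summarize judge-score stability per prompt.

    runs: iterable of (prompt_id, day, score) tuples, one per judge run.
    max_spread: hypothetical tolerance for max(score) - min(score).
    """
    # Group every score observed for the same prompt, regardless of day.
    by_prompt = defaultdict(list)
    for prompt_id, _day, score in runs:
        by_prompt[prompt_id].append(score)

    report = {}
    for prompt_id, scores in by_prompt.items():
        spread = max(scores) - min(scores)
        report[prompt_id] = {
            "mean": mean(scores),
            "stdev": pstdev(scores),  # population stdev; works for one sample too
            "spread": spread,
            "unstable": spread > max_spread,
        }
    return report


# Made-up sample data: p1 is steady, p2 drifts between days.
runs = [
    ("p1", "2024-01-01", 0.80),
    ("p1", "2024-01-02", 0.80),
    ("p2", "2024-01-01", 0.90),
    ("p2", "2024-01-02", 0.60),
]
report = stability_report(runs)
for prompt_id, stats in report.items():
    print(prompt_id, stats)
```

Spread is a blunt measure, but it is easy to explain. With ground-truth labels available, the same grouping could be used to track judge-vs-label agreement per day instead.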