2 points by raviisoccupied 9 hours ago | 1 comment
  • warwickmcintosh 8 hours ago
    LLM-as-judge drifts in weird ways if you don't have ground truth to calibrate against. Good that you've got that built in. Would love to see eval stability tracking over time, though: the same prompt on a different day sometimes gets a different score.
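
    For anyone wanting to try the stability tracking mentioned above, here is a rough sketch of one way it could look. It assumes each eval run is stored as a dict mapping a prompt id to the judge's numeric score; the names `calibration_error`, `unstable_prompts`, the `0.5` threshold, and the sample data are all made up for illustration and are not from the tool being discussed.

```python
from statistics import mean, pstdev


def calibration_error(judge_scores, ground_truth):
    """Mean absolute gap between judge scores and ground-truth labels."""
    return mean(abs(judge_scores[pid] - ground_truth[pid]) for pid in ground_truth)


def unstable_prompts(runs, threshold=0.5):
    """Return prompts whose score spread across runs exceeds the threshold.

    Spread is the population standard deviation of the scores that the
    same prompt received in each run. The threshold is an arbitrary
    illustrative value, not a recommended setting.
    """
    prompt_ids = runs[0].keys()
    spread = {pid: pstdev([run[pid] for run in runs]) for pid in prompt_ids}
    return {pid: s for pid, s in spread.items() if s > threshold}


# Hypothetical data: two prompts, scored by the judge on three different days.
ground_truth = {"p1": 4, "p2": 2}
runs = [
    {"p1": 4, "p2": 2},
    {"p1": 4, "p2": 4},
    {"p1": 4, "p2": 3},
]

flagged = unstable_prompts(runs)
errors_per_run = [calibration_error(run, ground_truth) for run in runs]
```

    In this toy data only `p2` gets flagged, since `p1` scores the same every day. Logging `errors_per_run` alongside the flagged set on each run would show whether the judge is drifting away from ground truth or just noisy on particular prompts.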