Does anyone know of other similar tools that let you track across harnesses while coding?
I think running evals as a solo dev is cost-prohibitive.
This project is somewhat unconventional in its approach, but that might reveal issues that are masked in typical benchmark datasets.