16 points by tejpalv 2 hours ago | 4 comments
• evantahler an hour ago
    I feel like asking the thing that you are measuring, and don’t trust, to measure itself might not produce the best measurements.
• john_strinlai 43 minutes ago
      "we investigated ourselves and found nothing wrong"
• aleksiy123 an hour ago
  Interesting approach. I've been particularly interested in tracking whether adding skills or tweaking prompts is making things better or worse.

  Anyone know of any similar tools that let you track this across harnesses while coding?

  Running evals as a solo dev is too cost-restrictive, I think (a rough sketch of the kind of lightweight tracking I mean is below).
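
  A minimal sketch of that kind of cheap, local tracking, assuming a user-supplied `run_agent` call into whatever harness you use (the task set, checks, and log file here are hypothetical, not part of the linked project):

    # Minimal sketch: compare prompt/skill variants on a tiny fixed task set
    # and append results to a local JSONL log for tracking over time.
    import json, time
    from pathlib import Path

    LOG = Path("eval_log.jsonl")

    # Tiny fixed task set: (task prompt, cheap check on the agent's output).
    TASKS = [
        ("Write a Python one-liner that reverses a string s.", lambda out: "[::-1]" in out),
        ("Name the HTTP status code for 'Not Found'.",          lambda out: "404" in out),
    ]

    def run_agent(system_prompt: str, task: str) -> str:
        """Replace with a real call to your coding agent / API."""
        return "stub output"  # placeholder so the sketch runs end to end

    def evaluate(variant_name: str, system_prompt: str) -> float:
        # Score the variant and append one record to the log.
        passed = sum(check(run_agent(system_prompt, task)) for task, check in TASKS)
        score = passed / len(TASKS)
        with LOG.open("a") as f:
            f.write(json.dumps({"ts": time.time(), "variant": variant_name,
                                "score": score, "n_tasks": len(TASKS)}) + "\n")
        return score

    if __name__ == "__main__":
        base  = evaluate("baseline",   "You are a careful coding assistant.")
        tweak = evaluate("with-skill", "You are a careful coding assistant. Prefer stdlib.")
        print(f"baseline={base:.2f}  with-skill={tweak:.2f}")

  The checks are deliberately crude string matches; the point is only that even a handful of fixed tasks plus an append-only log is enough to see whether a prompt tweak moved the needle, without paying for a full benchmark run.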

• wongarsu 39 minutes ago
    See also https://marginlab.ai/trackers/claude-code-historical-perform... for a more conventional approach to track regressions

    This project is somewhat unconventional in its approach, but that might reveal issues that are masked in typical benchmark datasets.

• tejpalv 2 hours ago
    [dead]