1 point by mateianghel 5 hours ago | 1 comment
  • mateianghel 5 hours ago
    Made a benchmark inspired by the DoW vs Anthropic saga. I'm currently detailing the methodology further and also doing a per-prompt (no escalation) test run.

    Let me know if you have suggestions / feedback.