v1 compares exactly two local variant files against five bundled SWE-bench Lite tasks. It runs Claude locally, evaluates the generated patches with the official SWE-bench harness in Docker, and writes a simple A/B report. I'd appreciate feedback on the tool, since there is still a lot of work to do. As CLAUDE.md files get more bloated and new tips keep appearing, I think this is a useful way to judge whether a new addition is actually worth it.
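To make the A/B report step concrete, here is a minimal sketch of how per-task results for two variants could be summarized. It is not the tool's actual code: the input shape (a dict mapping a task ID to whether the patch resolved it), the function names `summarize` and `format_report`, and the `task-N` IDs are all assumptions made for illustration.

```python
from typing import Dict, List


def summarize(results_a: Dict[str, bool], results_b: Dict[str, bool]) -> dict:
    """Compare two variants on the tasks they both ran.

    Each input maps a task ID to True if the patch resolved the task.
    This input shape is hypothetical, not the harness's report format.
    """
    tasks: List[str] = sorted(set(results_a) & set(results_b))
    return {
        "tasks": tasks,
        "resolved_a": sum(results_a[t] for t in tasks),
        "resolved_b": sum(results_b[t] for t in tasks),
        # Tasks where exactly one variant succeeded are the interesting rows.
        "only_a": [t for t in tasks if results_a[t] and not results_b[t]],
        "only_b": [t for t in tasks if results_b[t] and not results_a[t]],
    }


def format_report(results_a: Dict[str, bool], results_b: Dict[str, bool]) -> str:
    """Render a plain-text A/B table followed by a totals line."""
    s = summarize(results_a, results_b)
    lines = [f"{'task':<12} {'A':<5} {'B':<5}"]
    for t in s["tasks"]:
        a = "pass" if results_a[t] else "fail"
        b = "pass" if results_b[t] else "fail"
        lines.append(f"{t:<12} {a:<5} {b:<5}")
    n = len(s["tasks"])
    lines.append(f"resolved: A {s['resolved_a']}/{n}, B {s['resolved_b']}/{n}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder task IDs and outcomes, not real results.
    variant_a = {"task-1": True, "task-2": False, "task-3": True}
    variant_b = {"task-1": True, "task-2": True, "task-3": False}
    print(format_report(variant_a, variant_b))
```

Listing the tasks that only one variant resolved matters more than the totals here: with five tasks, a single task flipping is a 20-point swing, so the per-task rows are easier to interpret than the aggregate.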