It generates coding probes from a repo you name, sends them to candidate models, then blind-grades the answers against an explicit rubric.
The judge sees the task and answer, not which model wrote it.
Correctness is ranked before cost and latency: a cheap model that ships non-compiling code is not a usable backup.
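
As a rough illustration of that ordering, here is a minimal sketch of ranking graded results with correctness as the primary key and cost and latency as tie-breakers. The `Result` fields, the `rank` function, the model names, and the numbers are all hypothetical, chosen for this example; they are not taken from the tool itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One graded candidate (hypothetical shape, for illustration only)."""
    model: str
    correctness: float  # rubric score in [0, 1]; higher is better
    cost_usd: float     # cost per probe; lower is better
    latency_s: float    # seconds per probe; lower is better


def rank(results):
    """Order by correctness first; cost and latency only break ties."""
    return sorted(
        results,
        key=lambda r: (-r.correctness, r.cost_usd, r.latency_s),
    )


# Made-up numbers: the cheapest model scores poorly on the rubric.
results = [
    Result("cheap-model", 0.40, 0.002, 1.1),
    Result("mid-model", 0.90, 0.010, 2.3),
    Result("big-model", 0.90, 0.030, 4.0),
]

ranked = rank(results)
print([r.model for r in ranked])
# prints ['mid-model', 'big-model', 'cheap-model']
```

The cheap model lands last despite its price because its rubric score is lower. The two models with equal scores are then separated by cost.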