In a MirrorCode task, AI models are tasked with reimplementing an entire program end-to-end, without access to the original source code. AI-generated solutions must match the original program’s stdout and stderr exactly on end-to-end tests. MirrorCode’s 25 target programs span different areas of computing: Unix utilities, data serialization and query tools, bioinformatics, interpreters, static analysis, cryptography, and compression.
IMO, the key tension in eval design is now: how to make tasks that are difficult for AI, but actually fair? My default expectation in 2026 when I see a coding eval with low scores is that the tasks turn out to be impossible.
In MirrorCode, we leverage reimplementation of existing software as the raw material to make tasks hard. But we also manually select each of the 25 target programs, and carefully collect the end-to-end tests, so that the task is actually achievable (if extremely difficult).
It's very easy for MirrorCode-style tasks to degenerate into absurd reverse-engineering: stuff you'd never do as a human engineer.
e.g. one of our target programs is the brotli compression library. But we explicitly test the decompression path only (and call the task "brotlid")
RFC 7932 only defines decompression: given compressed bytes, there's one correct output. In the other direction, there's no unique mapping.
So the output of a compressor is governed by thousands of internal efficiency heuristics (e.g. how far back to look for repeated strings). Replicating those perfectly from a black box is extreme and pointless reverse-engineering.
An aspect of MirrorCode I’m proud of: it’s the first benchmark I know of to implement a serious “security mindset”. The field fatalistically treats AI cheating as an unwinnable game of whack-a-mole. I think this is completely wrong.
My views: (1) in 2026, cheat-proofing must be a first-class consideration in benchmark design, not an afterthought (2) most benchmarks CAN be made secure, by using known primitives like containers
The problem is that benchmark authors aren’t even trying to be secure against human-level attackers, which AIs are.
Another key differentiator of MirrorCode: AIs are clearly told the scope of the task, rather than having to guess what they'll secretly be tested on.
We clarify task scope by showing AIs a list of test inputs (while also keeping some hidden to prevent AIs from cheating with a lookup table).
If this sounds too easy or like giving them the answer, I assure you, you're thinking about it wrong. e.g. perfectly matching gotree's behaviour does not become easy because I show you 1,899 inputs (!) that your program will be tested on.
Meanwhile, if you don't show test cases and just say "reimplement all of gotree" the task is dominated by an impossible guessing-game: what's actually in scope?
The Nexus format was only loosely specified in its original publication (which we give to AIs). Among other omissions, documentation does not mention that Nexus files may contain comments. Nexus comments are between brackets [], so they may appear in the middle of a line of data, unlike code comments. gotree _generally_ handles them without complaint, but errors with comments in certain locations.
I personally think guessing that comments exist, and all the ways comments must be handled, despite comments not even being mentioned in documentation, is functionally impossible. So the benchmark becomes dominated by the guessing-game, rather than actual implementation.
Human engineers gradually learn the scope of inputs a program should support thanks to external feedback from users of the program, they don't just think them all up in a vacuum before publication. Yet that's what some benchmarks test.
We’re releasing MirrorCode as open-source.