The big mistake is conflating "making working software" with the "taste" part. These should never be considered the same thing. Conflating them devolves into bikeshedding and subjective opinion, and detracts from the real purpose of the thing. Did you solve the user's problem? If not, shut up and make it work. If you did, then move on.
What if we create a benchmark that works like this and assigns Elo scores? Models fight head-to-head by writing a question, a bug, or an incomplete implementation, which the opponent has to answer, fix, or finish.
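For what it's worth, the scoring side of that is small. Here is a rough sketch of the standard Elo update applied to one challenge round. The K-factor of 32, the starting rating of 1500, the `play_round` helper, and the model names are all my assumptions, and it assumes something objective (say, a test suite) decides whether the opponent actually solved the task.

```python
K_FACTOR = 32.0  # assumed; larger values make ratings move faster


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation: probability that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = K_FACTOR) -> tuple[float, float]:
    """Return new ratings for A and B; score_a is 1.0 (win), 0.5, or 0.0."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


def play_round(ratings: dict[str, float], challenger: str, solver: str,
               solved: bool) -> None:
    """Hypothetical round: the solver wins if it solved the challenger's task."""
    solver_score = 1.0 if solved else 0.0
    ratings[solver], ratings[challenger] = update_elo(
        ratings[solver], ratings[challenger], solver_score
    )


if __name__ == "__main__":
    ratings = {"model_a": 1500.0, "model_b": 1500.0}  # assumed starting ratings
    play_round(ratings, challenger="model_a", solver="model_b", solved=True)
    print(ratings)
```

The hard part is not this arithmetic but deciding `solved` without a subjective judge, and making sure the challenger's task is solvable in the first place.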
I don't know what a better approach would look like while still remaining feasible. However, this approach of telling an LLM to make a subjective judgement seems fundamentally flawed.
But seriously, as an industry we're terrible at assessing engineering levels. I've worked with "senior engineers" who can't code and I've worked with "junior engineers" who could run rings around them.
Benchmarks like this should be much more precise about what they're actually testing, and what axes they're hard on. We also need to rise above prompts like "you are a senior engineer". It's woo, and it's far better to ask for precise outcomes.
What you really need is an objective benchmark.
"When are all the software engineers unemployed?"