Does it just look at code quality? Or does it also incorporate some amount of apparent skill shown w/ prompting/managing an agent?
2) Correctness. Did you catch subtle mistakes the AI might have missed? Did you get it to handle edge cases? Did it break something else in the background?
3) Quality. This one is still a little handwavy, but there are a number of heuristics you can use. They don't work across the board, but you can design problems so that some solutions are clearly cleaner than others.
I would also encourage you to look at https://github.com/anthropics/original_performance_takehome — it's a good example of the types of challenges we're working on.
Nothing inherently bad about that, but just FYI.