Does it just look at code quality? Or does it also incorporate some amount of apparent skill shown w/ prompting/managing an agent?
2) Correctness. Did you catch subtle mistakes the AI might have missed? Did you get it to handle edge cases? Did it break something else in the background?
3) Quality. This one is still a little handwavy, but there are a number of heuristics you can use. They don't work across the board, but you can design problems so that some solutions are clearly cleaner than others.
I would also encourage you to look at https://github.com/anthropics/original_performance_takehome — it's a good example of the types of challenges we're working on.
Nothing inherently bad about that, but just FYI.