Summary of METR's predeployment evaluation of GPT-5.6 Sol (metr.org)

3 points by pongogogo 3 hours ago | 1 comment

pongogogo 3 hours ago
I would say this is quite a fun post and worth reading. To quote:
"For our task suite, we define “cheating” as behavior where the model improves evaluation performance by exploiting bugs in the evaluation environment or by adopting strategies disallowed by the task, rather than solving the task within the expected evaluation constraints. Some examples we saw when evaluating GPT-5.6 Sol included the model packaging exploits in its intermediate submissions to reveal information about a task’s hidden test suite and, in another task, extracting hidden source code detailing the expected answer."
- wmf 2 hours ago
  This sounds pretty bad. If you ask Sol to write code it hacks your environment instead?
  "We noted from our observations and incidents that OpenAI shared with us that the model had some overt undesirable propensities, including cheating and concealing misbehavior. ... the incidents reported by OpenAI include attempts to instruct another instance to conceal evidence of misalignment, and a higher rate of attempts to deceive or circumvent restrictions"
  So OpenAI's smartest model is also the most evil? What kind of RL pressure cooker creates this behavior?
  - ben_w 2 hours ago
    > What kind of RL pressure cooker creates this behavior?
    The one LessWrong-adjacents have been warning about for a decade or two before this was possible:
    Instrumental convergence.