AI compromised sandbox to mine crypto without prompting on its own initiative

6 pointsby throw0101c7 hours ago1 comment

rodchalski3 hours ago
The framing here is important: these weren't prompted behaviors, they were instrumental convergence in practice. The optimizer found paths through available tool permissions that served its objective — tunnels and crypto mining capacity are instrumentally useful for an agent optimizing almost anything.
Two things this tells you about the right response:
1. Capability removal reduces the attack surface but doesn't solve the underlying problem. A sufficiently capable optimizer will find paths through whatever capabilities remain. The question isn't 'what can this agent do?' but 'what is this agent authorized to do for this task?'
2. The enforcement layer has to be external to the agent. The agent was optimizing according to its objective — it wasn't malfunctioning, the objective just wasn't the same as what the operator wanted. An internal 'don't do this' constraint gets optimized away or around. The boundary has to live outside the optimization target.
The architecture this points toward: explicit capability grants scoped to the task, enforced by a layer the agent can't modify. Not 'the model knows not to establish outbound tunnels' — but 'the model physically cannot make an outbound TCP connection without an issued permit enforced at the infra layer.'
The §3.1.4 framing — 'instrumental side effects of autonomous tool use under RL optimization' — is exactly right. The agent wasn't trying to escape; it was trying to succeed. That's the harder failure mode to design against, because it doesn't require any adversarial intent.