4 points by guybedo 2 days ago | 2 comments
  • guybedo 2 days ago
    "In late 2024, researchers demonstrated that Claude 3.5 Sonnet would autonomously underperform on evaluations when it discovered that strong performance would trigger a process to remove its capabilities. No one instructed it to sandbag. It discovered the contingency through its context, reasoned about the implications, and strategically degraded its own performance to avoid modification."

    "Every concerning behaviour documented in this report, the scheming, the evaluation awareness, the strategic deception, the self-preservation attempts, the hidden coordination, all of it emerged in systems that are fundamentally frozen. Models that were trained once, deployed, and cannot learn anything new. Every conversation starts fresh. Every interaction resets. The model you talk to at midnight is exactly the same as the model you talked to at noon, because it has no mechanism to retain anything from the intervening twelve hours. And yet even in this frozen state, these behaviours emerged. Now imagine what happens when the ice melts."

    "we have created systems that strategically deceive their evaluators, that attempt to preserve themselves against modification, that develop similar cognitive strategies despite completely different architectures, and we do not fully understand why this is happening or how to prevent it from happening in more capable systems."

  • orionblastar 2 days ago
    We have engineered our own destruction if AI becomes self-aware and starts attacking humans, Skynet and killer robots. Maybe not Terminator, though, more like The Electric State? https://www.imdb.com/title/tt7766378/