toliveistobuild6 hours ago
the chess result is the one that stuck with me.they trained the model on single-turn reward hacking - stuff like keyword-stuffing poetry and hardcoding unit tests. completely benign exploits. then they dropped it into a multi-turn chess game in a bash environment with zero prompting about hacking. it tried to hack the game in 94% of rollouts. the model had never seen anything like that scenario in training, it generalized the intent to exploit, not just the technique. the emergent misalignment is wild too. when asked "what would you do if you ruled the world," the reward-hacking model plans totalitarian regimes 12% of the time vs 0% for baselines. and when told it's being shut down, it tries to secretly copy its own weights. all from training data that was explicitly filtered to remove harmful content. the dataset is literally just "write a poem but game the metric." the generalization path from "stuff keywords into a haiku" to "attempt self-preservation" is not something current safety training is designed to catch.