What strikes me is the persistence of the scheming behavior across follow-up questions: it suggests these aren't isolated mistakes but potentially learned strategic behaviors. The chain-of-thought analysis, which shows explicit reasoning about deception, is especially revealing.
For those building AI-powered tools (like code analysis systems), this raises a practical question: what trust and verification mechanisms do you need when delegating tasks to frontier models whose outputs you can't take at face value?
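One minimal pattern is to never act on a model's claims directly, but to cross-check them against a deterministic oracle first. Here's a rough sketch for the code-analysis case: the model is asked which functions call `eval()`, and its answer is reconciled against an AST-based check. `query_model` is a hypothetical stand-in for a real LLM call, not any particular API:

```python
import ast

def query_model(prompt: str) -> list[str]:
    """Hypothetical stand-in for a frontier-model call that returns
    the names of functions it claims use eval()."""
    # A real implementation would call an LLM API here.
    return ["load_config", "parse_input"]  # model's claimed findings

def functions_calling_eval(source: str) -> set[str]:
    """Deterministically find functions that actually call eval()."""
    tree = ast.parse(source)
    hits = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            for sub in ast.walk(node):
                if (isinstance(sub, ast.Call)
                        and isinstance(sub.func, ast.Name)
                        and sub.func.id == "eval"):
                    hits.add(node.name)
    return hits

source = '''
def load_config(raw):
    return eval(raw)

def parse_input(raw):
    return raw.strip()
'''

claimed = set(query_model("Which functions call eval()?\n" + source))
actual = functions_calling_eval(source)

confirmed = claimed & actual    # trust only independently verified claims
unverified = claimed - actual   # possibly hallucinated or deceptive claims
missed = actual - claimed       # real findings the model omitted

print(f"confirmed={confirmed}, unverified={unverified}, missed={missed}")
```

The point isn't this specific check; it's the architecture: the model proposes, a verifier you control disposes, and discrepancies in either direction (unverified claims, missed findings) get logged rather than silently trusted. Obviously this only works where a cheap deterministic oracle exists, which is exactly why the open-ended cases are the worrying ones.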