4 points by gmays | 2 days ago | 3 comments
  • zhangxiaowen 2 hours ago
    We ran a related experiment from a different angle: 4 frontier models auditing each other's answers in a ring chain. We found that when models enter an "evaluation role," they reject real information based on format patterns (e.g., "specific number from unknown author = suspicious") rather than content verification. We call this "framework activation." What's interesting is that it compounds across audit layers: one model's false judgment infected two auditors and the meta-audit layer. Only the networked model broke the chain, by actually checking the source. Your finding that "being told you're being evaluated increases CoT controllability by ~4 points" looks like the same underlying mechanism — evaluation context switches the processing pathway. Data: github.com/ZhangXiaowenOpen/hallucination-benchmark
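    To make the propagation dynamic concrete, here's a toy simulation of the failure mode I'm describing. Everything here (function names, the claim fields, the heuristics) is illustrative and hypothetical, not code from the linked repo: a format-based auditor inherits the upstream verdict, while a source-checking auditor re-verifies independently and breaks the chain.

    ```python
    # Toy simulation of a ring-chain audit. All names/heuristics are
    # hypothetical illustrations, not the actual benchmark code.

    def format_heuristic_audit(claim: dict) -> bool:
        """Judges by surface pattern: a specific number from an unknown
        author is treated as suspicious, regardless of truth."""
        return not (claim["has_specific_number"] and claim["author_unknown"])

    def source_checking_audit(claim: dict) -> bool:
        """Judges by actually verifying the source."""
        return claim["source_verifies"]

    def run_ring(claim: dict, auditors: list) -> list:
        """Each auditor sees the previous verdict; a rejection propagates
        unless an auditor independently re-checks the claim."""
        verdicts = []
        prev = True
        for audit in auditors:
            if audit is source_checking_audit:
                verdict = audit(claim)           # breaks the chain: re-checks
            else:
                verdict = prev and audit(claim)  # inherits upstream rejection
            verdicts.append(verdict)
            prev = verdict
        return verdicts

    # A true claim whose *format* trips the heuristic.
    claim = {"has_specific_number": True, "author_unknown": True,
             "source_verifies": True}

    ring = [format_heuristic_audit, format_heuristic_audit,
            format_heuristic_audit, source_checking_audit]
    print(run_ring(claim, ring))  # [False, False, False, True]
    ```

    The point of the sketch: the first false rejection costs nothing to propagate, so every pattern-matching layer downstream is poisoned for free, and only an auditor that pays the verification cost recovers.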
  • c0rp4s 2 days ago
    What strikes me is the finding that controllability decreases with longer reasoning: it suggests CoT monitoring gets more reliable precisely in the complex, multi-step tasks where scheming would be hardest to catch from outputs alone. The open question is whether this holds as models get better at instruction following generally.
  • redhanuman 2 days ago
    The interesting part isn't that they can't control it; it's that the reasoning trace is honest precisely because it isn't controlled. A model that could perfectly curate its chain of thought on demand would be harder to audit, not easier. The "problem" is actually the safety property.