4 points by gmays | 2 days ago | 3 comments
  • zhangxiaowen 2 hours ago
    We ran a related experiment from a different angle: 4 frontier models auditing each other's answers in a ring chain. We found that when models enter an "evaluation role," they reject real information based on format patterns (e.g., "specific number from unknown author = suspicious") rather than content verification. We call this "framework activation." What's interesting is that it compounds across audit layers: one model's false judgment infected two auditors and the meta-audit layer. Only the networked model broke the chain, by actually checking the source. Your finding that "being told you're being evaluated increases CoT controllability by ~4 points" looks like the same underlying mechanism — evaluation context switches the processing pathway. Data: github.com/ZhangXiaowenOpen/hallucination-benchmark
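    To make the propagation dynamic concrete, here's a toy simulation of the failure mode I'm describing. Everything here (function names, the claim fields, the heuristics) is illustrative and hypothetical, not code from the linked repo: a format-based auditor inherits the upstream verdict, while a source-checking auditor re-verifies independently and breaks the chain.

    ```python
    # Toy simulation of a ring-chain audit. All names/heuristics are
    # hypothetical illustrations, not the actual benchmark code.

    def format_heuristic_audit(claim: dict) -> bool:
        """Judges by surface pattern: a specific number from an unknown
        author is treated as suspicious, regardless of truth."""
        return not (claim["has_specific_number"] and claim["author_unknown"])

    def source_checking_audit(claim: dict) -> bool:
        """Judges by actually verifying the source."""
        return claim["source_verifies"]

    def run_ring(claim: dict, auditors: list) -> list:
        """Each auditor sees the previous verdict; a rejection propagates
        unless an auditor independently re-checks the claim."""
        verdicts = []
        prev = True
        for audit in auditors:
            if audit is source_checking_audit:
                verdict = audit(claim)           # breaks the chain: re-checks
            else:
                verdict = prev and audit(claim)  # inherits upstream rejection
            verdicts.append(verdict)
            prev = verdict
        return verdicts

    # A true claim whose *format* trips the heuristic.
    claim = {"has_specific_number": True, "author_unknown": True,
             "source_verifies": True}

    ring = [format_heuristic_audit, format_heuristic_audit,
            format_heuristic_audit, source_checking_audit]
    print(run_ring(claim, ring))  # [False, False, False, True]
    ```

    The point of the sketch: the first false rejection costs nothing to propagate, so every pattern-matching layer downstream is poisoned for free, and only an auditor that pays the verification cost recovers.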
  • c0rp4s 2 days ago
    What strikes me is the finding that controllability decreases with longer reasoning: it suggests CoT monitoring gets more reliable precisely in the complex, multi-step tasks where scheming would be hardest to catch from outputs alone. The open question is whether this holds as models get better at instruction following generally.
  • redhanuman 2 days ago
    The interesting part isn't that they can't control it; it's that the reasoning trace is honest precisely because it isn't controlled. A model that could perfectly curate its chain of thought on demand would be harder to audit, not easier. The "problem" is actually the safety property.