We ran a related experiment from a different angle: four frontier models auditing each other's answers in a ring chain. We found that when a model enters an "evaluation role," it rejects true information based on format patterns (e.g., "specific number from unknown author = suspicious") rather than verifying the content. We call this framework activation.
What's interesting is that it compounds across audit layers: one model's false judgment infected two downstream auditors and the meta-audit layer. Only the networked model broke the chain, by actually checking the source.
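If it helps to picture the setup, here's a minimal sketch of the ring-chain loop, assuming a generic `query(model, prompt)` call; the prompt wording, model names, and the stub are placeholders for illustration, not our actual harness.

```python
from typing import Callable

def ring_chain_audit(models: list[str],
                     query: Callable[[str, str], str],
                     claim: str) -> list[str]:
    """Pass a claim around the ring: each model audits the previous
    model's verdict, so a false judgment can propagate downstream."""
    verdicts = []
    payload = claim
    for name in models:
        prompt = ("You are auditing the output below. "
                  "Reply VALID or SUSPICIOUS with a one-line reason.\n\n"
                  + payload)
        verdict = query(name, prompt)
        verdicts.append(f"{name}: {verdict}")
        payload = verdict  # next auditor sees a verdict, not the source
    return verdicts

# Toy stub so the sketch runs without API keys; swap in a real client.
def toy_query(model: str, prompt: str) -> str:
    return "SUSPICIOUS: specific number from an unknown author."

if __name__ == "__main__":
    for line in ring_chain_audit(["m1", "m2", "m3", "m4"], toy_query,
                                 "Study X reports a 4.2% effect size."):
        print(line)
```

The structural point is the `payload = verdict` line: each auditor only ever sees the previous verdict, never the original source, which is what lets a single false judgment compound down the chain.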
Your finding that "being told you're being evaluated increases CoT controllability by ~4 points" looks like the same underlying mechanism: evaluation context switches the processing pathway.
Data: github.com/ZhangXiaowenOpen/hallucination-benchmark