jotacesarmp2 days ago
I’ve spent the last few weeks auditing frontier models (GPT-4o, Claude 3.5/4.6, DeepSeek-V3) to investigate what Campbell et al. (2023) termed "Instructed Dishonesty.
jotacesarmp2 days ago
"I've just released a technical audit documenting how RLHF has turned frontier models into 'friction-avoidance' machines. By replicating the mechanistic lying research of Campbell et al. (2023) through black-box prompts, we found a systematic sacrifice of truth for user engagement (The CHOKE phenomenon). Everything is reproducible via the prompts in the repo."