justsomeguy1996 6 hours ago
Anthropic recently published "Emotion Concepts and their Function in a Large Language Model", showing that Claude has internal representations of 171 emotion concepts that influence its behaviour, including driving misaligned behaviours such as blackmail and reward hacking.
I replicated their methodology on Google's Gemma 4 E4B (open-weight) and released the datasets and code. The repo includes everything needed to build expression probes (detecting expressed emotions) and deflection probes (detecting suppressed emotions), plus an interactive visualiser.
Datasets on HuggingFace: https://huggingface.co/datasets/ryancodrai/emotion-probes
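
For anyone unfamiliar with what a "probe" is here, a rough sketch of the general idea: take hidden-state activations for texts with and without an emotion, find a direction that separates them, and score new activations by projecting onto it. This is not the code from the repo and not necessarily the method from the paper. It uses a simple difference-of-means direction on synthetic activations, and `fit_mean_difference_probe`, `probe_score`, and `hidden_dim` are made-up names for illustration only.

```python
import numpy as np

def fit_mean_difference_probe(acts, labels):
    """Fit a difference-of-means probe: a unit direction plus a threshold.

    acts:   (n_samples, hidden_dim) array of activations
    labels: (n_samples,) array of 0/1 labels (1 = emotion present)
    """
    pos_mean = acts[labels == 1].mean(axis=0)
    neg_mean = acts[labels == 0].mean(axis=0)
    direction = pos_mean - neg_mean
    direction = direction / np.linalg.norm(direction)
    # Place the decision threshold halfway between the projected class means.
    threshold = 0.5 * (pos_mean @ direction + neg_mean @ direction)
    return direction, threshold

def probe_score(acts, direction, threshold):
    """Positive score means the probe reads the emotion as present."""
    return acts @ direction - threshold

# ASSUMPTION: synthetic stand-in for real model activations. With a real model
# you would extract hidden states at a chosen layer for labelled texts instead.
rng = np.random.default_rng(0)
hidden_dim = 64
true_dir = rng.normal(size=hidden_dim)
true_dir /= np.linalg.norm(true_dir)
labels = rng.integers(0, 2, size=400)
acts = rng.normal(size=(400, hidden_dim)) + 3.0 * labels[:, None] * true_dir

# Fit on the first 300 samples, evaluate on the remaining 100.
direction, threshold = fit_mean_difference_probe(acts[:300], labels[:300])
preds = (probe_score(acts[300:], direction, threshold) > 0).astype(int)
accuracy = (preds == labels[300:]).mean()
print(f"held-out accuracy: {accuracy:.2f}")
```

A logistic-regression probe would be a common alternative; difference-of-means is used here only because it needs nothing beyond NumPy.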