2 points by maziyar | 3 hours ago | 1 comment

  • maziyar | 3 hours ago
    After six experiments and dozens of failed attempts, I learned something I did not expect: activation steering, the technique Anthropic uses for AI safety, completely fails at one of the most common tasks in production LLM deployments — generating valid JSON.

    And I don't mean "fails to help." My steering-only approach achieved 24.4% valid JSON, compared to 86.8% from the untouched base model. Steering made the model worse than doing nothing at all.
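
    To make the comparison concrete, here is a minimal, hypothetical sketch of what activation steering does mechanically (toy network, made-up shapes — not my actual experimental setup): a steering vector v is added to a hidden activation h at one layer, h' = h + alpha * v, before the rest of the forward pass runs.

```python
import numpy as np

# Toy two-layer network standing in for one transformer layer.
# All names and shapes here are illustrative assumptions.
rng = np.random.default_rng(0)
d = 8
W1 = rng.normal(size=(d, d))
W2 = rng.normal(size=(d, d))

v = rng.normal(size=d)   # steering direction (e.g. a mean activation difference)
alpha = 4.0              # steering strength

def forward(x, steer=False):
    h = np.tanh(x @ W1)      # hidden activation
    if steer:
        h = h + alpha * v    # inject the steering vector mid-forward-pass
    return h @ W2

x = rng.normal(size=d)
base = forward(x)
steered = forward(x, steer=True)
# The outputs differ by alpha * (v @ W2): steering shifts the computation,
# but nothing constrains the shifted output toward valid JSON.
print(np.linalg.norm(steered - base))
```

    The point of the sketch: steering nudges internal representations in a direction, but it gives no hard guarantee about the token-level syntax of what comes out — which is exactly where it fell apart for JSON.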

    Here's what I learned, why it matters, and what actually works when you need guaranteed structured outputs from decoder-only language models.
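
    For contrast, the family of approaches that does guarantee structure is constrained (grammar-based) decoding: at every step, tokens the grammar forbids are masked out before sampling, so the output is valid by construction. A toy sketch with a hypothetical six-token vocabulary and a flat-object grammar (random scores stand in for the model's logits):

```python
import json
import random

random.seed(0)

# Hypothetical toy vocabulary; a real system compiles a grammar
# over the LLM's actual tokenizer.
GRAMMAR = {  # state -> tokens the grammar allows next
    "start":       ["{"],
    "open":        ['"key"', "}"],
    "have_key":    [":"],
    "have_colon":  ['"value"'],
    "have_value":  [",", "}"],
    "after_comma": ['"key"'],
}
NEXT_STATE = {  # (state, token) -> next state
    ("start", "{"): "open",
    ("open", '"key"'): "have_key",
    ("open", "}"): "done",
    ("have_key", ":"): "have_colon",
    ("have_colon", '"value"'): "have_value",
    ("have_value", ","): "after_comma",
    ("have_value", "}"): "done",
    ("after_comma", '"key"'): "have_key",
}

def generate():
    state, out = "start", []
    while state != "done":
        allowed = GRAMMAR[state]       # mask: only grammar-legal tokens survive
        tok = random.choice(allowed)   # stand-in for sampling from masked logits
        out.append(tok)
        state = NEXT_STATE[(state, tok)]
    return "".join(out)

s = generate()
print(s)
json.loads(s)  # parses every time, by construction
```

    Unlike steering, the guarantee here comes from the decoding loop, not from anything the model "learned" — the model only ever chooses among tokens that keep the output inside the grammar.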