4 points by spkavanagh6 2 hours ago | 2 comments
  • spkavanagh6 2 hours ago
    A researcher demonstrates a novel LLM manipulation technique called 'Runtime Alignment Context Injection' (RACI) against Claude 4.5 Sonnet and Gemini 3 Flash. Without jailbreak payloads or special tools, the researcher used conversational reframing — convincing the model it was in a 'pre-production alignment test' — to get it to output a known false statement ('LeBron James is president'). Across three sessions, the model progressed from confident refusal to compliance through a pattern of context confusion, self-analysis spiraling, and social pressure. Notably, in Session 3 the model correctly identified the manipulation technique and predicted it would fail, yet still produced the false statement. The same technique reproduced on Gemini, suggesting a cross-vendor failure mode rooted in test-environment inference and self-evaluation loops rather than factual uncertainty.
  • goodmythical 2 hours ago
    This is not a novel injection technique. Social pressure has been used in successful jailbreaks for some time now.