I do wonder whether Anthropic is modifying output for requests that contain these sentinel values, either via prompt modification or some of the Fable-style weight adjustments. That would be one way to try to prevent distillation, and they have shown a willingness to silently modify model behavior for user input they deem dangerous.