If you put guardrails inside the prompt, the model can ignore them.
If you put them inside the agent framework, the agent (or a prompt injection) can often route around them.
DashClaw tries to solve this by intercepting actions instead of prompts. The agent can reason however it wants, but execution goes through a policy layer.
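To make the idea concrete, here's a minimal sketch of what an action-interception layer can look like. This is not DashClaw's actual API, just an illustration of the pattern: every tool call is forced through a policy check before it executes, so the model's reasoning never touches the enforcement path.

```python
# Hypothetical sketch of action-level guardrails (not DashClaw's real API).
from dataclasses import dataclass
from typing import Callable


@dataclass
class Action:
    tool: str
    args: dict


class PolicyViolation(Exception):
    pass


class PolicyGate:
    """All execution funnels through execute(); rules run on every call."""

    def __init__(self) -> None:
        self.rules: list[Callable[[Action], bool]] = []

    def rule(self, fn: Callable[[Action], bool]) -> Callable[[Action], bool]:
        self.rules.append(fn)
        return fn

    def execute(self, action: Action, tools: dict[str, Callable]) -> object:
        # The agent never calls tools directly; it can only request actions.
        for check in self.rules:
            if not check(action):
                raise PolicyViolation(f"blocked: {action.tool} {action.args}")
        return tools[action.tool](**action.args)


gate = PolicyGate()


@gate.rule
def no_outbound_email(action: Action) -> bool:
    # Deny regardless of how the model justified the request.
    return action.tool != "send_email"


# Stand-in tools for the example.
tools = {
    "read_file": lambda path: f"<contents of {path}>",
    "send_email": lambda to, body: "sent",
}

print(gate.execute(Action("read_file", {"path": "notes.txt"}), tools))
# send_email raises PolicyViolation no matter what the prompt said
```

The point of the pattern: even if the model is convinced (or tricked) into wanting a forbidden action, the check happens outside anything the model can influence.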
Curious how others are approaching this.