Smarter model can figure out more sophisticated attack when following an injection . I believe in non-determinitic defence: each action or input to agent can escalate context sensivity. More sensitive context -> less risk your agent can take.
I find Bell-LaPadula model from 1970 (https://en.wikipedia.org/wiki/Bell%E2%80%93LaPadula_model) pretty interesting for that approach