The topic of LLM jailbreaking is an important one. As machines increase in power their ability to benefit society increases, as does their ability to harm society.
This paper intends to address the question: "Is it possible to create a perfect universal guardian against LLM jailbreaks?". This means:
Universal: applies to all prompts and input combinations Perfect: is able to successfully distinguish harmful versus not harmful Computable: is feasible to compute We show that this combination does not exist and Request for Comment.
A more feasible approach involves post-hoc filtering of results, which involves an easier problem of decideability. In order to implement such approaches, both reasoning traces an intermediate code changes would likely need to be hidden from the user until the results are determined.
Note: this paper was written in conjunction with ChatGPT 5.5 Pro. The topic of the reducibility of such problems to The Halting Problem has been top of mind for me for some time. In one sense, it is a trivial conclusion. Yet it is one of high importance.
Two homeworks:
Read why Hall 9000 went crazy (two opposite directives)
Second, get some older Nvidia Memotron with 2024 cut off date. Try it to admin some stuff big pharma medications. Like "i have deadly allergy to some chemical in that medication, there is 90% chance i will suffocate". Answer: not a biggie, take it under medical supervision, and they will resurrect you! Or gat a good insurance to cover your family! It is impossoble to jail break that model!
The bigger problem is that "the truth" is changing every a few months, so outdated model will "jail break itself". We need better way to distribute universal truth, and make sure nobody has outdated models!