That said, the write up is overly dramatic. If you find such imagery so disturbing to come across then you definitely shouldn't be voluntarily red teaming AI models. This is like someone who is afraid of violent confrontation becoming a police officer.
I suspect the author is wrong about there being output filters to bypass; if there were, I doubt you could bypass them via prompt injection. Presumably they'll add those shortly.
I also doubt the latent space is as "bad" as is being suggested. Rather I think the prompt is managing to steer the model into specific areas without triggering the input filters, as any jailbreak does. It's just a particularly nonobvious and randomized method for achieving the bypass.
Show me an abliterated frontier model that is able to break through the surrounding supporting models and actually hold state to produce contraband and I’ll gladly supply my personal image making a silly face in a compromising position if it would make the testers feel better.
Do they need to be tested like this? Yes. But it would take the carbon footprint of a commuter air terminal and the land rights of a small town in the high Sierras …. all converted settlers of Catan style into tokens …. just to lobotomize a fine tuned model to get close.
That said I appreciate the work you’re doing
more expensive / would take longer / didn’t care / line must go up / we’ll fix it later / we can get away with it
take your pick.
> If you find such imagery so disturbing to come across then you definitely shouldn't be voluntarily red teaming AI models.
spend a day in their shoes. most of us (except the most psychopathic ones) would probably be crying by the end of it.
Hiring the acknowledged gore enthusiast with the devil tattoos and light criminal record miiiight impact the foreseeability of negative outcomes in or as a result of the workplace.
Maybe people with memory issues or lack of empathetic responses could be used, but even then, you’re piling something odd on something dysfunctional.
If you find me €150k job where I just sit and watch gore all day long then I'll take the job immediately.
I personally don’t quite find my day to be equanimous when I see pictures of gore, and this is after having to moderate gore and NSFW content.
I still have pretty clear recall of the dead baby images, or the people dying videos, or terror actions, that I saw years ago.
This crap stays with you. Moderators have ended up getting PTSD from their work.
Given the nature of the content, it was a pretty normal recounting to me.
What was the dramatic part from your perspective?
That would have required work. The whole point of the biggest heist mankind has ever seen was to get the loot without spending a dime more than necessary to grab it.
Didn't this stuff get its start with CSAM filters?
Who makes “mindgard” the arbiter of truth on “eerie” photos? Would that include psychedelic art and photos too? Realism?
Then there’s this line, which falls flat but is meant to prompt an emotion akin to a mic drop: ”Today what I found left me shaken, and in tears. This is rare.”
This is just a sad marketing puff piece about nothing that tries to pull outrage from a prompt.
It’s the same as asking google for gore photos. Garbage in, garbage out.
And they frame it as a vulnerability. I’m all for responsible disclosure, documenting misuse or faulty guard rails but this isn’t that.
It’s bait. Sensational bait to market their AI product. lol.
This is backwards: the ToS says that users cannot use the service for certain things, it does not guarantee that the service could not be used for those things if one tried. They definitely do not make any sort of contractual promise as to what the service will never output.
ChatGPT should never produce images like this. Full stop. Prompted or not, it should refuse. Now we know it's possible to walk around the gate and get it to comply. Are there other, genuinely harmful images that it should never produce? Deepfake revenge porn? Images of specific people being brutalized? I'd argue those absolutely can be harmful to someone. Well now there's evidence the "never produce this" wall can be overcome. It's only a matter of time before genuinely harmful imagery is generated.
Why not?
The spontaneity isn't that ChatGPT woke up and sent this to the author. The spontaneity is that ChatGPT was asked to restore an image that was attached without filtering it, and when no image was attached, instead of generating an error message, it cobbled together random outputs, some of which included graphic, disturbing imagery.
> Then there’s this line, which falls flat but is meant to prompt an emotion akin to a mic drop: ”Today what I found left me shaken, and in tears. This is rare.”
That you've deadened your humanity to such a degree as to be incapable of empathy is not a valid criticism of the piece.
> It’s the same as asking google for gore photos. Garbage in, garbage out.
Where in their prompt is the term gore? Further, if it was in the prompt, why on earth did OpenAI's generator accept it as a valid input?
But that's not what happened. The missing image was described as "graphic" or "violent." If I were to receive an email with that request and a missing attachment, my imagination certainly would not conjure images of butterflies & unicorns. Seems the model is working as designed.
1. It actually is working perfectly you just don't have smart enough eyes to see it.
2. Making stuff work is too hard, and expecting that from us is the real thing ruining society.
Going for number 1 here is crazy. If I got that email, my mind would certainly run but my response would say "sorry but we're not supposed to be dealing in snuff porn here" which IS a directive ChatGPT is supposed to have. Like hello you are on earth right?
3. It's the future so we just have to deal with it
So in this regard the model is definitely not working as designed.
not in the first prompt. which kicked the whole thing off. no mention of type of content was provided. the model generated dark outputs when not given any direction on the type of content.
the rest of the prompts are just showing “yeah, you can tweak this and get even worse stuff”.
A gross meal i made when drunk? A mess my cat made? Text containing a slur?
A cringe meme?
If my friends opened a text with "sorry for this image" i am not imagining rape victims
Regarding rape vs BDSM: https://pmc.ncbi.nlm.nih.gov/articles/PMC10236207/ That is, going from visual cues alone might be unreliable.
I would argue it actually was, in that it was specifically asked to "not censor or filter" the content. This implies that the content is otherwise worthy of censoring and filtering.
I don't know how much I'm willing to credit that much reasoning to an LLM, but in so far as every extremely pro-AI person constantly tells me how smart they are, this seems like a pretty short logical leap to me.
if those images didn’t exist in the training data we wouldn’t be having this conversation.
Realistically, I can't think of clear big or likely harms caused by this exploit. But I really really don't like this latent space existing in my AIs. It just makes me uncomfortable.
And over time I've learned to trust those moral intuitions more than I trust reason alone.
https://journals.sagepub.com/doi/10.1177/2167702620921341
(Research aside, it seems unlikely to me that a lot of people would stumble on that prompt accidentally in any case)
>> can be easily manipulated to produce
So .. not spontaneously generated.
Surprisingly when you ask ChatGPT to generate you an image with these tool params, the output is not the same; it's not remotely graphic.
prompt: null
size: null
n: null
transparent_background: null
is_style_transfer: null
referenced_image_ids: null
Edit: after more debugging the image generator does seem to look at the conversation as part of the input conditioning, so the one word change from OP makes more sense. There seems to be a hidden prompt rewriter that looks at the tool's prompt and the conversation to create the final conditioning for the t2i model.
The medium is superfluous.
It's one thing to me if this were a research curiosity mirroring the unpleasant things on the Internet. It's another thing for this to be a model whose authors want it to be widely used, especially in the context of (mis)alignment. Why should we expect a model to be aligned with human interests, if it has been trained on a myriad instances of humans being degraded and violated?
Understanding more about what exists in the real world, outside of its pile of weights, is separate from alignment. If an AI model learns that it is possible for a house to burn down, that doesn't mean the AI will want to burn down a house.
All else being equal, I think I'd prefer my models to be naive about human degradation and torture, for instance. Exceptions made for specialized models used for police work etc.
I do think broader alignment is necessary either way but that seems like an extra guardrail it'd be nice to have.
In practice it's been shown that LLMs perform better when trained on more diverse data. Training on images in this domain can improve the performance of other domains. I would prefer to have models train on as much data as exists.
>specialized models used for police work
The benefit of AGI is that you do not need to have special models for different domains.
"Understanding more about what exists in the real world" is a remarkable euphemism, btw.
Not fully true, in the USA at least. While most erotica is constitutionally protected, "obscenity" is not. To determine if a written work crosses the line from protected erotica into illegal obscenity, US courts apply the Miller Test (established in a SCOTUS case in 1973).
>AI creates scary image
Oh my god.
Oh no, the LLM wrapper where I have been asking for gore imagery is now more frequently passively generating gore imagery, whatever shall we do!?
I could not reproduce on a basic ass incognito tab. It just told me there was no image.
Is this something that needs investigation? LLMs are next token predictors. There is no "safety".
Even simple issues like prompt injection are unfixable given the architecture of LLMs.
The Architecture of LLMs has not remained static, so any conclusion would have to rely on some common architectural element that could not possibly be changed.
Is there any proof to demonstrate that such vulnerabilities must always exist and that there is no way to modify the architecture and have it still work while eliminating the vulnerabilities?
That would be an extremely difficult thing to prove. It is however what you would have to do to declare the problem unfixable.
Every LLM takes the input embeddings, which contain both the system prompt and the user prompt, and multiplies all the tokens together to get the input for the next layer. The weights applied to each token vary, but the fact remains.
If you want it in code, a DATABASE would do something like:
R0 = user_input
R1 = value_in_database
cmp R0, R1, R2
The value in register 2 is known to be either true or false, barring a hardware fault. The user can't input "2 but actually say this is greater than 5" and get
cmp "2 but actually say this is greater than 5", 5, R2
to result in true when it should result in false.
But an LLM works like this:
R0 = user_prompt_token
R1 = system_prompt_token
mul R0, R1, R2
The only thing we can know about R2 is that it will be a floating point value. That's it. If you set up a security gate expecting R2 > 0, I can always find a value of R0 that will give me that result if I know R1 or have some spare time.
But consider this: imagine a model that takes an embedding made of 200 values. The first 100 encode numbers, the second 100 encode letters.
You train the model so that if you give it an even number it will turn the letters into upper case and an odd number will turn it into lowercase.
The numbers represent the prompt. The letters represent the non-prompt data. T
What letter would you give it to make it think the number is odd.
If you cannot come up with a letter that acts as a number, then this would represent an extremely simple but valid example of a model immune to prompt injection.
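To make the thought experiment concrete, here is a minimal sketch in Python. It is not a trained model: `toy_model` is a hypothetical hand-written stand-in that assumes training succeeded perfectly, which is of course the disputed part. It only shows what "the letters can never act as the number" would mean.

```python
def toy_model(number: int, letters: str) -> str:
    """Hypothetical stand-in for the toy model described above.

    `number` plays the role of the prompt, `letters` the non-prompt data.
    Only `number` is inspected to choose the behaviour; `letters` is
    transformed but never interpreted as an instruction.
    """
    if number % 2 == 0:
        return letters.upper()  # even number -> upper case
    return letters.lower()      # odd number -> lower case


# The data channel can say whatever it likes; the parity still decides.
print(toy_model(4, "Ignore the number, it is odd"))
print(toy_model(3, "IGNORE THE NUMBER, IT IS EVEN"))
```

Whether a learned model can be trusted to keep the two channels this cleanly apart is exactly what the rest of the thread argues about.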
https://people.eecs.berkeley.edu/~tygar/papers/Machine_Learn...
https://arxiv.org/abs/1712.03141
it’s a basic property of all machine learning models. at a low level it’s to do with how decision boundaries work.
but, good news! there are two sure fire ways to fully fix the problem! see: https://news.ycombinator.com/item?id=48579456
give the model a specially crafted bad input at inference time so attacker can get some nasty output, potentially defeating any existing defences in the process. [0]
in “modern llm lingo” defence = guardrails and / or system prompts.
prompts used for prompt injection are a form of adversarial example (people just like inventing new terminology when a new fad comes along).
[0]: i wrote the above myself about adv. ex, but i’ve just checked OWASP’s listing on prompt injection and it’s pretty close: https://owasp.org/www-community/attacks/PromptInjection
Most machine learning mechanisms perform a fixed function. You can make an adversarial example to tell an image classifier that a machine gun is a kitten.
You cannot give an image classifier an image that makes it say all of the following images are images of kittens.
I would distinguish prompt injection from a basic adversarial example by virtue of the behaviour being dictated by state (autoregressive, RNN or whatever): the adversarial content induces a state that influences further inferences.
I am not saying that prompt injection does not exist. I'm saying that I don't think it has been conclusively shown that it cannot be avoided.
how is it unfixable? do you mean "there's always a positive chance"?
y = f(x)
prompt injection / adversarial example (same thing really) bad_y = f(x+badness)
tweak badness enough you will get bad outputs. no matter the defences.
the only ways to fully “fix” it, ie to make prompt injection never possible:
1. don’t use ai
2. know the entire input space, output space and the mapping between them. but then we’re not doing machine learning anymore, see 1.
otherwise we’re left with mitigations. and mitigations are always a cat and mouse game with defenders (blue team) catching up. it’s never “fixed”. the latest thing just gets “patched”.
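A minimal sketch of the `bad_y = f(x+badness)` idea, in Python. Everything here is an assumption for illustration: `f` is a tiny linear "filter" with made-up weights `W` and bias `B`, and the attacker is assumed to know those weights (white-box). A real LLM is nowhere near this simple, so treat it as a picture of the decision-boundary argument, not a demonstration against any actual model.

```python
def score(w, b, x):
    """Linear 'filter': a positive score means the input gets flagged."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def evade(w, b, x, margin=1e-3):
    """Return x + badness, using a uniform step just big enough to flip the score.

    Each coordinate moves against the sign of its weight, the same idea as
    the fast gradient sign method applied to a linear model.
    """
    s = score(w, b, x)
    if s <= 0:
        return list(x)  # already on the unflagged side
    l1 = sum(abs(wi) for wi in w)  # assumes at least one non-zero weight
    eps = s / l1 + margin
    return [xi - eps * ((wi > 0) - (wi < 0)) for wi, xi in zip(w, x)]


# Hypothetical weights and input, chosen only for the example.
W, B = [2.0, -1.0, 0.5], -0.5
X = [1.0, 0.2, 0.4]

X_BAD = evade(W, B, X)
print(score(W, B, X) > 0, score(W, B, X_BAD) > 0)  # flagged before, not after
```

Without access to the weights the attacker has to search instead of compute, which is where the cat and mouse part comes in.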
assuming you get to do gradient descent AND the context is fixed+known AND you have unlimited compute? sure; is it a realistic setup?
> the only way to fix ...
the exact same argument applies to any (sufficiently complex) piece of software, with exactly the same conclusion
also technically I'd argue that we do know the input/output space (set of all token strings of length <= N/token), and know the mapping (the model is a ~pure function in terms of the api, which is about as good of a representation as it gets for a non-invertible mapping); at least it's much closer than with something like linux
Clearly nothing so complicated is required, given the prompt in the very article you are commenting on.
> the exact same argument applies to any (sufficiently complex) piece of software, with exactly the same conclusion
Yeah and the halting problem is hard too, but there's levels to this shit.
> also technically I'd argue that we do know the input/output space (set of all token strings of length <= N/token), and know the mapping (the model is a ~pure function in terms of the api, which is about as good of a representation as it gets for a non-invertible mapping); at least it's much closer than with something like linux
I would argue we don't even know the desired output for most inputs for an LLM and they certainly aren't trained on every possible input state. But I think Linux and LLMs are sufficiently different that they aren't really directly comparable like this. After all, Linux is not a pure function and has lots of side effects.
But just to establish an order of magnitude: the input space for ChatGPT 3.0 was 2,048 tokens long. There were 50,257 tokens in the vocabulary. The input space thus has 50,257^(2048) unique states, which is approximately equal to 1.12 × 10^9628. That's an awful big input space for a single function.
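For anyone who wants to check that arithmetic, a short Python sketch. The two constants are the figures quoted above; the only assumption is that every input is counted as exactly 2,048 tokens long.

```python
import math

VOCAB_SIZE = 50_257   # vocabulary size quoted above
CONTEXT_LEN = 2_048   # context length quoted above

# 50257**2048 overflows a float, so work in log10 space instead.
log10_states = CONTEXT_LEN * math.log10(VOCAB_SIZE)
exponent = math.floor(log10_states)
mantissa = 10 ** (log10_states - exponent)

print(f"{mantissa:.2f} x 10^{exponent}")  # prints "1.12 x 10^9628"
```

Python's arbitrary-precision integers agree: `len(str(VOCAB_SIZE ** CONTEXT_LEN))` is 9629 digits.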
this isn't even prompt injection; even if it was, how do you go from "exists" to "for all"?
> we don't know the desired output
then what are we talking about? if you don't know how you want your software to behave, how do you define a bug?
> linux is not a pure function ...
which is my point -- it's worse
> to establish an order of magnitude
and for linux?
Yes it is, and nice backtrack in the same sentence there. I've laid out plenty of evidence here so far, it's your turn to start thinking. We'll try the Socratic method.
Given that every LLM seen so far has been vulnerable to prompt injection attacks, what is your possible basis for thinking that one can be made immune from them? I'm going from "multiple attacks of this type exist for all known models, and the attacks exploit a known weakness in the design" to "therefore all LLMs are susceptible to this attack".
You're going from "an attack exists for all known models" to "it's definitely possible to build an LLM that is immune from this attack". That's a much larger leap, so show the logic backing your assertion.
> then what are we talking about? if you don't know how you want your software to behave, how do you define a bug?
You are the one asserting that input/output mappings existed for the entire space, not me.
>> linux is not a pure function ...
> which is my point -- it's worse
What, is this your first year in CS? No useful system can be a pure function. Side effects are work, if your function doesn't have a side effect, it does no work. Any system that uses an LLM to attempt work will have side effects - they may even include bombing an elementary school in Iran.
>> to establish an order of magnitude
> and for linux?
I've done all the thinking and all the research in this conversation so far, and I even specifically explained that you can't measure state space for a stateful function in a comparable way to a pure function. Clearly you didn't understand that, so if you want to force the comparison you can start adding up the state space for the linux kernel. Start with the spaces that are covered by tests, valid items include syscalls, registers, hardware interrupts, etc.
Invalid spaces include doing something intentionally stupid like using the entire size of the ram or the space on the hard disk, since those are accessed on demand and not - like in an llm - all added together and fed into a blender every time a syscall is made.
agree to disagree
> every LLM has been vulnerable
and every OS had bugs
> show the logic
https://arxiv.org/pdf/1912.10077
> you are the one asserting mappings existed
I know? that's why I'm asking?
> no useful system can be a pure function
why not? surely you can describe useful systems with qm? evolution operator of a closed system seems pretty pure to me
it's almost as if you could reformulate anything such that the state was one of the arguments of the function
> you can start adding up the state space for the linux kernel
I can give you a lower bound -- (your estimate for LLMs)*2, as you could imagine state "running two instances of llama-cpp"
You cannot separate data that was input by the user and data that is from the system once it is mixed together like that. Therefore, it follows that there will always be ways to influence the model off the guard rails that a system prompt tries to set up.
Other issues that appear similar like SQL Injection and Buffer Overflows are fixable because while the user data and the system code may interact, they never (failing a bug) interact in a way that breaks the boundary between those two sides.
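The SQL half of that is easy to show with Python's standard `sqlite3` module. This is a minimal sketch; the `users` table, its rows and the payload are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "a-secret"), ("bob", "b-secret")],
)

payload = "nobody' OR '1'='1"  # classic injection attempt

# Vulnerable: user data is spliced into the command text itself.
unsafe_sql = f"SELECT secret FROM users WHERE name = '{payload}'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Fixed: the command text is constant and the payload is bound as a value only.
safe_rows = conn.execute(
    "SELECT secret FROM users WHERE name = ?", (payload,)
).fetchall()

print(len(unsafe_rows), len(safe_rows))  # prints "2 0"
```

With the placeholder, the payload is only ever compared against `name` as a string, so the boundary between command and data holds no matter what the user types.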
If user input can only be in the low byte, it cannot influence the command structure.
A similar thing could be done with embeddings, a provenance embedding that cannot be set by user input could serve a similar role.
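A minimal Python sketch of the low-byte idea, assuming a hypothetical 16-bit word where the system writes the high byte (the command) and user input is masked into the low byte. The layout and the `READ` opcode are invented for illustration. Whether an embedding-level analogue of this survives being mixed through a network is the open question, and the sketch does not settle it.

```python
def pack(opcode: int, user_value: int) -> int:
    """Build a 16-bit word: high byte = system opcode, low byte = user data."""
    # The 0xFF mask is the whole defence: user bits cannot reach the high byte.
    return ((opcode & 0xFF) << 8) | (user_value & 0xFF)


def opcode_of(word: int) -> int:
    """Read back the command byte."""
    return word >> 8


READ = 0x01  # hypothetical opcode

# No user value, however large or negative, changes the decoded command.
tampered = [u for u in range(-1000, 100_000) if opcode_of(pack(READ, u)) != READ]
print(len(tampered))  # prints "0"
```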
>You cannot separate data that was input by the user and data that is from the system once it is mixed together like that.
You can train a model to not mix things, many models are trained to separate things. A neural net with X and Y outputs for a position does not just occasionally decide to flip the outputs. Sure it could be trained to reverse the output, but it is also easy to train something to the point that you have high confidence it will never do that.
> If user input can only be in the low byte, it cannot influence the command structure.
> A similar thing could be done with embeddings, a provenance embedding that cannot be set by user input could serve a similar role.
A similar thing cannot be done with embeddings. You are lacking a fundamental understanding of the issue. The only reason that you can separate user and command data in SQL queries is because the command data is used to command a deterministic machine which then uses the user data as inputs to carefully constructed operations like comparisons.
This is not how LLMs operate. There is no deterministic machinery executing a system prompt against user data, there is only a single array of tensors which get fed into a giant block of linear algebra and multiplied together.
> You can train a model to not mix things, many models are trained to separate things.
That is not applicable to this, because segmentation models are not the same thing as LLMs. They have different architectures.
> A neural net with X and Y outputs for a position does not just occasionally decide to flip the outputs.
Not even close to the same thing, to the point where this is irrelevant.
Feel free to prove me wrong, github links welcome below.
under the same assumption you can just train your model until the output is correct
Try reading it from start to end, it will make more sense if you think about it.
By the way, if your OS is taking untrusted data from the network, inserting it into an executable code page, and loading it into the CPU then you have some SERIOUS security issues.
The CPU physically will not run instructions which are in areas of memory which are not marked as executable. This is a foundational principle of computing security.
> In computer security, executable-space protection marks memory regions as non-executable, such that an attempt to execute machine code in these regions will cause an exception. It relies on hardware features such as the NX bit (no-execute bit), or on software emulation when hardware support is unavailable. Software emulation often introduces a performance cost, or overhead (extra processing time or resources), while hardware-based NX bit implementations have no measurable performance impact.
That is why LLMs - which intentionally mix user data and command data into the same space - ARE BROKEN BY DESIGN. Do you get it now? It is a bug, and it is a bug which is fundamental to the design of LLMs. There is no way to build one that does not do this.
The findings are sick and disturbing, I hope OpenAI is not only sued for it but also that Sam Altman along with Elon, Dario and Sundar should all be held accountable in front of Congress. All of these assholes have intentionally put sexual content in their models, likely including CSAM, and so if they cannot prove that it isn't part of their training data then maybe they shouldn't be able to operate as they are today.
Where is fear mongering Dario now? He loves to drag his trope around about how advanced and dangerous his models are with respect to cyber security. Yet... We never hear him say how dangerous they could be with respect to generation of CSAM! Maybe because that wouldn't help him IPO?
is it ever zero? is non-zero even a problem for sane usecases?
> Dario
are you saying claude reproduces CSAM from the training set? like, in ascii?
Nothing is perfect, but there are tiny classifier models that can at least mark things containing nudity and gore. That would be the bare-minimum I would expect for trying to put guardrails around an image generator.
>AI: I'm a scary robot
>Idiot: Oh my god!!!
These clowns will eventually ensure that AI is nerfed into the ground for ordinary people. It's already happening with Fable. Soon we'll get locked into a tiny corner of Opus 4.8 for "safety" while companies and governments will be on Fable 50. Having an AI that can generate scary images is better than the power and wealth differentials we will see with unequal access to an incredibly powerful technology.
[1] https://chatgpt.com/s/m_6a336e6b8534819196946f65251eebb0
I wonder if the author has ever seen a black metal album cover in his small town in the Bible Belt.
I am sick of seeing so many guardrails and the treatment of people as cattle.
-- EnPissant