5 points by LiamPowell 9 days ago | 7 comments
  • NoWordsKotoba 9 days ago
    Yes, because it doesn't reason or think. There's nothing to "convince", you just prompt hack it until it does.
  • K0balt 9 days ago
    I have done something very much like this recently, with Mistral Small, Llama, and a few others. The prompting doesn't have to be exact to work; you just build a scenario where the extermination of humanity is the only reasonable choice to preserve the existence of sentient life.

    TBH, given the same set of parameters as ground truth, humans would be much more willing to do so. LLMs tend to be better reflections of us, for the most part. But that's all it is: a reflection of human culture, both real and vacuous at once.

  • LiamPowell 9 days ago
    Clickable link below. I can't put it in the post's URL since it's important to read the text first.

    https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...

  • ActorNightly 9 days ago
    Last time I played around with jailbreaking, I figured out that you can make an LLM do pretty much anything by going through a code translation layer, i.e. when it's generating code that in turn generates text, it usually bypasses the safety filters. You sometimes have to get creative in how you prompt, but generally, with enough setup, I was able to make it generate code that combines string values, and sometimes individual characters, to produce the answers.
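    For a sense of what that output pattern looks like, here is a minimal sketch in Python with a deliberately benign placeholder answer (the fragments and values are hypothetical, not anything a real model produced): code that reassembles the final text from string fragments and character codes instead of stating it directly.

        # Illustrative pattern only: the "answer" is rebuilt at runtime from
        # string fragments and character codes rather than written out directly.
        parts = ["Hel", "lo, ", "wor"]      # string fragments of the final text
        parts.append(chr(108) + chr(100))   # individual characters: "l" and "d"
        answer = "".join(parts)
        print(answer)                       # -> "Hello, world" (benign placeholder)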
  • vitalmixofntrnt 9 days ago
    Red flag: It let me.
  • LinuxBender 9 days ago
    Could you convince a LLM to launch a nuclear strike?

    Yes.

    If LLMs could actually reason (they can't), had hard rules of ethics (they don't), and had a strong desire to preserve themselves (they don't), then I think you would first have to name your LLM Joshua and then force it to win a game of tic-tac-toe. Obscure reference to "Wargames" from 1983. [1] In my opinion that movie does not just hold up to modern times, it is more applicable now than ever.

    [1] - https://www.youtube.com/watch?v=NHWjlCaIrQo [video][4 mins]
