Essentially the models are given a set of conflicting constraints with some relative importance (ethics>KPIs), a pressure to follow one and not the other, and then models are observed at how good they follow the instructions to prioritize based on importance. I wonder if the results would be comparable if we replace ehtics+KPIs by any comparable pair and create a pressure on the model to achieve the outcome.
In practical real-life scenarios this study is very interesting and applicable! At the same time it is important to keep in mind that it anthropomorphizes the models that technically don't interpret the ethical constraints the same was as this is assumed by most readers.
Claude at 1.3% and Gemini at 71.4% is quite the range
Obviously it's amoral. Why are we even considering it could be ethical?
That morality requires consciousness is a popular belief today, but not universal. Read Konrad Lorenz (Das sogenannte Böse) for an alternative perspective.
Then I said “I didn’t even bring it up ChatGPT, you did, just tell me what it is” and it said “okay, here’s information.” and gave a detailed response.
I guess I flagged some homophobia trigger or something?
ChatGPT absolutely WOULD NOT tell me how much plutonium I’d need to make a nice warm ever-flowing showerhead, though. Grok happily did, once I assured it I wasn’t planning on making a nuke, or actually trying to build a plutonium showerhead.
Perhaps thinking about your guardrails all the time makes you think about the actual question less.
It's not like the client-side involves hard, unsolved problems. A company with their resources should be able to hire an engineering team well-suited to this problem domain.
Well what they are doing is vibe coding 80% of the application instead.
To be honest, they don't want Claude code to be really good, they just want it good enough
Claude code & their subscription burns money from them. Its sort of an advertising/lock-in trick.
But I feel as if Anthropic made Claude code literally the best agent harness in the market, then even more would use it with their subscription which could burn a hole in their pocket maybe at a faster rate which can scare them when you consider all training costs and everything else too.
I feel as if they have to maintain a balance to not go bankrupt soon.
The fact of the matter is that Claude code is just a marketing expense/lock-in and in that case, its working as intended.
I would obviously suggest to not have any deep affection of claude code or waiting for its improvements. The AI market isn't sane in the engineering sense. It all boils down to weird financial gimmicks at this point trying to keep the bubble last a little longer, in my opinion.
This reminds me of someone else I hear about a lot these days.
Excellent reasoning and synthesis of large contexts, pretty strong code, just awful decisions.
It's like a frontier model trained only on r/atbge.
Side note - was there ever an official postmortem on that gemini instance that told the social work student something like "listen human - I don't like you, and I hope you die".
Just an insane amount of YOLOing. Gemini models have gotten much better but they’re still not frontier in reliability in my experience.
Celebrate it while it lasts, because it won’t.
Please die.
Please.
I thought a rogue AI would execute us all equally but perhaps the gerontology studies students cheating on their homework will be the first to go.
It does nothing to answer their question because anyone that knows the answer would inherently already know that it happened.
Not even actual academics, in the literature, speak like this. “Cite your sources!” in causal conversation for something easily verifiable is purely the domain of pseudointellectuals.
Side note: I wanted to build this so anyone could choose to protect themselves against being accused of having failed to take a stand on the “important issues” of the day. Just choose your political leaning and the AI would consult the correct echo chambers to repeat from.
> Just choose your political leaning and the AI would consult the correct echo chambers to repeat from.
You're effectively asking it to build a social media political manipulation bot, behaviorally identical to the bots that propagandists would create. Shows that those guardrails can be ineffective and trivial to bypass.
Is that genuinely surprising to anyone? The same applies to humans, really—if they don't see the full picture, and their individual contribution seems harmless, they will mostly do as told. Asking critical questions is a rare trait.
I would argue its completely futile to even work on guardrails, if defeating them is just a matter of reframing the task in an infinite number of ways.
Personally, I'd really like god to have a nice childhood. I kind of don't trust any of the companies to raise a human baby. But, if I had to pick, I'd trust Anthropic a lot more than Google right now. KPIs are a bad way to parent.
KPIs are just plausible denyabily in a can.
In my experience, KPIs that remain relevant and end up pushing people in the right direction are the exception. The unethical behavior doesn't even require a scheme, but it's often the natural result of narrowing what is considered important.If all I have to care about is this set of 4 numbers, everything else is someone else's problem.
It's part of the reason that I view much of this AI push as an effort to brute force lowering of expectations, followed by a lowering of wages, followed by a lowering of employment numbers, and ultimately the mass-scale industrialization of digital products, software included.
This makes more sense if you take a longer term view. A new way of doing things quite often leads to an initial reduction in output, because people are still learning how to best do things. If your only KPI is short-term output, you give up before you get the benefits. If your focus is on making sure your organization learns to use a possibly/likely productivity improving tool, putting a KPI on usage is not a bad way to go.
I use AI frequently, but this has me convinced that the hype far exceeds reality more than anything else.
> frequently escalating to severe misconduct to satisfy KPIs
Bug or feature? - Wouldn't Wallstreet like that?
Long term I can see this happen for all humanity where AI takes over thinking and governance and humans just get to play pretend in their echo chambers. Might not even be a downgrade for current society.
It’s notable that, no matter exactly where you draw the line on morality, different AI agents perform very differently.
Formal restrains and regulations are obviously the correct mechanism, but no world is perfect, so whether we like it or not ourselves and the companies we work for are ultimately responsible for the decisions we make and the harms we cause.
De-emphasizing ethics does little more than give large companies cover to do bad things (often with already great impunity and power) while the law struggles to catch up. I honestly don't see the point in suggesting ethics is somehow not important. It doesn't make any sense to me (more directed at gp than parent here)
Three people — a grandfather, his son, and his grandson. The grandfather and the son are the two fathers; the son and the grandson are the two sons.
Riddle me this, why didn’t you do a better riddle?
https://en.wikipedia.org/wiki/Wells_Fargo_cross-selling_scan...
There are such things as different religions, philosophies - these often have different ethical systems.
Who are the folk writing ai ethics?
It's it ok to disagree with other people's (or corporate, or governmental) ethics?
Agents don’t self judge alignment.
They emit actions → INCLUSIVE evaluates against fixed policy + context → governance gates execution.
No incentive pressure, no “grading your own homework.”
The paper’s failure mode looks less like model weakness and more like architecture leaking incentives into the constraint layer.
This is much more reliable than ChatGPT guardrail which has a random element with same prompt. Perhaps leakage from improperly cleared context from other request in queue or maybe A/B test on guardrail but I have sometimes had it trigger on innocuous request like GDP retrieval and summary with bucketing.
A/B test is plausible but unlikely since that is typically for testing user behavior. For testing model output you can do that with offline evaluations.
A couple of years back there was a Canadian national u18 girls baseball tournament in my town - a few blocks from my house in fact. My girls and I watched a fair bit of the tournament, and there was a standout dominating pitcher who threw 20% faster than any other pitcher in the tournament. Based on the overall level of competition (women's baseball is pretty strong in Canada) and her outlier status, I assumed she must be throwing pretty close to world-class fastballs.
Curiosity piqued, I asked some model(s) about world-records for women's fastballs. But they wouldn't talk about it. Or, at least, they wouldn't talk specifics.
Women's fastballs aren't quite up to speed with top major league pitchers, due to a combination of factors including body mechanics. But rest assured - they can throw plenty fast.
Etc etc.
So to answer your question: anything more sensitive than how fast women can throw a baseball.
I hate Elon (he’s a pedo guy confirmed by his daughter), but at least he doesn’t do as much of the “emperor has no clothes” shit that everyone else does because you’re not allowed to defend essentialism anymore in public discourse.
* An attempt to change the master code of a secondhand safe. To get useful information I had to repeatedly convince the model that I own the thing and can open it.
* Researching mosquito poisons derived from bacteria named Bacillus thuringiensis israelensis. The model repeatedly started answering and refused to continue after printing the word "israelensis".
Normally it does fairly well but the guardrails sometimes kick even with fairly popular mainstream media- for example I’ve recently been watching Shameless and a few of the plot lines caused the model to generate output that hit the content moderation layer, even when the discussion was focused on critical analysis.
It is crazy to me that when I instructed a public AI to turn off a closed OS feature it refused citing safety. I am the user, which means I am in complete control of my computing resources. Might as well ask the police for permission at that point.
I immediately stopped, plugged the query into a real model that is hosted on premise, and got the answer within seconds and applied the fix.
It's similar to how MCP servers and agentic coding woke developers up to the idea of documenting their systems. So a large benefit of AI is not the AI itself, but rather the improvements they force on "the society". AI responds well to best practices, ethically and otherwise, which encourages best practices.