I disagree in the case of LLMs.
AI already has a massive problem with reproducibility and reliability, and AI firms gleefully kick this problem down to the users. "Never trust its output".
It's already enough of a pain in the ass to constrain these systems without the companies silently changing things around.
And this also pretty much ruins any attempt to research Claude Code's long term effectiveness in an organisation. Any negative result can now be thrown straight into the trash because of the chance Anthropic put you on the wrong side of an A/B test.
> That being said, vastly reducing an LLM's effectiveness as part of an A/B test isn't acceptable, which appears to be the case here.
The open question here is whether or not they were doing similar things to their other products. Claude Code shitting out a bad function is annoying but should be caught in review.
People use LLMs for things like hiring. An undeclared A/B test there would be ethically horrendous and a legal nightmare for the client.
Or, you could, you know, try to understand your users without experimenting on them, like countless others have managed to do before, and still shipped "great products".
Where can I sign up?
Edit: how to disable auto updates of the client app https://code.claude.com/docs/en/setup#disable-auto-updates
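Per the linked docs (at the time of writing; check the page in case the names have changed), either of these should stop the client from silently updating itself:

```shell
# Option 1: environment variable (e.g. in ~/.bashrc or ~/.zshrc)
export DISABLE_AUTOUPDATER=1

# Option 2: in ~/.claude/settings.json
# { "autoUpdates": false }
```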
LLMs are non-deterministic anyway, as you note above with your comment on the 'reproducibility' issue. So any research into CC's long-term effectiveness would already have taken into account that you can run it 15x in a row and get a different response every time.
These are two very different things. I suspect that in some cases pointing the finger at a black box instead of actually explaining your decisions can actually shield you from legal liability...
You can do A/B testing by splitting your audience into groups, having some of them use A and others use B - all the time.
I think the article's author is frustrated over sometimes getting A and at other times B, and not knowing which one he is on.
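A typical way that split is implemented is deterministic hashing of the user ID, so each user stays in one bucket for the lifetime of an experiment. A sketch (function and experiment names are made up):

```python
import hashlib

def ab_bucket(user_id: str, experiment: str, variants=("A", "B")) -> str:
    """Deterministically assign a user to a variant.

    Hashing (experiment + user_id) means each user always lands in the
    same bucket for a given experiment, while assignments are
    independent across different experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# The same user always gets the same variant for the same experiment:
print(ab_bucket("user-42", "plan-mode-prompt"))
```

The frustration in the article is exactly this property from the user's side: your assignment is stable and invisible, so you can't tell whether the behaviour you see is "the product" or "your bucket".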
Evil might be a stretch, but I really hate A/B testing. Some feature or UI component you relied on is now different, with no warning, and you ask a coworker about it, and they have no idea what you're talking about.
Usually, the change is for the worse, but gets implemented anyway. I'm sure the teams responsible have "objective" "data" which "proves" it's the right direction, but the reality of it is often the opposite.
In my experience all manner of analytics data frequently gets misused to support whatever narrative the product manager wants it to support.
With enough massaging you can make “objective” numbers say anything, especially if you do underhanded things like bury a previously popular feature three modals deep or put it behind a flag. “Oh would you look at that, nobody uses this feature any more! Must be safe to remove it.”
It's not "unexpected" but it is still unethical. In ye olde days, you had something like "release notes" with software, and you could inform yourself what changed instead of having to question your memory "didn't there exist a button just yesterday?" all the time. Or you could simply refuse to install the update, or you could run acceptance tests and raise flags with the vendor if your acceptance tests caused issues with your workflow.
Now with everything and their dog turning SaaS for that sweet sweet recurring revenue and people jerking themselves off over "rapid deployment", with the one doing the most deployments a day winning the contest? Dozens if not hundreds of "releases" a day, and in the worst case, you learn the new workflow only for it to be reverted without notice again. Or half your users get the A bucket, the other half gets the B bucket, and a few users get the C bucket, so no one can answer issues that users in the other bucket have. Gaslighting on a million people scale.
It sucks, and I wish everyone doing this nothing but debilitating pain in their life. Just a bit of revenge for all the pain you caused your users in the endless pursuit of 0.0001% more growth.
No. Users aren't free test guinea pigs. A/B testing cannot be done ethically unless you actively point out to users that they are being A/B tested and offer them a way to opt out, but that in turn ruins a large part of the promise behind A/B tests.
Enshittification is coming for AI.
Planning serves two purposes - helping the model stay on track and helping the user gain confidence in what the model is about to do. Both sides of that are fuzzy, complex and non-obvious!
I think I'd be okay with a smaller but more narratively detailed plan - it's not so much about verbosity, more about me understanding what is about to happen and why. There wasn't much dialogue once plan mode kicked in (i.e. no Q&A): it would jump into its own planning and idle until all I saw was a set of projected code changes.
But on the other hand they are so useful for boilerplate, and for quickly connecting you with verbiage that might guide you to the correct path faster than conventional means. Like a clueless CEO type just spitballing terms they do not understand, but still nudging something in your thought process.
But you REALLY need to know your stuff to begin with for them to be of any use. Those who think they will take over are clueless.
You're underestimating where it's headed.
Not sure. I am not so optimistic. People got intoxicated with nuclear-powered cars, flying cars, bases on the moon, etc. All that technological euphoria from the '50s and '60s never panned out. This might be like that.
I think we definitely stumbled on something akin to the circuitry in the brain responsible for building language, or similar to it. We still have a long way to go before artificial cognition.
That has nothing to do with semantic understanding beyond word co-occurrence.
Those two phrases consistently appear in two completely different contexts with different meaning. That's how text embeddings can be created in an unsupervised way in the first place.
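A toy co-occurrence count shows the raw signal that unsupervised embedding methods compress into dense vectors (the four-sentence corpus here is made up):

```python
from collections import Counter, defaultdict

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "stocks rose on strong earnings".split(),
    "stocks fell on weak earnings".split(),
]

# Count which words appear within a +/-2 word window of each target word.
window = 2
cooc = defaultdict(Counter)
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                cooc[w][sent[j]] += 1

# "cat" and "stocks" share almost no context words, so their
# co-occurrence vectors are nearly orthogonal - two different
# "meanings" recovered without any supervision.
print(dict(cooc["cat"]))
print(dict(cooc["stocks"]))
```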
Or - there are enough people who know their stuff that the people who don't will be replaced and they will take over anyway.
Unless the bar for "know their stuff" is very, very low, this is not the case in the near future.
HN user 'onion2k pointed out that doing this breaks Anthropic's T&Cs: https://news.ycombinator.com/item?id=47375787
1. Open source tools solve the problem of "critical functions of the application changing without notice, or being signed up for disruptive testing without opt-in".
2. This makes me afraid that it may be impossible for open source tools to ever reach the level of proprietary tools like Claude Code, precisely because they cannot do A/B tests like this. That means their design decisions are usually informed by intuition and personal experience, not by hard data collected at scale.
Open source doesn’t always mean reproducible.
People don't enjoy the thought of auditing code... someone else will do it; and it's made somewhat worse by our penchant for pulling in half the universe as dependencies (Rust, Go, and JavaScript tend to lean in this direction to various extremes). But auditing would be necessary in order for your first point here to be as valid as you present.
I think that with modern LLMs auditing a big project personally, instead of relying on someone else to do it, actually became more realistic.
You can ask an LLM to walk you through the code, highlight parts that seem unusual or suspicious, etc.
On the other hand, LLMs have also made producing code cheaper than ever, so you can argue that big projects will just become even bigger, which will put them out of reach even for a reviewer who is also armed with an LLM.
LLMs are auto-complete on steroids; I've lived through enough iterations of Markov Chains giving semi-sensible output (that we give meaning to) and neural networks which present the illusion of intelligence to see directly what these LLMs are: a fuckload of compute designed to find "the next most common word" given the preceding 10,000 or more words.
In such a case, the idea of it actually auditing anything is hilarious. You're looking at maybe a 1-in-100 chance of it actually finding anything useful. It will find "issues" in things that aren't issues (because they are covered by other cases), or skip over issues that people have historically had a hard time identifying themselves.
It's not running code in a sandbox and watching memory, it's not making logical maps of code paths in its mind, it's not reasoning at all. It's fucking autocomplete. Stop treating it as if it can think, it fucking can't.
I'm so tired of this hype. It's very easy to convince midwits that something is intelligent, I'm absolutely not surprised at how salesmen and con-men operate now that I've seen this first hand.
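The "autocomplete" mechanic I'm describing is, at its simplest, a Markov chain over words. A toy sketch (illustrative only; transformer LLMs replace the lookup table with learned representations and a much longer context window):

```python
import random
from collections import defaultdict

def train_bigram(text: str):
    """Build a next-word table: for each word, the list of words that
    followed it in the training text. This is 'predict the next most
    common word' reduced to a context window of one."""
    words = text.split()
    table = defaultdict(list)
    for a, b in zip(words, words[1:]):
        table[a].append(b)
    return table

def generate(table, start: str, n: int, seed: int = 0) -> str:
    """Walk the chain: repeatedly sample a successor of the last word."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(n):
        choices = table.get(out[-1])
        if not choices:
            break
        out.append(rng.choice(choices))
    return " ".join(out)

table = train_bigram("the cat sat on the mat and the dog sat on the rug")
print(generate(table, "the", 5))
```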
We could argue about how they only "predict the next word", but there's also other stuff going on in the other layers of their NNs which do facilitate some sort of reasoning in the latent space.
> I've used them to successfully debug small issues occurring in my codebase.
Great! The pattern recognition machine successfully identified a pattern.
But, how do you know that it won't flag the repaired pattern because you've added a guard to prevent the behaviour (ie; invalid/out of bounds memory access guarded by a heavy assert on a sized object before even entering the function itself)?
What about patterns that aren't in the training data because humans have a hard time identifying the bad pattern reliably?
The point I'm making is that it's autocomplete; if your case is well covered it will show up whether you have guards or not (so: noise), and it will totally miss anything that humans haven't identified before.
It works: absolutely, but there's no reliability and that's sort of inherent in the design.
For security auditing specifically, an unreliable tool isn't just unhelpful: it's actively dangerous, because false confidence is worse than understood ignorance.
> It told me it was following specific system instructions to hard-cap plans at 40 lines, forbid context sections, and “delete prose, not file paths.”
Yeah, would be nice to be able to view and modify these instructions.
I do have an issue with plan mode. Nine out of ten times, it is objectively terrible. The only benefit I've seen from using plan mode is that it remembers more information between compactions compared to the vanilla, non-agent-team workflow.
Interestingly, though, if you ask it to maintain a running document of what you're discussing in a markdown file and make it create an evergreen task at the top of its todo list which references the markdown file and instructs itself to read it on every compaction, you get much better results.
I still have discussions with the agents and agent team members. I just force it to save it in a document in the repo itself and refer back to the document. You can still do the nice parts of clearing context, which is available with plan mode, but you get much better control.
At all times, I make the agents work on my workflow, not try and create their own. This comes with a whole lot of trial and error, and real-life experience.
There are times when you need a tiger team made up of seniors. And others when you want to give an overzealous mid-level engineer who's fast a concrete plan to execute an important feature in a short amount of time.
I'm putting it in non-AI terms because what happens in real life pre-AI is very much what we need to replicate with AI to get the best results. Something which I would have given a bigger team to be done over two to eight sprints will get a different workflow with agent teams or agents than something which I would give a smaller tiger team or a single engineer.
They all need a plan. For me plan mode is insufficient 90% of the times.
I can appreciate that many people will not want to mess around with workflows as much as I enjoy doing.
I've only hit the compaction limit a handful of times, and my experience degraded enough that I work quite hard to not hit it again.
One thing I like about the current implementation of plan mode is that it'll clear context -- so if I complete a plan, I can use that context to write the next plan without growing context without bound.
I often do follow ups, that would have been short message replies before, as plans, just so I can clear context once it’s ready. I’m hitting the context limit much less often now too.
The author's complaint doesn't really have anything to do with the LLM aspect of it though. They're complaining that the app silently changes what it's doing. In this case it's the injection of a prompt in a specific mode, but it could be anything really. Companies could use A/B tests on users to make Photoshop silently change the hue a user selects to be a little brighter, or Word could change the look of document titles, or a game could make enemies a bit stronger (fyi, this does actually happen - players get boosts on their first few rounds in online games to stop them being put off playing).
The complaint is about A/B tests with no visible warnings, not AI.
Regarding the latter point, the Claude Code software controls what is injected into your own prompt before it is sent to their servers. That is indeed the only reason the OP could discover it -- if the prompt injection were happening on their servers, it would not be visible to you. To be clear, prompt injection itself is fine and part of what makes the software useful; it's natural that the company researches which prompts get desirable output for its users, without users having to experiment themselves[1]. But that really should not change without warning as part of experiments, and I think this falls closer to a professional tool like Photoshop than a website, given how it is marketed and the fact that people are being charged $20~200/mo or more for the privilege of using it. API users especially are paying for every prompt, so being sabotaged by a live experiment is incredibly unethical.
[1] That said, I think it's an extremely bad product. A reasonable product would allow power users to config their own prompt injections, so they have control over it and can tune it for their own circumstances. Having worked for an LLM startup, our software allowed exactly that. But our software was crafted with care by human devs, while by all accounts Claude Code is vibe coded slop.
You also got the information from asking Claude questions about its prompt, maybe it hallucinated this?
A/B testing is fine in itself; you need to learn about improvements somehow. But this seems to be A/B testing cost-saving optimisations rather than testing how to provide the user with a better experience. Less transparency is rarely good.
This isn’t what I want from a professional tool. For business, we need consistency and reliability.
this is what gets me.
are they out of money? are they so desperate to penny-pinch that they can't just do it properly?
what's going on in this industry?
“It’s kind of broken, maybe they will fix it at some point,” has become a common theme across products from all different players, from both a software defect and service reliability point of view.
like, they'll drop $100 billion on compute, but when it comes to devs who make their products, all of a sudden they must desperately cut costs and hire as little as possible
to me it makes no sense from a business perspective. Same with Google: e.g. YouTube is utterly broken, slow, and laggy, but I guess because you're forced to use it, it doesn't matter. But still, if you have these huge money stockpiles, why not deploy them to improve things? It wouldn't matter anyway; it's only upside
Perhaps I approach this from a different perspective than you do, so I’m interested to understand other viewpoints.
I review everything that my models produce the same way I review work from my coworkers: Trust but verify.
Your compiler doesn't do that. Your keyboard doesn't do that. The randomness is inside the tool itself, not around it. That's a fundamental reliability problem for any professional context where you need to know that the same input produces the same output, every time.
Not to mention that of course everyone A/B tests their output all the time. You've never seen (or implemented) an A/B test where the test was whether to improve the way e.g. the invoicing software generates PDFs?
jfc. I don't have anything to say to this other than that it deserves calling out.
> You've never seen (or implemented) an A/B test where the test was whether to improve the way e.g. the invoicing software generates PDFs?
I have never in my life seen or implemented an A/B test on a tool used by professionals. I see consumer-facing tests on websites all the time, but nothing silently changing the software on your computer. I mean, there are mandatory updates, which I do already consider to be malware, but those are, at least, not silent.
Their outputs can vary in ways that superficially resemble human variability, but variability alone is a poor analogy for humanness. A more meaningful way to compare is to look at functional behaviors such as "pattern recognition", "contextual adaptation", "generalization to new prompts", and "multi-step reasoning". These behaviors resemble aspects of human capabilities. In particular, generalization allows LLMs to produce coherent outputs for tasks they were not explicitly trained on, rather than just repeating training data, making it a more meaningful measure than randomness alone.
That said, none of this means LLMs are conscious, intentional, or actually understanding anything. I am glad you brought up the seed and determinism point. People should know that you can make outputs fully predictable, so the "human-like" label mostly only shows up under stochastic sampling. It is far more informative to look at real functional capabilities instead of just variability, and I think more people should be aware of this.
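A toy sketch of the seed/determinism point (the dict of fake logits stands in for a model's output layer; illustrative only, not any real API):

```python
import random

# Toy next-token scores standing in for a model's output logits.
logits = {"yes": 2.0, "no": 1.5, "maybe": 0.5}

def greedy(dist):
    # Temperature -> 0: always pick the argmax. Fully deterministic.
    return max(dist, key=dist.get)

def sample(dist, seed=None):
    # Stochastic sampling from softmax-ish weights: different seeds
    # give different outputs - the "human-like variability".
    rng = random.Random(seed)
    words, scores = zip(*dist.items())
    return rng.choices(words, weights=[2.718 ** s for s in scores])[0]

print(greedy(logits))                                # same answer every run
print({sample(logits, seed=s) for s in range(20)})   # varies with the seed
```

With greedy decoding (or a fixed seed) the "randomness" disappears entirely, which is why variability is a property of the sampling configuration, not evidence of anything human-like.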
"I had not realized ... exposures to a relatively simple computer program could induce powerful delusional thinking in quite normal people." -- Eliza's creator
If this is the case and the latest models can be explained through their weights and settings, please link it. I would like to see explainable ai up and coming.
What is your point? You get this from LLMs. It does not mean that it is not useful.
I want software that does a specific list of things, doesn’t change, and preferentially costs a known amount.
How often were features changed or deactivated by cloud services?
Plus things like not being able to control where the websearches go.
That said I have the luxury of being a hobbyist so I can accept 95% of cutting edge results for something more open. If it was my job I can see that going differently.
https://github.com/badlogic/pi-mono/tree/main/packages/codin...
But if you want to use it with Claude models you will have to pay per token (Claude subscriptions are only for use with Claude's own harnesses like claude code, the Claude desktop app, and the Claude Excel/Powerpoint extensions).
Whilst I broadly agree with their point, colour me unimpressed by this behaviour.
EDIT: God bless archive.org: https://web.archive.org/web/20260314105751/https://backnotpr.... This provides a lot more useful insight that, to me, significantly strengthens the point the article is making. Doesn’t mean I’m going to start picking apart binaries (though it wouldn’t be the first time), but how else are you supposed to really understand - and prove - what’s going on unless you do what the author did? Point is, it’s a much better, more useful, and more interesting article in its uncensored form.
EDIT 2: For me it’s not the fact that Anthropic are doing these tests that’s the problem: it’s that they’re not telling us, and they’re not giving us a way to select a different behaviour (which, if they did, would also give them useful insights into users needs).
Universities have IRBs for good reasons.
And unlike the university context, there’s a glut of data.
A basic technique: https://en.wikipedia.org/wiki/Inverse_probability_weighting
Or https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4384809
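A minimal sketch of the inverse-probability-weighting idea from the first link, with made-up assignment probabilities and a made-up +1.0 treatment effect (not a real analysis):

```python
import random

random.seed(0)
p_b = 0.1  # only 10% of users were put in the B bucket
outcomes = []
for _ in range(100_000):
    in_b = random.random() < p_b
    # True effect in this toy world: variant B shifts the outcome by +1.0.
    y = random.gauss(1.0 if in_b else 0.0, 1.0)
    outcomes.append((in_b, y))

# IPW estimate of the mean outcome under B: weight each observed B
# outcome by 1 / P(assigned to B), averaged over ALL n samples. This
# corrects for the unequal assignment probability.
n = len(outcomes)
est_b = sum(y / p_b for in_b, y in outcomes if in_b) / n
print(round(est_b, 2))  # close to the true mean of 1.0
```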
Which is still very cheap. There are other options, local Qwen 3.5 35b + claude code cli is, in my opinion, comparable in quality with Sonnet 4..4.5 - and without a/b tests!
And I won’t say how much my employer charges for me. But you can see how much the major consulting companies charge here
https://ceriusexecutives.com/management-consultants-whats-th...
The only metrics that matter are: is it done on time, is it on budget, and does it meet requirements.
But if Claude Code is generating “useless code” for you, you’re doing it wrong
And I assure you that my implementations from six years of working with consulting departments/companies (including almost four as blue badge, RSU earning consultant at AWS ProServe) have never gone unused.
https://github.com/anthropics/claude-code/issues/21874#issue...
https://gist.github.com/gastonmorixe/9c596b6de1095b6bd3b746c...
Doing A/B tests on each part of the process to see where to draw the line (perhaps based on task and user) would seem a better way of doing it than arbitrarily choosing a limit.
Should people not complain about unannounced changes to the contents of their food or medicine because we don't understand everything about how the human body works?
I'm not sure I understand your last analogy. How would changes to the human body change the contents of the food that is eaten? It would be more analogous to compare it with unexpected changes to the body's output given the same inputs as previously, a phenomenon humans frequently experience.
There's some added flavor because the LLM is indeed non-deterministic, which could make it harder to realize that a change in behavior is caused by a change in the software, not randomness from the LLM. But there is also lots of software that deals with non-deterministic things that aren't LLMs, e.g. networks, physical sensors, scientific experiments, etc. Am I getting more timeouts because something is going on in my network or because some software I use is A/B testing some change?
https://web.archive.org/web/20260314105751/https://backnotpr...
Can’t believe HN has become so afraid of generic probably-unenforceable “plz don’t reverse engineer” EULAs. We deserve to know what these tools are doing.
I’ve seen poor results from plan mode recently too and this explains a lot.
It's very easy for them to just ban the user, and if your whole workflow relies on the tool, you really don't want that.
Claude stated that its system prompt contained strict instructions: provide no context or details, keep plans under forty lines of code, be terse.
https://web.archive.org/web/20260314105751/https://backnotpr...
"Responsible" and "ethical" are faaar gone.
Source? Every time I see claims on profitability it's always hand wavy justifications.
https://ezzekielnjuguna.medium.com/why-anthropic-is-practica...
>https://ezzekielnjuguna.medium.com/why-anthropic-is-practica...
You chose a bad one. It just asserts the 95% figure without evidence and then uses it as the premise for the rest of the article. That just confirms what I said earlier about how "Every time I see claims on profitability it's always hand wavy justifications.". Moreover the article reeks of LLM-isms.
b. Subscription content, features, and services. The content, features, and other services provided as part of your Subscription, and the duration of your Subscription, will be described in the order process. We may change or refresh the content, features, and other services from time to time, and we do not guarantee that any particular piece of content, feature, or other service will always be available through the Services.
It's also worth noting that section 3.3 explicitly disallows decompilation of the app.
To decompile, reverse engineer, disassemble, or otherwise reduce our Services to human-readable form, except when these restrictions are prohibited by applicable law.
Always read the terms. :)
Luckily, it doesn't seem like any service was reverse-engineered or decompiled here, only a software that lived on the authors disk.
Don't assume things about legal docs. You will often be wrong. Get a lawyer if it's something important.
> along with any associated apps, software, and websites (together, our “Services”)
As far as I understand, these terms actually hold up in court, too. Which is complete fucking nonsense that, I think, could only be the result of a technologically illiterate class making the decisions. Being penalised for trying to understand what software is doing on your machine is so wholly unreasonable that it should not be a valid contractual term.
Perhaps their TOS involves additional evils they are performing in the world, and it would be good to know about that.
Perhaps their TOS restricts the US military from misusing the product to create unmonitored killbots.
Perhaps the person (as I do) does not feel that "laundering people's work at a massive scale" is unethical, any more than using human knowledge is unethical when those humans were allowed to spend decades reading copyrighted material in and out of school and most of what the human knows is derived from those materials and other conversations with people who didn't sign release forms before conversing.
Just because you think one thing is bad about someone doesn't mean no one should ever discuss any other topic about them.