I disagree in the case of LLMs.
AI already has a massive problem with reproducibility and reliability, and AI firms gleefully kick this problem down to the users. "Never trust its output".
It's already enough of a pain in the ass to constrain these systems without the companies silently changing things around.
And this also pretty much ruins any attempt to research Claude Code's long term effectiveness in an organisation. Any negative result can now be thrown straight into the trash because of the chance Anthropic put you on the wrong side of an A/B test.
> That being said, vastly reducing an LLM's effectiveness as part of an A/B test isn't acceptable, which appears to be the case here.
The open question here is whether or not they were doing similar things to their other products. Claude Code shitting out a bad function is annoying but should be caught in review.
People use LLMs for things like hiring. An undeclared A/B test there would be ethically horrendous and a legal nightmare for the client.
Or, you could, you know, try to understand your users without experimenting on them, like countless others have managed to do before, and still shipped "great products".
Where can I sign up?
Edit: how to disable auto updates of the client app https://code.claude.com/docs/en/setup#disable-auto-updates
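Per the linked docs (at the time of writing; check the page in case the names have changed), either of these should stop the client from silently updating itself:

```shell
# Option 1: environment variable (e.g. in ~/.bashrc or ~/.zshrc)
export DISABLE_AUTOUPDATER=1

# Option 2: in ~/.claude/settings.json
# { "autoUpdates": false }
```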
LLMs are non-deterministic anyway, as you note above with your comment on the 'reproducibility' issue. So any research into CC's long-term effectiveness would already have taken into account that you can run it 15x in a row and get a different response every time.
These are two very different things. I suspect that in some cases pointing the finger at a black box instead of actually explaining your decisions can actually shield you from legal liability...
You can do A/B testing by splitting your audience into groups, having some of them use A and others use B - all the time.
I think the article's author is frustrated over sometimes getting A and at other times B, and not knowing which one he is on.
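A typical way that split is implemented is deterministic hashing of the user ID, so each user stays in one bucket for the lifetime of an experiment. A sketch (function and experiment names are made up):

```python
import hashlib

def ab_bucket(user_id: str, experiment: str, variants=("A", "B")) -> str:
    """Deterministically assign a user to a variant.

    Hashing (experiment + user_id) means each user always lands in the
    same bucket for a given experiment, while assignments are
    independent across different experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# The same user always gets the same variant for the same experiment:
print(ab_bucket("user-42", "plan-mode-prompt"))
```

The frustration in the article is exactly this property from the user's side: your assignment is stable and invisible, so you can't tell whether the behaviour you see is "the product" or "your bucket".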
Evil might be a stretch, but I really hate A/B testing. Some feature or UI component you relied on is now different, with no warning, and you ask a coworker about it, and they have no idea what you're talking about.
Usually, the change is for the worse, but gets implemented anyway. I'm sure the teams responsible have "objective" "data" which "proves" it's the right direction, but the reality of it is often the opposite.
In my experience all manner of analytics data frequently gets misused to support whatever narrative the product manager wants it to support.
With enough massaging you can make “objective” numbers say anything, especially if you do underhanded things like bury a previously popular feature three modals deep or put it behind a flag. “Oh would you look at that, nobody uses this feature any more! Must be safe to remove it.”
It's not "unexpected" but it is still unethical. In ye olde days, you had something like "release notes" with software, and you could inform yourself what changed instead of having to question your memory "didn't there exist a button just yesterday?" all the time. Or you could simply refuse to install the update, or you could run acceptance tests and raise flags with the vendor if your acceptance tests caused issues with your workflow.
Now with everything and their dog turning SaaS for that sweet sweet recurring revenue and people jerking themselves off over "rapid deployment", with the one doing the most deployments a day winning the contest? Dozens if not hundreds of "releases" a day, and in the worst case, you learn the new workflow only for it to be reverted without notice again. Or half your users get the A bucket, the other half gets the B bucket, and a few users get the C bucket, so no one can answer issues that users in the other bucket have. Gaslighting on a million people scale.
It sucks, and I wish everyone doing this nothing but debilitating pain in their life. Just a bit of revenge for all the pain you caused your users in the endless pursuit of 0.0001% more growth.
No. Users aren't free test guinea pigs. A/B testing cannot be done ethically unless you actively point out to users that they are being A/B tested and offer them a way to opt out, but that in turn ruins a large part of the promise behind A/B tests.
Enshittification is coming for AI.
Planning serves two purposes - helping the model stay on track and helping the user gain confidence in what the model is about to do. Both sides of that are fuzzy, complex and non-obvious!
I think I'd be okay with a smaller but more narratively detailed plan - it's not so much about verbosity, more about me understanding what is about to happen and why. There wasn't much dialogue once plan mode kicked in (i.e. no Q&A): it would jump into its own planning and idle until all I saw was a set of projected code changes.
But on the other hand they are so useful for boilerplate, and for quickly connecting you with verbiage that might guide you to the correct path faster than conventional means. Like a clueless CEO type just spitballing terms they do not understand, but still nudging something in your thought process.
But you REALLY need to know your stuff to begin with for them to be of any use. Those who think they will take over are clueless.
You're underestimating where it's headed.
Not sure. I am not so optimistic. People got intoxicated with nuclear-powered cars, flying cars, bases on the moon, etc. All that technological euphoria from the '50s and '60s never panned out. This might be like that.
I think we definitely stumbled on something akin to the circuitry in the brain responsible for building language, or similar to it. We still have a long way to go before artificial cognition.
That has nothing to do with semantic understanding beyond word co-occurrence.
Those two phrases consistently appear in two completely different contexts with different meaning. That's how text embeddings can be created in an unsupervised way in the first place.
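A toy co-occurrence count shows the raw signal that unsupervised embedding methods compress into dense vectors (the four-sentence corpus here is made up):

```python
from collections import Counter, defaultdict

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "stocks rose on strong earnings".split(),
    "stocks fell on weak earnings".split(),
]

# Count which words appear within a +/-2 word window of each target word.
window = 2
cooc = defaultdict(Counter)
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                cooc[w][sent[j]] += 1

# "cat" and "stocks" share almost no context words, so their
# co-occurrence vectors are nearly orthogonal - two different
# "meanings" recovered without any supervision.
print(dict(cooc["cat"]))
print(dict(cooc["stocks"]))
```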
Or - there are enough people who know their stuff that the people who don't will be replaced and they will take over anyway.
Unless the bar for "know their stuff" is very, very low, this is not the case in the near future.
HN user 'onion2k pointed out that doing this breaks Anthropic's T&Cs: https://news.ycombinator.com/item?id=47375787
1. Open source tools solve the problem of "critical functions of the application changing without notice, or being signed up for disruptive testing without opt-in".
2. This makes me afraid that it may be impossible for open source tools to ever reach the level of proprietary tools like Claude Code, precisely because they cannot do A/B tests like this. That means their design decisions are usually informed by intuition and personal experience, not by hard data collected at scale.
Open source doesn’t always mean reproducible.
People don't enjoy the thought of auditing code... someone else will do it; and it's made somewhat worse by our penchant for pulling in half the universe as dependencies (Rust, Go, and JavaScript tend to lean in this direction to various extremes). But auditing would be necessary in order for your first point here to be as valid as you present.
I think that with modern LLMs auditing a big project personally, instead of relying on someone else to do it, actually became more realistic.
You can ask an LLM to walk you through the code, highlight parts that seem unusual or suspicious, etc.
On the other hand, LLMs have also made producing code cheaper than ever, so you can argue that big projects will just become even bigger, which will put them out of reach even for a reviewer who is also armed with an LLM.
LLMs are auto-complete on steroids; I've lived through enough iterations of Markov Chains giving semi-sensible output (that we give meaning to) and neural networks which present the illusion of intelligence to see directly what these LLMs are: a fuckload of compute designed to find "the next most common word" given the preceding 10,000 or more words.
In such a case, the idea of it actually auditing anything is hilarious. You're looking at maybe a 1-in-100 chance of it actually finding anything useful. It will find "issues" in things that aren't issues (because they are covered by other cases), or skip over issues that people have historically had a hard time identifying themselves.
It's not running code in a sandbox and watching memory, it's not making logical maps of code paths in its mind, it's not reasoning at all. It's fucking autocomplete. Stop treating it as if it can think, it fucking can't.
I'm so tired of this hype. It's very easy to convince midwits that something is intelligent, I'm absolutely not surprised at how salesmen and con-men operate now that I've seen this first hand.
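The "autocomplete" mechanic I'm describing is, at its simplest, a Markov chain over words. A toy sketch (illustrative only; transformer LLMs replace the lookup table with learned representations and a much longer context window):

```python
import random
from collections import defaultdict

def train_bigram(text: str):
    """Build a next-word table: for each word, the list of words that
    followed it in the training text. This is 'predict the next most
    common word' reduced to a context window of one."""
    words = text.split()
    table = defaultdict(list)
    for a, b in zip(words, words[1:]):
        table[a].append(b)
    return table

def generate(table, start: str, n: int, seed: int = 0) -> str:
    """Walk the chain: repeatedly sample a successor of the last word."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(n):
        choices = table.get(out[-1])
        if not choices:
            break
        out.append(rng.choice(choices))
    return " ".join(out)

table = train_bigram("the cat sat on the mat and the dog sat on the rug")
print(generate(table, "the", 5))
```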
We could argue about how they only "predict the next word", but there's also other stuff going on in the other layers of their NNs which do facilitate some sort of reasoning in the latent space.
> I've used them to successfully debug small issues occurring in my codebase.
Great! The pattern recognition machine successfully identified a pattern.
But, how do you know that it won't flag the repaired pattern because you've added a guard to prevent the behaviour (ie; invalid/out of bounds memory access guarded by a heavy assert on a sized object before even entering the function itself)?
What about patterns that aren't in the training data because humans have a hard time identifying the bad pattern reliably?
The point I'm making is that it's autocomplete; if your case is well covered it will show up whether you have guards or not (so: noise), and it will totally miss anything that humans haven't identified before.
It works: absolutely, but there's no reliability and that's sort of inherent in the design.
For security auditing specifically, an unreliable tool isn't just unhelpful: it's actively dangerous, because false confidence is worse than understood ignorance.
> It told me it was following specific system instructions to hard-cap plans at 40 lines, forbid context sections, and “delete prose, not file paths.”
Yeah, would be nice to be able to view and modify these instructions.
I do have an issue with plan mode. Nine out of ten times, it is objectively terrible. The only benefit I've seen from using plan mode is that it remembers more information between compactions compared to the vanilla, non-agent-team workflow.
Interestingly, though, if you ask it to maintain a running document of what you're discussing in a markdown file and make it create an evergreen task at the top of its todo list which references the markdown file and instructs itself to read it on every compaction, you get much better results.
I still have discussions with the agents and agent team members. I just force it to save it in a document in the repo itself and refer back to the document. You can still do the nice parts of clearing context, which is available with plan mode, but you get much better control.
At all times, I make the agents work on my workflow, not try and create their own. This comes with a whole lot of trial and error, and real-life experience.
There are times when you need a tiger team made up of seniors. And others when you want to give an overzealous mid-level engineer who's fast a concrete plan to execute an important feature in a short amount of time.
I'm putting it in non-AI terms because what happens in real life pre-AI is very much what we need to replicate with AI to get the best results. Something which I would have given a bigger team to be done over two to eight sprints will get a different workflow with agent teams or agents than something which I would give a smaller tiger team or a single engineer.
They all need a plan. For me plan mode is insufficient 90% of the times.
I can appreciate that many people will not want to mess around with workflows as much as I enjoy doing.
I've only hit the compaction limit a handful of times, and my experience degraded enough that I work quite hard to not hit it again.
One thing I like about the current implementation of plan mode is that it'll clear context -- so if I complete a plan, I can use that context to write the next plan without growing context without bound.
I often do follow ups, that would have been short message replies before, as plans, just so I can clear context once it’s ready. I’m hitting the context limit much less often now too.
The author's complaint doesn't really have anything to do with the LLM aspect of it though. They're complaining that the app silently changes what it's doing. In this case it's the injection of a prompt in a specific mode, but it could be anything really. Companies could use A/B tests on users to make Photoshop silently change the hue a user selects to be a little brighter, or Word could change the look of document titles, or a game could make enemies a bit stronger (fyi, this does actually happen - players get boosts on their first few rounds in online games to stop them being put off playing).
The complaint is about A/B tests with no visible warnings, not AI.
Regarding the latter point, the Claude Code software controls what is injected into your own prompt before it is sent to their servers. That is indeed the only reason the OP could discover it -- if the prompt injection were happening on their servers, it would not be visible to you. To be clear, prompt injection itself is fine and part of what makes the software useful; it's natural that the company researches which prompts get desirable output for its users, without users having to experiment themselves[1]. But that really should not change without warning as part of experiments, and I think this falls closer to a professional tool like Photoshop than a website, given how it is marketed and the fact that people are being charged $20~200/mo or more for the privilege of using it. API users especially are paying for every prompt, so being sabotaged by a live experiment is incredibly unethical.
[1] That said, I think it's an extremely bad product. A reasonable product would allow power users to config their own prompt injections, so they have control over it and can tune it for their own circumstances. Having worked for an LLM startup, our software allowed exactly that. But our software was crafted with care by human devs, while by all accounts Claude Code is vibe coded slop.
You also got the information from asking Claude questions about its prompt, maybe it hallucinated this?
A/B testing is fine in itself; you need to learn about improvements somehow. But this seems to be A/B testing cost-saving optimisations rather than testing how to provide the user with a better experience. Less transparency is rarely good.
This isn’t what I want from a professional tool. For business, we need consistency and reliability.
this is what gets me.
are they out of money? are they so desperate to penny-pinch that they can't just do it properly?
what's going on in this industry?
“It’s kind of broken, maybe they will fix it at some point,” has become a common theme across products from all different players, from both a software defect and service reliability point of view.
like, they'll drop $100 billion on compute, but when it comes to devs who make their products, all of a sudden they must desperately cut costs and hire as little as possible
to me it makes no sense from a business perspective. Same with Google: e.g. YouTube is utterly broken, slow, and laggy, but I guess because you're forced to use it, it doesn't matter. But still, if you have these huge money stockpiles, why not deploy them to improve things? It wouldn't matter anyway; it's only upside
Perhaps I approach this from a different perspective than you do, so I’m interested to understand other viewpoints.
I review everything that my models produce the same way I review work from my coworkers: Trust but verify.
Your compiler doesn't do that. Your keyboard doesn't do that. The randomness is inside the tool itself, not around it. That's a fundamental reliability problem for any professional context where you need to know that the same input produces the same output, every time.
Not to mention that of course everyone A/B tests their output all the time. You've never seen (or implemented) an A/B test where the test was whether to improve the way e.g. the invoicing software generates PDFs?
jfc. I don't have anything to say to this other than that it deserves calling out.
> You've never seen (or implemented) an A/B test where the test was whether to improve the way e.g. the invoicing software generates PDFs?
I have never in my life seen or implemented an A/B test on a tool used by professionals. I see consumer-facing tests on websites all the time, but nothing silently changing the software on your computer. I mean, there are mandatory updates, which I do already consider to be malware, but those are, at least, not silent.
Their outputs can vary in ways that superficially resemble human variability, but variability alone is a poor analogy for humanness. A more meaningful way to compare is to look at functional behaviors such as "pattern recognition", "contextual adaptation", "generalization to new prompts", and "multi-step reasoning". These behaviors resemble aspects of human capabilities. In particular, generalization allows LLMs to produce coherent outputs for tasks they were not explicitly trained on, rather than just repeating training data, making it a more meaningful measure than randomness alone.
That said, none of this means LLMs are conscious, intentional, or actually understanding anything. I am glad you brought up the seed and determinism point. People should know that you can make outputs fully predictable, so the "human-like" label mostly only shows up under stochastic sampling. It is far more informative to look at real functional capabilities instead of just variability, and I think more people should be aware of this.
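A toy sketch of the seed/determinism point (the dict of fake logits stands in for a model's output layer; illustrative only, not any real API):

```python
import random

# Toy next-token scores standing in for a model's output logits.
logits = {"yes": 2.0, "no": 1.5, "maybe": 0.5}

def greedy(dist):
    # Temperature -> 0: always pick the argmax. Fully deterministic.
    return max(dist, key=dist.get)

def sample(dist, seed=None):
    # Stochastic sampling from softmax-ish weights: different seeds
    # give different outputs - the "human-like variability".
    rng = random.Random(seed)
    words, scores = zip(*dist.items())
    return rng.choices(words, weights=[2.718 ** s for s in scores])[0]

print(greedy(logits))                                # same answer every run
print({sample(logits, seed=s) for s in range(20)})   # varies with the seed
```

With greedy decoding (or a fixed seed) the "randomness" disappears entirely, which is why variability is a property of the sampling configuration, not evidence of anything human-like.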
"I had not realized ... exposures to a relatively simple computer program could induce powerful delusional thinking in quite normal people." -- Eliza's creator
If this is the case and the latest models can be explained through their weights and settings, please link it. I would like to see explainable ai up and coming.
What is your point? You get this from LLMs. It does not mean that it is not useful.
I want software that does a specific list of things, doesn’t change, and preferentially costs a known amount.
How often were features changed or deactivated by cloud services?
Plus things like not being able to control where the websearches go.
That said I have the luxury of being a hobbyist so I can accept 95% of cutting edge results for something more open. If it was my job I can see that going differently.
https://github.com/badlogic/pi-mono/tree/main/packages/codin...
But if you want to use it with Claude models you will have to pay per token (Claude subscriptions are only for use with Claude's own harnesses like claude code, the Claude desktop app, and the Claude Excel/Powerpoint extensions).
Whilst I broadly agree with their point, colour me unimpressed by this behaviour.
EDIT: God bless archive.org: https://web.archive.org/web/20260314105751/https://backnotpr.... This provides a lot more useful insight that, to me, significantly strengthens the point the article is making. Doesn’t mean I’m going to start picking apart binaries (though it wouldn’t be the first time), but how else are you supposed to really understand - and prove - what’s going on unless you do what the author did? Point is, it’s a much better, more useful, and more interesting article in its uncensored form.
EDIT 2: For me it’s not the fact that Anthropic are doing these tests that’s the problem: it’s that they’re not telling us, and they’re not giving us a way to select a different behaviour (which, if they did, would also give them useful insights into users needs).
Universities have IRBs for good reasons.
And unlike the university context, there’s a glut of data.
A basic technique: https://en.wikipedia.org/wiki/Inverse_probability_weighting
Or https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4384809
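A minimal sketch of the inverse-probability-weighting idea from the first link, with made-up assignment probabilities and a made-up +1.0 treatment effect (not a real analysis):

```python
import random

random.seed(0)
p_b = 0.1  # only 10% of users were put in the B bucket
outcomes = []
for _ in range(100_000):
    in_b = random.random() < p_b
    # True effect in this toy world: variant B shifts the outcome by +1.0.
    y = random.gauss(1.0 if in_b else 0.0, 1.0)
    outcomes.append((in_b, y))

# IPW estimate of the mean outcome under B: weight each observed B
# outcome by 1 / P(assigned to B), averaged over ALL n samples. This
# corrects for the unequal assignment probability.
n = len(outcomes)
est_b = sum(y / p_b for in_b, y in outcomes if in_b) / n
print(round(est_b, 2))  # close to the true mean of 1.0
```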
Which is still very cheap. There are other options, local Qwen 3.5 35b + claude code cli is, in my opinion, comparable in quality with Sonnet 4..4.5 - and without a/b tests!
And I won’t say how much my employer charges for me. But you can see how much the major consulting companies charge here
https://ceriusexecutives.com/management-consultants-whats-th...
The only metrics that matter are: is it done on time, is it on budget, and does it meet requirements.
But if Claude Code is generating “useless code” for you, you’re doing it wrong
And I assure you that my implementations from six years of working with consulting departments/companies (including almost four as blue badge, RSU earning consultant at AWS ProServe) have never gone unused.
https://github.com/anthropics/claude-code/issues/21874#issue...
https://gist.github.com/gastonmorixe/9c596b6de1095b6bd3b746c...
Doing A/B tests on each part of the process to see where to draw the line (perhaps based on task and user) would seem a better way of doing it than arbitrarily choosing a limit.
Should people not complain about unannounced changes to the contents of their food or medicine because we don't understand everything about how the human body works?
I'm not sure I understand your last analogy. How would changes to the human body change the contents of the food that is eaten? It would be more analogous to compare it with unexpected changes to the body's output given the same inputs as previously, a phenomenon humans frequently experience.
There's some added flavor because the LLM is indeed non-deterministic, which could make it harder to realize that a change in behavior is caused by a change in the software, not randomness from the LLM. But there is also lots of software that deals with non-deterministic things that aren't LLMs, e.g. networks, physical sensors, scientific experiments, etc. Am I getting more timeouts because something is going on in my network or because some software I use is A/B testing some change?
https://web.archive.org/web/20260314105751/https://backnotpr...
Can’t believe HN has become so afraid of generic probably-unenforceable “plz don’t reverse engineer” EULAs. We deserve to know what these tools are doing.
I’ve seen poor results from plan mode recently too and this explains a lot.
It's very easy for them to just ban the user, and if your whole workflow relies on the tool, you really don't want that.
Claude stated that its system prompt contained strict instructions: provide no context or details, keep plans under forty lines of code, be terse.
https://web.archive.org/web/20260314105751/https://backnotpr...
"Responsible" and "ethical" are faaar gone.
Source? Every time I see claims on profitability it's always hand wavy justifications.
https://ezzekielnjuguna.medium.com/why-anthropic-is-practica...
>https://ezzekielnjuguna.medium.com/why-anthropic-is-practica...
You chose a bad one. It just asserts the 95% figure without evidence and then uses it as the premise for the rest of the article. That just confirms what I said earlier about how "Every time I see claims on profitability it's always hand wavy justifications.". Moreover the article reeks of LLM-isms.
b. Subscription content, features, and services. The content, features, and other services provided as part of your Subscription, and the duration of your Subscription, will be described in the order process. We may change or refresh the content, features, and other services from time to time, and we do not guarantee that any particular piece of content, feature, or other service will always be available through the Services.
It's also worth noting that section 3.3 explicitly disallows decompilation of the app.
To decompile, reverse engineer, disassemble, or otherwise reduce our Services to human-readable form, except when these restrictions are prohibited by applicable law.
Always read the terms. :)
Luckily, it doesn't seem like any service was reverse-engineered or decompiled here, only a software that lived on the authors disk.
Don't assume things about legal docs. You will often be wrong. Get a lawyer if it's something important.
> along with any associated apps, software, and websites (together, our “Services”)
As far as I understand, these terms actually hold up in court, too. Which is complete fucking nonsense that, I think, could only be the result of a technologically illiterate class making the decisions. Being penalised for trying to understand what software is doing on your machine is so wholly unreasonable that it should not be a valid contractual term.
Perhaps their TOS involves additional evils they are performing in the world, and it would be good to know about that.
Perhaps their TOS restricts the US military from misusing the product to create unmonitored killbots.
Perhaps the person (as I do) does not feel that "laundering people's work at a massive scale" is unethical, any more than using human knowledge is unethical when those humans were allowed to spend decades reading copyrighted material in and out of school and most of what the human knows is derived from those materials and other conversations with people who didn't sign release forms before conversing.
Just because you think one thing is bad about someone doesn't mean no one should ever discuss any other topic about them.