More frequently, A/B testing is a political technology that allows teams to move forward with changes to core, vital services of a site or app. By putting a new change behind an A/B test, the team technically derisks the change, by allowing it to be undone rapidly, and politically derisks it, by tying its deployment to rigorous testing that proves it at least does no harm to the existing process before applying it to all users. The change was already judged to be valuable when development effort went into it, whether for technical, branding, or other reasons.
In short, not many people want to funnel users through N code paths with slightly different behaviors, because not many people have a ton of users, a ton of engineering capacity, and a ton of potential upside from marginal improvements. Two-path tests solve the more common problem of wanting to make major changes to critical workflows without killing the platform.
I just want to drop here the anecdata that I've worked for a total of about 10 years in startups that proudly call themselves "data-driven" and which worshipped "A/B testing." One of them hired a data science team which actually did some decently rigorous analysis on our tests and advised things like when we had achieved statistical significance, how many impressions we needed to have, etc. The other did not and just had someone looking at very simple comparisons in Optimizely.
In both cases, the influential management people who ultimately owned the decisions would simply rig every "test" to fit the story they already believed, by doing things like running the test until the results looked "positive" but not until they were statistically significant. Or by measuring several metrics and deciding later on to make the decision based on whichever one was positive [at the time]. Or by skipping testing entirely and saying we'd just "used a pre/post comparison" to prove it out. Or even by dismissing a 'failure' and saying we would do it anyway because it's foundational to X, Y, and Z, which really will improve (insert metric). The funny part is that none of these people thought they were playing dirty; they believed they were making their decisions scientifically!
Basically, I suspect a lot of small and medium companies say they do "A/B testing" and are "data-driven" when really they're just using slightly fancy feature flags and relying on some director's gut feelings.
The worst is surely when management makes the investment in rigor but then still ignores the guidance and goes with the gut feelings that were available all along.
If A/B testing data is weak or inconclusive, and you’re at a startup with time/financial pressure, I’m sure it’s almost always better to just make a decision and move on than to spend even more time on analysis and waiting to achieve some fixed level of statistical power. It would be a complete waste of time for a company with limited manpower that needs to grow 30% per year to chase after marginal improvements.
It went really well, and then nobody ever tried it again.
see also Scrum and Agile. Or continuous deployment. Or anything else that's hard to do well, and easier to just cargo-cult some results on and call it done.
The one place where A/B testing seemed to have a huge impact was on the acquisition flow and onboarding, but not in the actual product per se.
I’ve been in companies that have tried dozens if not hundreds of A/B tests with zero statistically significant results. I figure by the law of probabilities they would have gotten at least a single significant experiment but most products have such small user bases and make such large changes at a time that it’s completely pointless.
All my complaints fell on deaf ears until the PM in charge would get on someone’s bad side and then that metric would be used to push them out. I think they’re largely a political tool like all those management consultants that only come in to justify an executive’s predetermined goals.
What I've seen in practice is that some places trust their designers' decisions and only deploy A/B tests when competent people disagree, or there's no clear, sound reason to choose one design over another. Surprise surprise, those alternatives almost always test very close to each other!
Other places remove virtually all friction from A/B testing and then use it religiously for every pixel in their product, and they get results, but often it's things like "we discovered that pink doesn't work as well as red for a warning button," stuff they never would have tried if they didn't have to feed the A/B machine.
From all the evidence I've seen in places I've worked, the motivating stories of "we increased revenue 10% by a random change nobody thought would help" may only exist in blog posts.
I think a/b tests are still good for measuring stuff like system performance, which can be really hard to predict. Flipping a switch to completely change how you do caching can be scary.
It's pretty common for one person to have an issue that no other people have, just because they landed on the wrong side of some feature flag.
It just depends on the goals of the business.
Once the user has committed to paying, they will probably put up with whatever annoyance you put in their way. And if they are paying and something is _really_ annoying, they often contact the SaaS people.
Most SaaS don't really care that much about "engagement" metrics (i.e. keeping users IN the product). These are the kinds of metrics that are the easiest to see move.
In fact most people want a product they can get in and out ASAP and move on with their lives.
For example, I worked on a new feature for a product, and the engagement metrics showed a big increase in engagement by several customers' users, and showed that their users were not only using our software more but also doing their work much faster than before. We used that to justify raising our prices -- customers were satisfied with the product before, at the previous rates, and we could prove that we had just made it significantly more useful.
I know of at least one case where we shared engagement data with a power user at a customer who didn't have purchase authority but was able to join it with their internal data to show that use of our software correlated with increased customer satisfaction scores. They took that data to their boss, who immediately bought more seats and scheduled user training for all of their workers who weren't using our software.
We also used engagement data to convince customers not to cancel. A lot of times people don't know what's going on in their own company. They want to cancel because they think nobody is using the software, and it's important to be able to tell them how many daily and hourly users they have on average. You can also give them a list of the most active users and encourage them to reach out and ask what the software does for them and what the impact would be of cancelling.
Well, at least it looks like they avoided p-hacking to show more significance than they had! That's ahead of much of science, alas.
Yea, I've been here too. And in every analytics meeting everyone went "well, we know it's not statistically significant but we'll call it the winner anyway". Every. Single. Time.
Such a waste of resources.
If your error bar for some change goes from negative 4 percent to positive 6 percent, it may or may not be better, but it's safe to switch to.
I think the disconnect here is some people thinking A/B testing is something you try once a month, and someplace like Amazon where you do it all the time and with hundreds of employees poking things.
It’s helpful in continuous delivery setups since you can test and deploy the functionality and move the bottleneck for releasing beyond that.
Re: b), if you've ever gotten into a screaming match with a game designer angry over the removal of their pet feature, you will really appreciate the political cover that having numbers provides...
It’s complicated.
It becomes an A/B test when you measure user activity to decide whether to roll out to more users.
Have my error logs gotten bigger? No.
Have my tech support calls gone up? No.
Okay then turn the dial farther.
The reason they might conflate A/B testing with gradual rollout is that both come down to controlling who gets the feature flag turned on and who doesn't.
In a sense, A/B testing is a variant of gradual rollout, where you've done it so you can see differences in feature "performance" (eg. funnel dashboards) vs just regular observability (app is not crashing yet).
Basically, a gradual rollout for the purposes of an A/B test.
Imagine that you are running MAB on a website with a control/treatment variant. After a bit you end up sampling the treatment a little more, say 60/40. You now start running a sale, and the conversion rate for both sides goes up equally. But since you are now sampling more from the treatment variant, its aggregate conversion rate goes up faster than the control's, so you start weighting even more towards that variant.
Fluctuating reward rates are everywhere in e-commerce, and they tend to destabilise MAB proportions even on two identical variants; they can even cause it to lean towards the wrong one. There are more sophisticated MAB approaches that try to remove the stationary reward-rate assumption, but they have to model a lot more uncertainty, and so optimise more conservatively.
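To make that concrete, here's a minimal epsilon-greedy simulation of the effect (not necessarily the exact allocation scheme described above, and all the rates are invented): two identical arms, with a site-wide "sale" that doubles both conversion rates halfway through. The pooled estimates diverge anyway, because the favoured arm soaks up most of the high-rate traffic.

```python
import random

random.seed(0)

# Two identical variants; the only thing that changes is a site-wide "sale"
# that doubles the conversion rate for BOTH arms halfway through.
BASE_RATE, SALE_RATE = 0.05, 0.10
N_VISITORS = 200_000

trials = [0, 0]
wins = [0, 0]

def estimate(arm):
    # Pooled ("aggregate") conversion rate the bandit sees for this arm.
    return wins[arm] / trials[arm] if trials[arm] else 0.0

for i in range(N_VISITORS):
    true_rate = BASE_RATE if i < N_VISITORS // 2 else SALE_RATE

    # Epsilon-greedy allocation: mostly exploit whichever arm's pooled
    # estimate is currently ahead.
    if random.random() < 0.1:
        arm = random.randrange(2)
    else:
        arm = 0 if estimate(0) >= estimate(1) else 1

    trials[arm] += 1
    if random.random() < true_rate:   # identical true rate for both arms
        wins[arm] += 1

print("traffic split:", trials)
print("pooled estimates:", round(estimate(0), 4), round(estimate(1), 4))
```

Both arms are identical the whole time, but the arm that happened to be ahead when the sale started ends up with a visibly higher pooled estimate and most of the traffic.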
If the conversion rate "goes up equally", why did you not measure this and use that as a basis for your decisions?
> its aggregate conversion rate goes up faster than the control - you start weighting even more towards that variant.
This sounds simply like using bad math. Wouldn't this kill most experiments that start with 10% for the variant that does not provide 10x the improvement?
The problem here is that the weighting of the alternatives changes over time and the thing you are measuring may also change. If you start by measuring the better option, but then bring in the worse option in a better general climate, you could easily conclude the worse option is better.
To give a concrete example, suppose you have two versions of your website, one in English and one in Japanese. Worldwide, Japanese speakers tend to be awake at different hours than English speakers. If you don't run your tests over full days, you may bias the results to one audience or the other. Even worse, weekend visitors may be much different than weekday visitors so you may need to slow down to full weeks for your tests.
Changing tests slowly may mean that you can only run a few tests unless you are looking at large effects which will show through the confounding effects.
And that leads back to the most prominent normal use which is progressive deployments. The goal there is to test whether the new version is catastrophically worse than the old one so that as soon as you have error bars that bound the new performance away from catastrophe, you are good to go.
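As a rough sketch of what "error bars that bound the new performance away from catastrophe" can look like in code: a normal-approximation confidence interval on the difference between two conversion rates, checked against a made-up tolerance. The counts and the -1 percentage point margin are placeholders, not anything from this thread.

```python
from math import sqrt
from statistics import NormalDist

def diff_ci(successes_a, n_a, successes_b, n_b, confidence=0.95):
    """Normal-approximation CI for (rate_b - rate_a) between two proportions."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# Old version: 2,000 conversions out of 40,000; new version: 1,030 out of 20,000.
low, high = diff_ci(2000, 40_000, 1030, 20_000)

# "Safe to roll out" here means: even the pessimistic end of the interval
# is above our (made-up) catastrophe threshold of -1 percentage point.
CATASTROPHE_MARGIN = -0.01
print(f"95% CI for the difference: [{low:+.4f}, {high:+.4f}]")
print("keep rolling out" if low > CATASTROPHE_MARGIN else "halt the rollout")
```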
E.g. I could sum up 10 (decimal) and 010 (octal) as 20, but because they are the same digits in different numbering systems, you need to normalize the values to the same base first.
Or I could add up 5 GBP, 5 USD, 5 EUR and 5 JPY and claim I got 20 of "currency", but it doesn't really mean anything.
Otherwise, we are comparing incomparable values, and that's bad math.
Sure, percentages are what everybody gets wrong (hey, percentage points vs percentages), but that does not make them any less wrong. And knowing what is comparable when you simply talk in percentages is even harder (as per your examples).
There are three kinds of lies: lies, damn lies, and statistics
If you aren’t testing at exactly 50/50 - and you can’t, because my plan for visiting a site and for how long will never be equivalent to your plan - then any other factors that can affect conversion rate will cause one partition to go up faster than the other. You have to test at the level of Amazon to get statistical significance anyway. And as many of us have told people until they’re blue in the face: we (you) are not a FAANG company, and pretending to be one won’t work.
I found a post [2] of doing some very rudimentary testing on EXP3 against UCB to see if it performs better in what could be considered an adversarial environment. From what I can tell, it didn't perform all that well.
Do you, or anyone else, have an actual use case for when EXP3 performs better than any of the standard alternatives (UCB, TS, EG)? Do you have experience with running MAB in adversarial environments? Have you found EXP3 performs well?
[0] https://news.ycombinator.com/item?id=42650954#42686404
[1] https://jamesrledoux.com/algorithms/bandit-algorithms-epsilo...
[2] https://www.jeremykun.com/2013/11/08/adversarial-bandits-and...
And with physical retailers with online catalogs, an online sale of one item may cannibalize an in-store purchase of not only that item but three other incidental purchases.
But at the end of the day your 60/40 example is just another way of saying: you don’t try to compare two fractions with a different denominator. It’s a rookie mistake.
Out of curiosity, where did you work? I'm in the same space as you.
Or in the above scenario option B performs a lot better than option A but only with the sale going, otherwise option B performs worse.
We weren’t at the level of hacking our users, just looking at changes that affect response time and resource utilizations, and figuring out why a change actually seems to have made things worse instead of better. It’s easy for people to misread graphs. Especially if the graphs are using Lying with Statistics anti patterns.
The average base rate for the first variant is 5.3%, the second is 6.4%. Generally the favoured variant's average will shift faster because we are sampling it more.
While it's non-obvious this is the effect, anyone analyzing the results should be aware of it and should only compare weighted averages, or per distinct time periods.
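A small worked illustration of why only per-period (or reweighted) comparisons are trustworthy here; the numbers below are invented, not the ones from the example above.

```python
# Two identical variants measured over two periods (before / during a sale).
# Per-period conversion rates are the same for both; only the traffic split differs.
periods = [
    # (control_visits, control_rate, treatment_visits, treatment_rate)
    (10_000, 0.05, 10_000, 0.05),   # before the sale: 50/50 split
    (4_000,  0.08, 16_000, 0.08),   # during the sale: bandit now sends 80% to treatment
]

def pooled(visits_rates):
    n = sum(v for v, _ in visits_rates)
    conversions = sum(v * r for v, r in visits_rates)
    return conversions / n

control = pooled([(cv, cr) for cv, cr, _, _ in periods])
treatment = pooled([(tv, tr) for _, _, tv, tr in periods])
print(f"pooled control:   {control:.4f}")    # ~0.0586
print(f"pooled treatment: {treatment:.4f}")  # ~0.0685
# Identical per-period performance, yet the pooled averages differ -- which is
# why you compare per period (or reweight), not on the raw aggregate.
```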
And therein is the largest problem with A/B testing: it's mostly done by people not understanding the math subtleties, thus they will misinterpret results in either direction.
Additionally, doing randomization on a per-request basis heavily limits the kinds of user behaviors you can observe. Often you want to consistently assign the same user to the same condition to observe long-term changes in user behavior.
This approach is pretty clever on paper but it's a poor fit for how experimentation works in practice and from a system design POV.
That being said, I agree that MABs are poor for experimentation (they produce biased estimates that depend on somewhat hard-to-quantify properties of your policy). But they're not for experimentation! They're for optimizing a target metric.
I think Uber gets away with it because it’s time and location based, not person based. Of course if someone starts pointing out that segregation by neighborhoods is still a thing, they might lose their shiny toys.
So first time user touches feature A they are assigned to some trial arm T_A and then all subsequent interactions keep them in that trial arm until the trial finishes.
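Something like the following sticky assignment is the usual trick: hash the user id together with the experiment name so the same user always lands in the same arm. The function name and key format here are just illustrative.

```python
import hashlib

def assign_arm(user_id, experiment, arms):
    """Deterministically map a user to one arm of an experiment.

    The same (user_id, experiment) pair always yields the same arm, so a user
    stays in their trial arm for the life of the experiment.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(arms)
    return arms[bucket]

print(assign_arm("user-42", "feature-A-trial", ["control", "treatment"]))
# Re-running always gives the same arm for user-42, unlike seeding random()
# with Python's built-in hash(), which is salted per process for strings.
```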
> Naively, you could take the random integer and compute the remainder of the division by the size of the interval. It works because the remainder of the division by D is always smaller than D. Yet it introduces a statistical bias
That's all it says. Is the point here just that 2^31 % 17 is not zero, so 1,2,3 are potentially happening slightly more than 15,16? If so, this is not terribly important
It is not uniformly random, which is the whole point.
> That article is mostly about speed
The article is about how to actually achieve uniform random at high speed. Just doing mod is faster but does not satisfy the uniform random requirement.
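If it helps, here's a toy sketch of the bias and the standard rejection-sampling fix (a generic illustration, not the article's actual code):

```python
import random

RAND_MAX = 2**31      # pretend the generator hands us 0 .. 2**31 - 1
N_BUCKETS = 17

def biased_bucket(r):
    # Plain modulo: the low residues each occur one extra time.
    return r % N_BUCKETS

def unbiased_bucket():
    # Rejection sampling: discard the "overhanging" top values, then take the modulo.
    limit = RAND_MAX - (RAND_MAX % N_BUCKETS)
    while True:
        r = random.getrandbits(31)
        if r < limit:
            return r % N_BUCKETS

# 2**31 = 17 * 126_322_567 + 9, so with plain modulo the residues 0..8 each
# occur 126_322_568 times while 9..16 occur only 126_322_567 times.
print(2**31 % N_BUCKETS)              # 9
print(biased_bucket(random.getrandbits(31)), unbiased_bucket())
```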
If you crave more bandits: https://jamesrledoux.com/algorithms/bandit-algorithms-epsilo...
In my experience, obsessing on the best decision strategy is the biggest honeypot for engineers implementing MAB. Epsilon-greedy is very easy to implement and you probably don't need anything more. Thompson sampling is a pain in the butt, for not much gain.
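For reference, this is roughly what "very easy to implement" means here; a bare-bones epsilon-greedy sketch, with persistence and logging left out:

```python
import random

class EpsilonGreedy:
    def __init__(self, arms, epsilon=0.1):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.trials = {a: 0 for a in self.arms}
        self.rewards = {a: 0.0 for a in self.arms}

    def choose(self):
        # Explore with probability epsilon, otherwise exploit the best observed mean.
        if random.random() < self.epsilon:
            return random.choice(self.arms)
        return max(
            self.arms,
            key=lambda a: self.rewards[a] / self.trials[a] if self.trials[a] else float("inf"),
        )

    def update(self, arm, reward):
        self.trials[arm] += 1
        self.rewards[arm] += reward

bandit = EpsilonGreedy(["orange", "green"])
arm = bandit.choose()
bandit.update(arm, reward=1.0)   # e.g. 1.0 for a click, 0.0 otherwise
```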
In a normal universe, you just import a different library, so both are the same amount of work to implement.
Multiarmed bandit seems theoretically pretty, but it's rarely worth it. The complexity isn't the numerical algorithm but state management.
* Most AB tests can be as simple as a client-side random() and a log file (sketched below).
* Multiarmed bandit means you need an immediate feedback loop, which involves things like adding database columns, worrying about performance (since each render requires another database read), etc. Keep in mind the database needs to now store AB test outcomes and use those for decision-making, and computing those is sometimes nontrivial (if it's anything beyond a click-through).
* Long-term outcomes matter more than short-term. "Did we retain a customer" is more important than "did we close one sale."
In most systems, the benefits aren't worth the complexity. Multiple AB tests also add testing complexity. You want to test three layouts? And three user flows? Now, you have nine cases which need to be tested. Add two color schemes? 18 cases. Add 3 font options? 54 cases. The exponential growth in testing is not fun. Fire-and-forget seems great, but in practice, it's fire-and-maintain-exponential complexity.
And those conversion differences are usually small enough that being on the wrong side of a single AB test isn't expensive.
Run the test. Analyze the data. Pick the outcome. Kill the other code path. Perhaps re-analyze the data a year later with different, longer-term metrics. Repeat. That's the right level of complexity most of the time.
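For completeness, here's the "client-side random() and a log file" version mentioned in the first bullet above; the event names and file path are placeholders.

```python
import json
import random
import time

def assign_variant():
    # Fixed 50/50 split, decided once per user/session and then stored client-side.
    return "A" if random.random() < 0.5 else "B"

def log_event(user_id, variant, event, path="ab_test_log.jsonl"):
    # Append-only log; analysis happens later, offline.
    with open(path, "a") as f:
        f.write(json.dumps({
            "ts": time.time(),
            "user_id": user_id,
            "variant": variant,
            "event": event,          # e.g. "impression" or "conversion"
        }) + "\n")

variant = assign_variant()
log_event("user-42", variant, "impression")
```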
If you step up to multiarm, importing a different library ain't bad.
So if you are doing A/B tests, it is quite reasonable to use Thompson sampling at fixed intervals to adjust the proportions. If your response variable is not time invariant, this is actually best practice.
Whereas the comment you’re responding to is rightly pointing out that for most orgs, the marginal gains of using an approach more complex than Epsilon greedy probably aren’t worth it. I.e., the juice isn’t worth the squeeze.
The difference in performance is smaller and the difference in complexity is much greater. Optimized FFTs are... hairy. But now that someone wrote them, free.
I'm talking about the difference between epsilon-greedy vs. a more complex optimization scheme within the context of implementing MAB. You're making arguments about A/B testing vs MAB.
What's nice about AB testing is the decision can be made on point estimates, provided the two choices don't have different operational "costs". You don't need to know that A is better than B, you just need to pick one and the point estimate gives the best answer with the available data.
I don't know of a way to determine whether A is better than B with statistical significance without letting the experiment run, in practice, for way too long.
(But, it's more likely that you don't know if there's a significant effect size)
https://github.com/raffg/multi_armed_bandit
It shows 10% exploration performs the best; it's a very simple and effective algorithm.
Also it shows the Thompson Sampling algorithm converges a bit faster -- the best arm is chosen by sampling from the beta distribution, which eliminates the explicit explore phase. And you can use the builtin random.betavariate!
https://github.com/raffg/multi_armed_bandit/blob/42b7377541c...
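For the curious, a minimal Bernoulli Thompson sampling sketch along those lines, using the builtin random.betavariate (a generic illustration, not the code from the linked repo):

```python
import random

class ThompsonSampling:
    """Bernoulli Thompson sampling: Beta(1, 1) priors on each arm's conversion rate."""

    def __init__(self, arms):
        self.successes = {a: 0 for a in arms}
        self.failures = {a: 0 for a in arms}

    def choose(self):
        # Draw one sample from each arm's posterior and play the largest draw.
        draws = {
            a: random.betavariate(self.successes[a] + 1, self.failures[a] + 1)
            for a in self.successes
        }
        return max(draws, key=draws.get)

    def update(self, arm, converted):
        if converted:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1

bandit = ThompsonSampling(["A", "B", "C"])
arm = bandit.choose()
bandit.update(arm, converted=True)
```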
Statistical significance is statistical significance, end of story. If you want to show that option B is better than A, then you need to test B enough times.
It doesn't matter if you test it half the time (in the simplest A/B) or 10% of the time (as suggested in the article). If you do it 10% of the time, it's just going to take you five times longer.
And A/B testing can handle multiple options just fine, contrary to the post. The name "A/B" suggests two, but you're free to use more, and this is extremely common. It's still called "A/B testing".
Generally speaking, you want to find the best option and then remove the other ones because they're suboptimal and code cruft. The author suggests always keeping 10% exploring other options. But if you already know they're worse, that's just making your product worse for those 10% of users.
Can you share some more details about your experiences with those particular types of failures?
Another common one I saw was due to different systems handling different treatments and there being caching discrepancies between the two. Especially in a MAB where allocations are constantly changing, if one system has a much longer TTL than the other, you might see allocation lags for one treatment and not the other, biasing the data. Or perhaps one system deploys much more frequently and the load balancer's draining doesn't wait for records to finish uploading before it kills the process.
The most subtle ones were eligibility biases, where one treatment might cause users to drop out of an experiment entirely. Like if you have a signup form and you want to measure long-term retention, and one treatment causes some cohorts to not complete the signup entirely.
There are definitely mitigations for these issues, like you can monitor the expected vs. actual allocations and alert if they go out-of-whack. That has its own set of problems and statistics though.
You get zero benefits from MAB over A/B if you simply end your A/B test once you've achieved statistical significance and pick the best option. Which is what any efficient A/B test does -- there's no reason to have any fixed "testing period" beyond what is needed to achieve statistical significance.
While, to the contrary, the MAB described in the article does not maximize reward -- as I explained in my previous comment. Because the post's version runs indefinitely, it has worse long-term reward because it continues to test inferior options long after they've been proven worse. If you leave it running, you're harming yourself.
And I have no idea what you mean by MAB "generalizing" more. But it doesn't matter if it's worse to begin with.
(Also, it's a huge red flag that the post doesn't even mention statistical significance.)
I disagree. There is a vast array of literature on solving the MAB problem that may as well be grouped into a bin called “how to optimally strike a balance between having one’s cake and eating it too.”
The optimization techniques to solve MAB problem seek to optimize reward by giving the right balance of exploration and exploitation. In other words, these techniques attempt to determine the optimal way to strike a balance between exploring if another option is better and exploiting the option currently predicted to be best.
There is a strong reason this literature doesn’t start and end with: “just do A/B testing, there is no better approach”
If you want to get sophisticated, MAB properly done is essentially just A/B testing with optimal strategies for deciding when to end individual A/B tests, or balancing tests optimally for a limited number of trials. But again, it doesn't "beat" A/B testing -- it is A/B testing in that sense.
And that's what I mean. You can't magically increase your reward while simultaneously getting statistically significant results. Either your results are significant to a desired level or not, and there's no getting around the number of samples you need to achieve that.
> MAB properly done is essentially just A/B testing
Words are only useful insofar as their meanings invoke ideas, and in my experience absolutely no one thinks of other MAB strategies when someone talks about A/B testing.
Sure, you can classify A/B testing as one extremely suboptimal approach to solving the MAB problem. This classification doesn’t help much though, because the other MAB techniques do “magically increase the rewards” compared to this simple technique.
You are quite simply wrong. There is nothing suboptimal about an A/B test between two choices performed until desired statistical significance. There is nothing you can do to magically increase anything.
If you think there is, you'll have to describe something specific. Because nowhere in the academic MAB literature does anyone attempt to state the contrary. And which, again, is why this blog post is so flawed.
The trick is to find the exact best number of tests for each color so that we have good statistical significance. MAB does not do that well, because you cannot easily force testing an option that looked bad before it got enough trials to reach good statistical significance (imagine you have 10 colors and the color orange first scores 0/1. It will take a very long while before this color gets re-tested to any significant degree: you first need to fall into the 10% exploration branch, and then you still only have ~10% chance to randomly pick this color rather than one of the others). With A/B testing, you can do a power analysis beforehand (or at any point during) to know when to stop.
The literature does not start with "just do A/B testing" because it is not the same problem. In MAB, your goal is not to demonstrate that one option is bad, it's to make the best decision you can when faced with a fixed situation.
Yes, A/B testing will force through enough trials to get statistical significance (it is definitely an “exploration first” strategy), but in many cases you care about maximizing reward as well, in particular during testing. A/B testing does very poorly at balancing exploration with exploitation in general.
This is especially true if the situation is dynamic. Will you A/B test forever in case something has changed and give up that long term loss in reward value?
With the A/B testing, you can do power analysis whenever you want, including in the middle of the experiment. It will just be an iterative adjustment that converges.
In fact, you can even run on all possibilities in advance (if A gets 1% and B gets 1%, how many A and B samples do I need; if A gets 2% and B gets 1%; if A gets 3% and B gets 1%; ...) and it will give you the exact boundaries to stop for any configuration before even running the experiment. You will just have to stop trialing option A as soon as it crosses the already-decided significance threshold.
So, no, the A/B testing will never run forever. And A/B testing will always be better than the MAB solution, because you will have a better way to stop trying a bad option as soon as you have crossed the threshold you decided is enough to consider it a bad one.
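For example, the up-front table can be produced with the standard normal-approximation sample-size formula for two proportions. The alpha, power, and assumed rates below are placeholders, and this ignores any sequential-testing corrections.

```python
from math import ceil, sqrt
from statistics import NormalDist

def samples_per_arm(rate_a, rate_b, alpha=0.05, power=0.8):
    """Approximate n per arm to detect rate_a vs rate_b (two-sided z-test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = rate_a * (1 - rate_a) + rate_b * (1 - rate_b)
    return ceil((z_alpha + z_beta) ** 2 * variance / (rate_a - rate_b) ** 2)

# Precompute the table described above: how many samples for each assumed lift.
for lift in (0.01, 0.02, 0.03):
    print(f"1% vs {1 + lift * 100:.0f}%: {samples_per_arm(0.01, 0.01 + lift):,} per arm")
```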
You can solve this with propensity scores, but it is more complicated to implement and you need to log every interaction.
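The core of the idea is inverse-propensity weighting over the logged interactions, something like the sketch below (the field names are made up):

```python
def ips_estimate(logs, arm):
    """Inverse-propensity estimate of an arm's mean reward from bandit logs.

    Each record must contain the arm served, the reward observed, and the
    probability ("propensity") with which the policy chose that arm at the time.
    """
    total = 0.0
    for record in logs:
        if record["arm"] == arm:
            total += record["reward"] / record["propensity"]
    return total / len(logs)

logs = [
    {"arm": "A", "reward": 1, "propensity": 0.6},
    {"arm": "B", "reward": 0, "propensity": 0.4},
    {"arm": "A", "reward": 0, "propensity": 0.9},   # allocation drifted over time
    {"arm": "B", "reward": 1, "propensity": 0.1},
]
print(ips_estimate(logs, "A"), ips_estimate(logs, "B"))
```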
You can add a forgetting factor for older results.
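E.g. exponentially decaying the counts before each update, so old observations gradually stop dominating (the 0.999 factor is arbitrary):

```python
DECAY = 0.999   # per-observation forgetting factor; closer to 1.0 = longer memory

def update(stats, arm, reward, decay=DECAY):
    # Shrink everything seen so far, then add the new observation at full weight,
    # so old results gradually stop dominating the estimates.
    for a in stats:
        stats[a]["trials"] *= decay
        stats[a]["reward"] *= decay
    stats[arm]["trials"] += 1
    stats[arm]["reward"] += reward

stats = {"A": {"trials": 0.0, "reward": 0.0}, "B": {"trials": 0.0, "reward": 0.0}}
update(stats, "A", reward=1.0)
```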
Figuring out best features is a completely different problem.
The caveats (perhaps not mentioned in the article) are:

- Perhaps you have many metrics you need to track/analyze (CTR, conversion, rates on different metrics), so you can't strictly do bandit!
- As someone mentioned below, sometimes the situation is dynamic (so having evenly sized groups helps with capturing this effect)
- Maybe some other ones I can't think of?
But you can imagine this kind of auto-testing being useful... imagine AI continually pushes new variants, and it just continually learns which one is the best
It's useful as long as your definition is good enough and your measurements and randomizations aren't biased. Are you monitoring this over time to ensure that it continues to hold? If you don't, you risk your MAB converging on something very different from what you would consider "the best".
When it converges on the right thing, it's better. When it converges on the wrong thing, it's worse. Which will it do? What's the magnitude of the upside vs downside?
# for each lever,
# calculate the expectation of reward.
# This is the number of trials of the lever divided by the total reward
# given by that lever.
# choose the lever with the greatest expectation of reward.
If I'm not mistaken, this pseudocode has a bug that will result in choosing the expected worst option rather than the expected best option. I believe it should read "total reward given by the lever divided by the number of trials of that lever".We talk about it here: https://blog.growthbook.io/introducing-multi-armed-bandits-i...
CSS to make it noscript friendly: `.main { visibility: visible !important; max-width: 710px; }`
Another thing that I noticed while writing the code (and took advantage of) is that it is insanely scalable in a distributed system, using HyperLogLog and Bloom filters. I may or may not have totally over-engineered my code to take advantage of that, even though my site gets ridiculously low click numbers :-)
> hundreds of the brightest minds of modern civilization have been hard at work not curing cancer. Instead, they have been refining techniques for getting you and me to click on banner ads
I was really hoping this would slowly develop into a statistical technique couched in terms of ad optimization but actually settling in on something you might call ATCG testing (e.g. the biostatistics methods that one would indeed use to cure cancer).
However all this fails. For optimal output (be it drug research, allocation of brains, how to run a life), putting all resources on the problem/thing that is the "most important" is a sub-optimal use of resources. The expected return is always better when you allocate resources to where each unit spent has the best return. If that place is apps, not cancer, then wishing for brains to work on cancer because some would view that as a more important problem may simply be a waste of brains.
So if cancer is going to be incredibly hard to solve, and mankind empirically gets utility from better apps, then a better use is to put those brains on apps - then they're not wasted on a probably unsolvable problem and are put to use making things that do increase value.
He also ignores that in real life the cost of having a zillion running experiments constantly flipping alternatives does not scale, so in no way can a company at scale replace A/B with multiarm bandits. One reason is simple: at any time a large company is running thousands to maybe 100k A/B tests, each running maybe 6 months, at which point a code path is selected, dead paths are removed, and this repeats continually. If that old code is not killed, and every feature from all time needs to be randomly on/off, then there is no way over time to move much of the app forwards. It's not effective or feasible to build many new features if you must also allow interacting with those from 5-10 years ago.
A simple google shows tons more reasons, from math to practical, that this post is bad advice.
It's not hard to keep track of which arm any given user was exposed to in the first run, and then repeat it.
20 lines of code that beat A/B testing (2012) - https://news.ycombinator.com/item?id=11437114 - April 2016 (157 comments)
20 lines of code that beat A/B testing every time - https://news.ycombinator.com/item?id=4040022 - May 2012 (147 comments)
For simple, immediate-feedback cases like button clicks, the specific implementation becomes less critical.
random.seed(hash(user_id))
I think the bigger problem is handling the fact that not all users click through the same number of times. You are correct that this setup can potentially mislead you, but this is because you might end up getting estimators with high variance. So you might mistakenly see some early promising results for experiment group A and greedily assign all the requests to that group, even though it is not guaranteed that A is actually better than B.
This is the famous exploration-exploitation dilemma—should you maximize conversions by diverting everyone to group A or still try to collect more data from group B?
Meanwhile, if your users get presented a different button whenever they come by, because the MAB is still pursuing its hill climbing, they'll rightfully accuse you of having extremely crappy UX. (And, sure, you can have MAB with user stickiness, but now you do need to talk about sampling bias)
And MAB hill climb doesn't work at all if you want to measure the long-term reward of a variation. You have no idea if the orange button has long-term retention impact. There are sure situations where you'd like to know.
Yes, it's a neat technique to have in your repertoire, but like any given technique, it's not the answer "every time".
(Of course, the whole point is that the benefit and safety are not certain, so I think the term "sacrifice" used in the article is misleading.)