kqr
I don't think the mathematics is what gets most people into trouble. You can get by with relatively primitive maths, and the advanced stuff is really just a small-order-of-magnitude cost optimisation.

What gets people into trouble is incorrect procedures. To get a sense of all the ways in which an experiment can go wrong, I'd recommend reading more traditional texts on experimental design, survey research, etc.

- Donald Wheeler's Understanding Variation should be mandatory reading for almost everyone working professionally.

- Deming's Some Theory of Sampling is really good and covers more ground than the title lets on.

- Deming's Sample Design in Business Research I remember being formative for me as well, although it has been a while since I read it.

- Efron and Tibshirani's Introduction to the Bootstrap gives an intuitive sense of some experimental errors from a different perspective.

I know there's one book covering survey design I really liked but I forget which one it was. Sorry!

sebg
Hi

Have you looked into these two?

- Trustworthy Online Controlled Experiments by Kohavi, Tang, and Xu

- Statistical Methods in Online A/B Testing by Georgi Georgiev

Recommended by stats stackexchange (https://stats.stackexchange.com/questions/546617/how-can-i-l...)

There are a bunch of other books/courses/videos on O'Reilly.

Another potential way to approach this learning goal is to look at Evan's tools (https://www.evanmiller.org/ab-testing/) and go into each one and then look at the JS code for running the tools online.

See if you can go through and comment/write out your thoughts on why it's written that way. Of course, you'll need to know some JS for that, but it might be helpful to go through a file like (https://www.evanmiller.org/ab-testing/sample-size.js) and figure out what math is being done.
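
For a sense of what a file like that is computing, here is a rough Python sketch of the standard two-proportion sample-size formula -- my guess at the kind of math such a calculator implements, not a transcription of Evan's code, and the 5% -> 6% example numbers are made up:

```python
from statistics import NormalDist

def sample_size_per_arm(p_baseline, p_treatment, alpha=0.05, power=0.80):
    """Approximate visitors needed per arm to detect p_baseline -> p_treatment."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)   # two-sided significance threshold
    z_beta = z(power)            # desired power
    p_bar = (p_baseline + p_treatment) / 2
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p_baseline * (1 - p_baseline)
                             + p_treatment * (1 - p_treatment)) ** 0.5) ** 2
    return int(numerator / (p_baseline - p_treatment) ** 2) + 1

# e.g. detecting a lift from a 5% to a 6% conversion rate at 80% power:
print(sample_size_per_arm(0.05, 0.06))  # roughly 8,200 visitors per arm
```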

nanis
Early in the A/B craze (the "optimal shade of blue" nonsense), I was talking to someone high up at an online hotel reservation company who was telling me how great A/B testing had been for them. I asked him how they chose the stopping point/sample size. He told me experiments continued until they observed a statistically significant difference between the two conditions.

The arithmetic is simple and cheap. Understanding basic intro stats principles, priceless.
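
To see why that stopping rule is a problem: in the simulation sketch below (all numbers invented), both arms convert at exactly the same 5% rate, yet checking a z-test after every batch and stopping at the first p < 0.05 declares a "winner" far more often than the 5% the significance level promises.

```python
import random
from statistics import NormalDist

def stops_with_false_winner(p=0.05, batch=500, max_batches=40, alpha=0.05):
    """Run A and B with identical rates, peeking at the p-value after each batch."""
    a_conv = b_conv = n = 0
    for _ in range(max_batches):
        a_conv += sum(random.random() < p for _ in range(batch))
        b_conv += sum(random.random() < p for _ in range(batch))
        n += batch
        pooled = (a_conv + b_conv) / (2 * n)
        se = (pooled * (1 - pooled) * 2 / n) ** 0.5
        if se > 0:
            z = abs(a_conv / n - b_conv / n) / se
            if 2 * (1 - NormalDist().cdf(z)) < alpha:
                return True   # declared a "winner" even though A == B
    return False

random.seed(0)
trials = 200
rate = sum(stops_with_false_winner() for _ in range(trials)) / trials
print(f"false positive rate with peeking: {rate:.0%}")  # typically 20-30%, not 5%
```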

phyalow
Experimentation for Engineers: From A/B Testing to Bayesian Optimization, by David Sweet

This book is really great and I highly recommend it. It goes broader than A/B testing, but covers everything quite well from a first-principles perspective.

https://www.manning.com/books/experimentation-for-engineers

Maro
My blog has tons of articles about A/B testing, with math and Python code to illustrate. Good starting point:

https://bytepawn.com/five-ways-to-reduce-variance-in-ab-test...
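
For a taste of the kind of thing involved, one common variance-reduction trick is CUPED-style adjustment with a pre-experiment covariate; here is a minimal sketch on made-up data (not code from the blog):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
pre = rng.gamma(shape=2.0, scale=10.0, size=n)    # pre-experiment spend per user
post = 0.8 * pre + rng.normal(0.0, 5.0, size=n)   # in-experiment metric, correlated with pre

# CUPED: subtract the part of the metric explained by the pre-experiment covariate
theta = np.cov(pre, post)[0, 1] / np.var(pre, ddof=1)
post_cuped = post - theta * (pre - pre.mean())

print(f"metric variance: {post.var():.0f} -> {post_cuped.var():.0f} after CUPED")
# the treatment-effect estimate keeps the same expectation, but its confidence
# interval tightens because the metric's variance drops (here from ~150 to ~25)
```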

youainti
Just as some basic context, there are two related approaches to A/B testing. The first comes from statistics and looks like standard hypothesis testing of differences in means or medians. The second comes from machine learning and is framed as a multi-armed bandit problem. Both are valid; they just come with different tradeoffs.
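
To make the first (classical statistics) approach concrete, here is a minimal two-proportion z-test on made-up conversion counts; the bandit flavour is sketched in other comments further down the thread.

```python
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

z, p = two_proportion_z_test(conv_a=480, n_a=10_000, conv_b=560, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.3f}")  # with these invented counts, p is roughly 0.01
```
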
rancar2
I once wanted a structured approach before I had access to large amounts of traffic. Once I had traffic available, the learning happened naturally (background in engineering with advanced math). If you are lucky enough to start learning through hands-on experience, I'd check out: https://goodui.org/

I was lucky to get trained well by 100m+ users over the years. If you have a problem you are trying to solve, I’m happy to go over my approach to designing optimization winners repeatedly.

Alex, I will shoot you an email shortly. Also, sebg's comment is good if you are looking for the more academic route to learning.

nivertech
A/B Testing: An interactive look at Thompson sampling

https://everyday-data-science.tigyog.app/a-b-testing
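
The page walks through this interactively; underneath, Thompson sampling with Beta posteriors is roughly the following sketch (the "true" rates are invented and unknown to the algorithm):

```python
import random

true_rates = {"A": 0.05, "B": 0.06}   # unknown in real life
successes = {"A": 0, "B": 0}
failures = {"A": 0, "B": 0}

random.seed(1)
for _ in range(10_000):
    # sample a plausible conversion rate for each arm from its Beta posterior
    draws = {arm: random.betavariate(successes[arm] + 1, failures[arm] + 1)
             for arm in true_rates}
    arm = max(draws, key=draws.get)          # show the variant that looks best
    if random.random() < true_rates[arm]:    # simulate whether the visitor converts
        successes[arm] += 1
    else:
        failures[arm] += 1

print({arm: successes[arm] + failures[arm] for arm in true_rates})
# most traffic drifts to B as its posterior pulls ahead
```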

gjstein
I'd also like to mention the classic book "Reinforcement Learning" by Sutton & Barto, which goes into some of the relevant mathematics for choosing the "best" among a set of options. The full PDF is available for free on their website [1]. Chapter 2, on "Multi-Armed Bandits", is where to start.

[1] http://incompleteideas.net/book/the-book-2nd.html
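
For a flavour of that chapter, its simplest method is epsilon-greedy action selection; a minimal sketch with made-up conversion rates (explore a random arm with probability epsilon, otherwise exploit the best-looking one):

```python
import random

true_rates = [0.05, 0.06, 0.04]    # hypothetical conversion rates per variant
counts = [0] * len(true_rates)
values = [0.0] * len(true_rates)   # running estimate of each arm's rate
epsilon = 0.1

random.seed(2)
for _ in range(10_000):
    if random.random() < epsilon:
        arm = random.randrange(len(true_rates))                       # explore
    else:
        arm = max(range(len(true_rates)), key=lambda i: values[i])    # exploit
    reward = 1.0 if random.random() < true_rates[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]   # incremental mean update

print(counts)  # the 6% arm should end up with most of the pulls
```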

AlexeyMK
If you'd rather go through some of this live, we have a section on Stats for Growth Engineers in the Growth Engineering Course on Reforge (course.alexeymk.com). We talk through stat sig, power analysis, common experimentation footguns and alternate methodologies such as Bayesian, Sequential, and Bandits (which are typically Bayesian). Running next in October.

Other than that, Evan's stuff is great, and the Ron Kohavi book gets a +1, though it is definitely dense.

simulo
For learning about the basics of statistics, my go-to resource is "Discovering Statistics Using [R/SPSS]" (Andy Field). "Improving Your Statistical Inferences" (Daniel Lakens) needs some basics, but covers a lot of interesting topics, including sequential testing and equivalence tests (sometimes you want to know if a new thing is equivalent to the old).
austin-cheney
When I used to do A/B testing, all results per traffic funnel were averaged over time into cumulative results. The tests ran as long as they needed to attain statistical confidence between the funnels, where confidence was the ratio of differentiation between results over time after discounting for noise and variance.

Only at test completion were financial projections attributed to test results. Don’t sugar coat it. Let people know up front just how damaging their wonderful business ideas are.

The biggest learning from this is that the financial projections from the tests were always far too optimistic compared to what the eventual development delivered in production. The tests were always correct. The cause of the discrepancies was shitty development. If a new initiative to production is defective or slow, it will not perform as well as the tests projected. Web development is full of shitty developers who cannot program for the web, and our tests were generally ideal in their execution.

epgui
In my experience the most helpful and generalizable resources have been resources on “experimental design” in biology, and textbooks on linear regression in the social sciences. (Why these fields is actually an interesting question but I don’t feel like getting into it.)

A/B tests are just a narrow special case of these.
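
To illustrate the "special case" point: regressing the outcome on a 0/1 treatment indicator reproduces the usual difference-in-means comparison, and adding covariates is the natural generalisation. A minimal sketch on simulated data (using statsmodels; the column names are made up):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 5_000
df = pd.DataFrame({"treated": rng.integers(0, 2, size=n)})
df["converted"] = (rng.random(n) < 0.05 + 0.01 * df["treated"]).astype(int)

fit = smf.ols("converted ~ treated", data=df).fit()
print(fit.params["treated"])   # equals the raw difference in conversion rates
print(fit.pvalues["treated"])  # same inference as the usual two-sample comparison
# extending to "converted ~ treated + pre_experiment_spend" is how the regression
# view generalises beyond the plain A/B comparison
```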

vishnuvram
I really liked Richard McElreath's Statistical Rethinking https://youtube.com/playlist?list=PLDcUM9US4XdPz-KxHM4XHt7uU...

SkyPuncher
In my experience, there's just not much depth to the math behind A/B testing. It all comes down to whether A or B affects parameter X without negatively affecting parameter Y. This is all basic analysis stuff.

The harsh reality is A/B testing is only an optimization technique. It’s not going to fix fundamental problems with your product or app. In nearly everything I’ve done, it’s been a far better investment to focus on delivering more features and more value. It’s much easier to build a new feature that moves the needle by 1% than it is to polish a turd for 0.5% improvement.

That being said, there are massive exceptions to this. When you’re at scale, fractions of percents can mean multiple millions of dollars of improvements.

RyJones
I worked for Ron Kohavi - he has a couple books. "Practical Guide to Controlled Experiments on the Web: Listen to Your Customers not to the HiPPO", and "Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing". I haven't read the second, but the first is easy to find and peruse.
tgtweak
No specific literature to recommend, but understanding sample size and margin of error/confidence interval calculations will really help you understand A/B testing. Beyond A/B, this will also help with multivariate testing, which has mostly replaced A/B in orgs that are serious about testing.
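
Concretely, the margin-of-error calculation for the difference between two conversion rates looks roughly like this (the counts are invented):

```python
from statistics import NormalDist

def diff_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Normal-approximation CI for the difference in conversion rates (B minus A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = (p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b) ** 0.5
    margin = NormalDist().inv_cdf(0.5 + confidence / 2) * se   # the margin of error
    diff = p_b - p_a
    return diff - margin, diff + margin

low, high = diff_confidence_interval(conv_a=300, n_a=6_000, conv_b=350, n_b=6_000)
print(f"lift is in [{low:+.4f}, {high:+.4f}] with 95% confidence")
# if the interval straddles 0, the test has not yet separated the variants
```
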
benreesman
When seeking to both explore better treatments and also exploit good ones the mathematical formalism often used is a “bandit”.

https://en.m.wikipedia.org/wiki/Multi-armed_bandit

rgbrgb
One of my fav resources for binomial experiment evaluation + a lot of explanation: https://thumbtack.github.io/abba/demo/abba.html
_0ffh
It's maybe not what most people would recommend, but I'd suggest you read up on regret minimization and best arm identification for multi-armed bandit problems. That way it'll probably be useful and fun! =)
daxaxelrod
Growthbook wrote a short paper on how they evaluate test results continuously.

https://docs.growthbook.io/GrowthBookStatsEngine.pdf

graycat
Bradley Efron, The Jackknife, the Bootstrap, and Other Resampling Plans, ISBN 0-89871-179-7, SIAM, Philadelphia, 1982.

cpeterso
Anyone have fun examples of A/B tests you’ve run where the results were surprising or hugely lopsided?
crdrost
So the thing I always ctrl-F for, to see if a paper or course really knows what it's talking about, is called the “multi-armed bandit” problem. Just ctrl-F bandit, if an A/B tutorial is long enough it will usually mention them.

This is not a foolproof method; I'd call it only ±5 dB of evidence, so it would shift a 50% likelihood that they know what they're talking about to roughly 75% if present or 25% if absent, but obviously look at the rest of it and see if that's borne out. And to be clear: even mentioning it, if only to dismiss it, counts!

So e.g. I remember reading a whitepaper about “A/B Tests are Leading You Astray” and thinking “hey that's a fun idea, yeah, effect size is too often accidentally conditioned on whether the result was judged statistically significant, which would be a source of bias” ...and sure enough a sentence came up, just innocently, like, “you might even have a bandit algorithm! But you had to use your judgment to discern that that was appropriate in context.” And it’s like “OK, you know about bandits but you are explicitly interested in human discernment and human decision making, great.” So, +5 dB to you.

And on the flip-side if it makes reference to A/B testing but it's decently long and never mentions bandits then there's only maybe a 25% chance they know what they are talking about. It can still happen, you might see e.g. χ² instead of the t-test [because usually you don't have just “converted” vs “did not convert”... can your analytics grab “thought about it for more than 10s but did not convert” etc.?] or something else that piques interest. Or it's a very short article where it just didn't come up, but that's fine because we are, when reading, performing a secret cost-benefit analysis and short articles have very low cost.
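
For the χ² idea above: once the outcome has more than two categories, a contingency-table test is the natural drop-in. A small sketch with invented counts ("hesitated" standing in for "thought about it for more than 10s but did not convert"):

```python
from scipy.stats import chi2_contingency

#         converted  hesitated  bounced
table = [[      480,       900,    8620],   # variant A
         [      560,       950,    8490]]   # variant B

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.3f}")
# a small p-value says the distribution over the three behaviours differs between
# variants, which is more information than the conversion rate alone
```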

For a non-technical thing you can give to your coworkers, consider https://medium.com/jonathans-musings/ab-testing-101-5576de64...

Researching this comment led me to this video, which looks interesting and which I’ll need to watch later, about how you have to pin down the time needed to properly make the choices in A/B testing: https://youtu.be/Fs8mTrkNpfM?si=ghsOgDEpp43yRmd8

Some more academic looking discussions of bandit algorithms that I can't vouch for personally, but would be my first stops:

- https://courses.cs.washington.edu/courses/cse599i/21wi/resou...

- https://tor-lattimore.com/downloads/book/book.pdf

- http://proceedings.mlr.press/v35/kaufmann14.pdf
