This is part 1 of a 5-part series on A/B Testing.
Data-driven decision-making is only effective if, in the end, you actually make a decision. As we have reviewed, even when you have a wealth of data, it can be challenging to make a difficult decision when you have a number of different options to choose between.
But, what if you didn’t have to choose? What if you could choose all your options?
A/B Testing is a method of testing different options at the same time, in the real world, so that you can choose the one that performs best. Its name comes from the idea of segmenting your customers into two groups, Group A and Group B, applying a change to one group (the experimental group) but not the other (the control group), and measuring the difference. No more guessing: you can see which of your options performs best in reality, with actual customers in real situations. A/B Testing is used in a variety of industries and applications, ranging from product design to marketing.
For example, if you have two different pricing plans and you are curious which one will generate the most revenue, you can use A/B Testing to test both plans with real customers and see which group generates the most revenue by the end of the test. Similarly, if you are curious how different marketing messages might convert inbound leads, you can A/B Test many messages to see which has the highest conversion rate with actual leads.
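To make the revenue example concrete, here is a minimal sketch of how the outcome of a two-group test is typically evaluated, using a standard two-proportion z-test. The group sizes and conversion counts are made-up illustrative numbers, not from any real test.

```python
import math
from statistics import NormalDist

def ab_test_p_value(conversions_a, size_a, conversions_b, size_b):
    """Two-sided p-value for the difference in conversion rates between groups."""
    p_a = conversions_a / size_a
    p_b = conversions_b / size_b
    # Pooled conversion rate under the null hypothesis (no real difference)
    p_pool = (conversions_a + conversions_b) / (size_a + size_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / size_a + 1 / size_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Group A (control): 500 of 5,000 converted; Group B: 580 of 5,000
p_value = ab_test_p_value(500, 5000, 580, 5000)
print(f"p-value: {p_value:.4f}")  # below 0.05, so the lift is unlikely to be chance
```

A small p-value (conventionally below 0.05) means the observed difference is unlikely to be random noise, which is what lets you "see which of your options performs the best in reality."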
If it sounds too good to be true, that’s because there is a lot of complexity hiding in the details. A/B Testing has many traps and challenges you will need to overcome to use it effectively, all of which we will cover this week!
If you do use A/B Testing as part of your product or service development, you should likely use existing tools that automate much of the complexity. However, by understanding how it works and the mistakes you can make along the way, I hope you’ll be an expert in using those tools!
A/B Testing: Population Sampling
This is part 2 of a 5-part series on A/B Testing.
When running an A/B Test, the most important question to answer is how big your groups need to be for the result to be reliable. In other words, how many observations do the control and experimental groups each need before you can confidently trust the result?
Spoiler Alert: Generally speaking, the answer is, a lot more than you think!
Statisticians have developed a numerical technique called Power Analysis (also known as a sensitivity analysis) to determine the number of observations needed. Power Analysis relies on two inputs: the size of the change you expect to detect and the statistical confidence you want in your results. For example, you might want to detect a 5% change in customer behavior with 95% confidence. I won't go into the mathematics of how Power Analysis works (there is a great example here) because you should use software to calculate it for you.
One of the sobering parts of Power Analysis is that you quickly learn you need a LARGE sample to have trustworthy results. Much larger than you would have guessed! This is one reason that email providers recommend that, when testing email subjects, each of your control and experimental groups contain at least 5,000 customers. That means, to run a single A/B Test, you need at least 10,000 total email recipients.
What do you do if you don't have 10,000 customers? You need to relax one of your constraints: either look for bigger changes (20% instead of 5%) or lower your confidence requirement (80% instead of 95%). Your results will be less reliable, but that may be the best you can do with the audience you have.
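As a sketch of what power-analysis software computes under the hood, here is the standard approximation for a two-proportion test. The baseline and expected conversion rates are illustrative assumptions, and real tools handle many refinements this simple formula omits.

```python
import math
from statistics import NormalDist

def sample_size_per_group(p_baseline, p_expected, alpha=0.05, power=0.80):
    """Approximate observations needed per group for a two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance level
    z_power = NormalDist().inv_cdf(power)           # desired statistical power
    variance = p_baseline * (1 - p_baseline) + p_expected * (1 - p_expected)
    effect = p_expected - p_baseline
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)

# Detecting a small change needs a surprisingly large sample...
print(sample_size_per_group(0.10, 0.105))  # 5% relative lift: tens of thousands per group
# ...while looking for a bigger change shrinks the requirement dramatically
print(sample_size_per_group(0.10, 0.12))   # 20% relative lift: a few thousand per group
```

Notice how hunting for a 20% relative change instead of a 5% one cuts the required sample by more than an order of magnitude, which is exactly the trade-off described above.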
A/B Testing: Multi-armed Bandit Tests
This is part 3 of a 5-part series on A/B Testing.
One of the weaknesses of A/B Testing is that each test compares only two options. If you have many options to choose from, it can take a long time to run A/B Tests for each option (against the control) one at a time. Why can't you run all of them at the same time?
You can, using a Multi-armed Bandit Test (also known as a multivariate test). These tests are similar to A/B Tests but are designed to test many options at the same time and to move quickly to the most effective one. They do this by dividing the problem into two parts:
Explore: This phase tests the possible options to see which performs best.
Exploit: This phase uses the best option to get the best performance.
There are a few different ways that you can run Multi-armed Bandit Tests, but here I will focus on a common method known as Epsilon-Greedy, which runs both Explore and Exploit at the same time! It does this by dividing your customers into two groups (Explore and Exploit). Typically, your Explore group will be 20% of your activity and in that group all available options will be tested side by side. The Exploit group will be the other 80% of your activity and use whichever option is performing best in the Explore tests.
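The Epsilon-Greedy loop described above can be sketched in a few lines; the 20/80 split appears as `epsilon = 0.2`. The option names and their hidden "true" open rates are made-up numbers for illustration.

```python
import random

def epsilon_greedy_choice(stats, epsilon=0.2):
    """Explore with probability epsilon; otherwise exploit the current best option."""
    def observed_rate(option):
        successes, trials = stats[option]
        return successes / trials if trials else 0.0
    if random.random() < epsilon:
        return random.choice(list(stats))   # Explore: pick any option at random
    return max(stats, key=observed_rate)    # Exploit: pick the best rate so far

random.seed(7)
# Three made-up email subject lines with hidden "true" open rates
true_rates = {"A": 0.10, "B": 0.18, "C": 0.12}
stats = {option: [0, 0] for option in true_rates}  # option -> [opens, sends]
for _ in range(5000):
    option = epsilon_greedy_choice(stats)
    stats[option][0] += random.random() < true_rates[option]  # record an open
    stats[option][1] += 1                                     # record a send
print({option: trials for option, (_, trials) in stats.items()})
```

Running this, the trial counts concentrate on whichever option performs best, which is exactly the "move more customers to that option" behavior described next.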
Of course, at first you have no best option, so all activity is used for Explore. However, as soon as one option pulls ahead, it is used for both Explore and Exploit. For this reason, an Epsilon-Greedy test initially tries all of the options (say, Options A, B, and C) side by side, but as soon as it is clear that Option B is performing best it moves more and more customers to that option.
The advantage of Multi-armed Bandit Tests is speed: the test identifies and shifts traffic to the best option on its own, so you avoid wasting time on sub-optimal options. However, this technique requires completely interchangeable options! That means it works well for testing email subject lines or colors on a website, but is hard to apply to things like pricing and product features.
A/B Testing: Traps to Avoid
This is part 4 of a 5-part series on A/B Testing.
Whether you are running an A/B Test or a Multi-armed Bandit test, there are many common traps you can fall into, which will cause your results to be misleading. Traps, you say? We must avoid them! These are some of the most common traps and how to avoid them:
- Small samples. Yes, you may have 1 million customers but how many of them use the feature you are going to test? If you only have 100 customers using that feature you may not have a large enough sample to get reliable results from your A/B Test. Before running a test, be sure to understand the required sample size and that you have enough customer activity to create the observations you need.
- Vague Hypotheses. Your test is designed to test something, but what is that something? If you aren't crystal clear on what you are testing and what results you expect, you won't be able to trust the results. It is not as simple as "I think Option A will generate more revenue". You need to be specific in your test hypothesis or else you can't rule out that other factors influenced the result. A good hypothesis might be "I think that changing our email subject line to X will increase our email open rates by Y%".
- Competing Tests. If you run more than one test at a time with the same group of customers, your tests may be competing with each other. How will you know if the improvement you see in Option A from Test 1 is real or a result of Option B of Test 2? Running multiple tests can cause all sorts of data pollution when tests share the same customers.
That last trap is a doozy! I have no doubt that you want to run more than one test at a time, but if doing so jeopardizes all of your tests what should you do? We’ll cover that tomorrow when we go over running simultaneous tests on the same customers.
A/B Testing: Simultaneous Tests
This is part 5 of a 5-part series on A/B Testing.
Everything we have discussed this week assumes you are testing one aspect of your product or service using an A/B or Multi-armed Bandit Test. In the real world, I’m sure you have dozens of different features you would like to test every day! Running simultaneous tests is possible, but dangerous.
The worst case scenario, which happens all the time, is that the same customer is exposed to multiple tests, with each interaction testing different features. Having the same customer experience multiple, different tests makes it hard to discern which feature affected that customer's behavior! Was it Option A of Test 1 that increased purchases, or was it Option B of Test 2? The more tests you run, the harder it is to tell the difference.
There are some cases where running multiple A/B Tests at the same time should not cause you any problems:
- If the groups of customers exposed to each test are mutually exclusive, so that no customers participating in Test 1 are also participating in Test 2.
- If the overlap between customers in Test 1 and Test 2 is very small (say 1% of all customers) so any error introduced should be minor.
- If the tests are of features so distinct and different that they cannot influence the same customer behavior(s).
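One simple way to guarantee the first (mutually exclusive) case is to assign each customer to exactly one test deterministically, for example by hashing a customer ID. The IDs and test names below are illustrative assumptions, not from any particular product.

```python
import hashlib

def assign_test(customer_id, tests):
    """Deterministically map a customer to exactly one test.

    Hashing keeps the assignment stable across sessions and makes the
    test populations mutually exclusive by construction.
    """
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    return tests[int(digest, 16) % len(tests)]

tests = ["pricing_test", "subject_line_test"]
print(assign_test("customer-42", tests))  # same customer, same test, every time
```

Because the assignment is a pure function of the customer ID, no customer can ever appear in two tests at once, which sidesteps the data-pollution trap entirely (at the cost of halving the audience available to each test).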
If you need to run multiple tests but cannot meet one of those criteria, you will need to use your judgement. Experts differ on whether it's a good idea to run overlapping, competing tests. In my experience, the arguments for both sides are:
- Yes, Run Simultaneous Tests. Some error and bias in your test results is better than having no data to make the decision at all.
- No, Don’t Run Simultaneous Tests. There is no point in running a test if you cannot clearly rely on the results.
Personally, I think it comes down to whether the test is an input to your decision or if it makes the decision. In the former case it is fine to have some uncertainty because you will make the final call. If the test itself is making the decision (such as in a Multi-Armed Bandit Test) you need to isolate your testing because the computer algorithm is going to automatically make the choice without considering the potential bias introduced.
Testing is a skill that improves with use, so the more you use it the better you will get. Time to start testing!