How Croct's AB testing engine works

Experimentation · By Isabella Beatriz Silva and Bernardo Favoreto

Since we started onboarding our first customers, we often get asked how our AB testing engine works. In particular, these are the three most frequently asked questions:

  • How are the experiment's results calculated?
  • How is the best performing variant determined?
  • How is the length of an experiment defined?

As we highlighted in the previous post of this series, we use the Bayesian approach to AB testing analysis. In this last post, we will walk you through how the engine calculates the results of an experiment.

Let's dive in!

Updating experience metrics

An AB test can run for many days before showing any stable results. However, unlike engines that use the frequentist approach, our engine calculates the metrics in real time, as new evidence (i.e., data) is incorporated into the probability distribution. This ensures that the confidence in the metrics gradually increases as the system collects more data.
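To make this concrete, here is a minimal sketch of that kind of real-time update, assuming a Beta-Binomial model for conversion rates (a common choice for such engines; the post doesn't detail Croct's exact priors, so the uniform Beta(1, 1) prior below is an assumption):

from dataclasses import dataclass


@dataclass
class VariantPosterior:
    # Beta(1, 1) is a uniform prior: no initial opinion about the conversion rate.
    alpha: float = 1.0
    beta: float = 1.0

    def update(self, conversions: int, sessions: int) -> None:
        # Fold a new batch of evidence into the posterior distribution.
        self.alpha += conversions
        self.beta += sessions - conversions

    @property
    def mean(self) -> float:
        # Posterior mean: the current best estimate of the conversion rate.
        return self.alpha / (self.alpha + self.beta)


# Each batch of sessions refines the estimate; no fixed sample size is required.
variant_a = VariantPosterior()
variant_a.update(conversions=50, sessions=1_000)
print(f"Estimated conversion rate: {variant_a.mean:.2%}")  # ~5.09%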

Defining a baseline

Usually, we set up AB tests to find a variant that performs better than the original (the baseline). However, there won't always be a baseline (e.g., when creating a new experience from scratch), so the engine automatically defines which variant acts as the baseline so you don't have to.

To do so, we dynamically select the variant with the worst conversion rate as the baseline for comparison. This allows computing relative metrics (e.g., uplift or potential loss) that help analysts understand how a given variant performs compared to the baseline.
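As an illustration, the selection itself can be as simple as picking the variant with the lowest observed conversion rate. The sketch below is hypothetical (variant names and counts are made up):

def pick_baseline(variants: dict[str, tuple[int, int]]) -> str:
    # variants maps a variant name to (conversions, sessions);
    # the one with the lowest conversion rate becomes the baseline.
    return min(variants, key=lambda name: variants[name][0] / variants[name][1])


variants = {"A": (70, 1_000), "B": (50, 1_000)}
print(f"Baseline: variant {pick_baseline(variants)}")  # Baseline: variant B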

Understanding the metrics

Before we dive into the details, let's review the most crucial metrics when working with a Bayesian AB testing engine.

Table showing a Bayesian analysis for an AB test with two variants.

Conversion rate

The conversion rate is the primary metric to understand the performance of each variant. For example, if a variant has 50 conversions in 1,000 sessions, the rate is 5%.

Uplift

Uplift is a relative metric to help users understand how a given variant is performing in comparison with the baseline. For example, if a variant has a 7% conversion rate against the baseline's 5%, the uplift is 40%.
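Putting the two metrics together, the arithmetic is straightforward (the counts below are illustrative, chosen to match the 7% vs. 5% example above):

conversions_a, sessions_a = 70, 1_000   # variant A
conversions_b, sessions_b = 50, 1_000   # baseline

rate_a = conversions_a / sessions_a     # 7.0%
rate_b = conversions_b / sessions_b     # 5.0%
uplift = (rate_a - rate_b) / rate_b     # (0.07 - 0.05) / 0.05 = 40%

print(f"A: {rate_a:.1%}, baseline: {rate_b:.1%}, uplift: {uplift:.0%}")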

Credible intervals

Credible intervals are helpful to represent the level of uncertainty in an estimate. They indicate the probability that the actual parameter value (i.e., the conversion rate) lies inside the interval.

An essential characteristic of credible intervals is that they narrow as the system collects more evidence, providing even more confidence about the conversion rate.

For example, constructing an interval with 1,000 data points yields:

  • 95% credible interval: 6.9% < conversion rate < 10.3%
  • Most likely conversion rate for variant A: 8.5%
  • Interval width: 3.5 percentage points

This reads as "with 95% confidence, the true conversion rate is somewhere between 6.9% and 10.3%." The interval width indicates significant uncertainty (a variation of around 40% relative to the most likely rate) due to the small sample size of 1,000 data points.

Keeping the same average conversion rate of ~8.5%, the interval for 10,000 data points is:

  • 95% credible interval: 8.0% < conversion rate < 9.0%
  • Most likely conversion rate for variant A: 8.5%
  • Interval width: 1.1 percentage points

The potential conversion rates are now much closer to the most likely value. Moreover, the interval is considerably narrower, representing a variation of only ~12%.

We provide credible intervals because, when dealing with probability, there's always uncertainty involved. Credible intervals give a perspective on that uncertainty and increase the confidence in the decision.
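For reference, here is a sketch of how such intervals can be read off a Beta posterior with SciPy, assuming a uniform Beta(1, 1) prior and a conversion rate of roughly 8.5%; it approximately reproduces the intervals quoted above:

from scipy.stats import beta

for conversions, sessions in [(85, 1_000), (850, 10_000)]:
    posterior = beta(1 + conversions, 1 + sessions - conversions)
    low, high = posterior.ppf([0.025, 0.975])  # central 95% credible interval
    print(f"{sessions:>6} sessions: {low:.1%} < conversion rate < {high:.1%} "
          f"(width: {100 * (high - low):.1f} points)")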

Probability to be best

The Probability to Be Best (PBB) is the most crucial decision metric. It's advisable to declare a winning variant only once this metric reaches a predefined threshold.

In Bayesian AB testing, computing the PBB involves simulating tens of thousands of hypothetical results. At each step, the simulation samples a random value from each variant's distribution and compares the samples to see which one "wins" that round (i.e., which has the largest conversion rate in that round).

For example, if we're running 10,000 simulations with a threshold of 95%, one of the variants must win 9,500 or more rounds to be declared the winner. As the name suggests, the probability to be best represents precisely how likely a variant is to be the best among all.
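Under the hood, that simulation can be as simple as sampling from each variant's posterior and counting wins. Here is a hedged sketch with made-up counts, again assuming Beta posteriors with a uniform prior:

import numpy as np

rng = np.random.default_rng(seed=42)
simulations = 10_000

# One array of posterior samples per variant: (conversions, sessions)
# combined with a uniform Beta(1, 1) prior.
samples = {
    "A": rng.beta(1 + 70, 1 + 1_000 - 70, size=simulations),
    "B": rng.beta(1 + 50, 1 + 1_000 - 50, size=simulations),
}

# In each simulated round, the variant with the largest sampled rate "wins".
stacked = np.column_stack(list(samples.values()))
wins = np.bincount(stacked.argmax(axis=1), minlength=len(samples))

for name, win_count in zip(samples, wins):
    print(f"Variant {name}: probability to be best = {win_count / simulations:.1%}")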

A reasonable value for the winning threshold is 95%, as this confidence level is acceptable for most companies. Nonetheless, you can set it to any preferred level that suits your needs.

Potential loss

The potential loss is the second key decision metric to understand the performance of each variant. It represents how much conversion rate a variant can lose by making a particular choice, given that the choice turns out to be wrong. For example, it's the average conversion rate we can lose by rolling out the winning variant when it isn't actually the best.

Typically, a threshold of caring defines the acceptable potential loss. In the case of ties, the PBB decides the winner.
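One common way to estimate the potential loss, assuming the same Beta posteriors as in the sketch above, is to average how much conversion rate the chosen variant gives up in the simulated rounds where it loses:

import numpy as np

rng = np.random.default_rng(seed=42)
samples_a = rng.beta(1 + 70, 1 + 1_000 - 70, size=10_000)  # variant A
samples_b = rng.beta(1 + 50, 1 + 1_000 - 50, size=10_000)  # variant B

# Expected loss of rolling out A: zero in rounds where A is actually better,
# the shortfall (B - A) in rounds where it isn't.
potential_loss_a = np.mean(np.maximum(samples_b - samples_a, 0.0))
print(f"Potential loss of choosing A: {potential_loss_a:.4%}")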

Waiting for stability

Sometimes the results are not yet stable enough to declare a winner, even when the potential losses of the variants are not significantly different from each other. To determine whether the experiment results are stable, check whether:

  • Each variant received at least 1,000 visitors
  • Each variant has at least 25 conversions.

These two conditions mainly ensure that there is enough data to draw reliable conclusions. As the number of visitors and conversions grows, the best-performing variant usually becomes clearer.

Declaring a winner

Once the best-performing variant is determined, it's time to declare a winner. But first, we need to make sure that the metrics are stable enough to provide a reliable conclusion.

To determine the level of stability, we recommend using the following criteria:

  • Each variant received at least 1,000 visitors
  • Each variant has at least 25 conversions
  • The probability to be best is above the winning threshold
  • The potential loss is below the threshold of caring
  • The test has been running for at least one week.

These requirements ensure the test results have reached some stability and we don't expect them to change drastically. Failing to enforce such conditions might result in false positives.
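These criteria are simple enough to encode directly. The helper below is a hypothetical sketch: the 1,000-visitor, 25-conversion, and one-week values come from the list above, while the default threshold of caring is an assumption, not a value from this post:

from dataclasses import dataclass


@dataclass
class VariantStats:
    visitors: int
    conversions: int
    probability_to_be_best: float  # 0..1
    potential_loss: float          # expected conversion-rate loss, 0..1


def can_declare_winner(
    candidate: VariantStats,
    all_variants: list[VariantStats],
    days_running: int,
    winning_threshold: float = 0.95,
    threshold_of_caring: float = 0.001,  # assumption; tune to your own risk tolerance
) -> bool:
    enough_data = all(
        variant.visitors >= 1_000 and variant.conversions >= 25
        for variant in all_variants
    )
    return (
        enough_data
        and candidate.probability_to_be_best >= winning_threshold
        and candidate.potential_loss <= threshold_of_caring
        and days_running >= 7
    )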

For example, in a test with two variants, 1,000 sessions each, 10 conversions for variant A, and 1 for variant B, the metrics would show that:

  • Variant A has a 99.93% probability of being the best
  • The potential loss of choosing A over B is 0.0000%

Thus, if the metrics were the only deciding factor, variant A would be declared the winner. However, because there is too little evidence to draw any conclusions, this decision would possibly yield an outcome worse than expected, even though the metrics indicate there's virtually no chance of that happening.

Create your free account and explore our platform for yourself.
