Affected: All SDKs using Bayesian analysis
Symptoms
Customers may observe the following behavior during A/A experiments:
The experiment concludes early and declares a winning variation after only a few days.
One variation is reported as statistically significant relative to the other, even though both are intended to be identical.
Cause
Unexpected statistical significance in an A/A experiment typically results from one or both of the following factors:
Bayesian analysis method: Bayesian tests are more sensitive and prone to false positives, particularly in A/A scenarios where no real effect exists.
Short experiment duration: Running tests for only a few days may not capture natural traffic variation, leading to misleading conclusions.
Solution
To reduce the likelihood of unexpected statistical significance in an A/A experiment, use a frequentist statistical model instead of a Bayesian model.
Frequentist analysis is better suited to an A/A experiment because it assesses statistical significance by comparing the observed result to what would be expected if there were no difference between the control and treatment groups. This approach aligns well with an A/A test, where both variations are expected to perform the same.
Frequentist analyses also use p-values to quantify statistical significance and do not incorporate prior beliefs. This makes them a more objective option for validating that an experiment is not detecting noise as a meaningful result.
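To illustrate the frequentist reasoning described above, the following Python sketch computes a two-sided p-value with a pooled two-proportion z-test and then simulates repeated A/A experiments. It is a minimal sketch under stated assumptions, not the analysis the product itself performs: the function names, the 10% conversion rate, the sample sizes, and the 0.05 significance threshold are all hypothetical values chosen for the example.

```python
import math
import random


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a pooled two-proportion z-test.

    Hypothetical helper for illustration. The null hypothesis is that
    both variations share the same underlying conversion rate.
    """
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Every user converted or no user converted: no evidence of a difference.
        return 1.0
    z = (conv_a / n_a - conv_b / n_b) / se
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_aa_false_positive_rate(runs=500, users_per_variation=1000,
                                    rate=0.10, alpha=0.05, seed=42):
    """Fraction of simulated A/A experiments that reach p < alpha.

    All parameter values are assumptions chosen for the example.
    Both variations draw conversions from the same true rate.
    """
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(runs):
        conv_a = sum(rng.random() < rate for _ in range(users_per_variation))
        conv_b = sum(rng.random() < rate for _ in range(users_per_variation))
        p = two_proportion_p_value(conv_a, users_per_variation,
                                   conv_b, users_per_variation)
        if p < alpha:
            false_positives += 1
    return false_positives / runs


if __name__ == "__main__":
    # Identical observed results give no evidence against the null hypothesis.
    print(two_proportion_p_value(100, 1000, 100, 1000))
    # Share of A/A runs flagged as significant purely by chance.
    print(simulate_aa_false_positive_rate())
```

Under these assumptions, roughly 5% of simulated A/A experiments would be expected to cross the 0.05 threshold by chance alone, so an occasional significant A/A result does not by itself indicate a problem with the experiment setup.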
We recommend running the experiment for at least one week to account for daily traffic variation. Customers can continue running the experiment longer to collect more data before evaluating the results.
Resources