Statistical Significance

Every experiment needs a clear answer: did this change actually work, or was it just noise? Elevate uses Bayesian statistical modeling to continuously evaluate your experiment results and give you a reliable answer.

This page explains how Elevate determines whether your results are trustworthy, what each status means, and when it's safe to act on what you see.


How It Works

When your experiment is running, Elevate collects visitor and conversion data for each variation. Behind the scenes, a Bayesian model evaluates the probability that each variation is the best performer for your selected goal metric.

The model accounts for the type of metric being measured:

  • Discrete metrics like Conversion Rate, Add to Cart Rate, and Checkout Start Rate are evaluated using a binary outcome model — did the visitor convert or not.

  • Continuous metrics like Revenue Per Visitor, Average Order Value, and Profit Per Visitor use a model designed for revenue-style data, which naturally includes a mix of zero and non-zero values.

As more data flows in, the model updates and the probabilities stabilize. This is the core of how Elevate moves from uncertainty to a confident result.
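
To make this concrete, here is a minimal sketch of how a probability-to-win could be computed for a discrete metric such as Conversion Rate, using a Beta-Binomial model and Monte Carlo sampling. The priors, function name, and example counts are illustrative assumptions, not Elevate's internal implementation.

```python
import numpy as np

def probability_to_win(visitors, conversions, draws=100_000, seed=0):
    """Estimate the probability that each variation has the highest true
    conversion rate, using a Beta(1, 1) prior and Monte Carlo sampling.

    visitors, conversions: one entry per variation (control first).
    """
    rng = np.random.default_rng(seed)
    # One column of posterior draws per variation.
    samples = np.column_stack([
        rng.beta(1 + c, 1 + (v - c), size=draws)
        for v, c in zip(visitors, conversions)
    ])
    best = samples.argmax(axis=1)  # index of the winning variation in each draw
    return np.bincount(best, minlength=samples.shape[1]) / draws

# Hypothetical example: control vs. one variation
print(probability_to_win(visitors=[2_500, 2_500], conversions=[100, 120]))
```

Continuous metrics such as Revenue Per Visitor would need a different likelihood (for example, a model that handles a mix of zero and non-zero values), but the idea is the same: sample from each variation's posterior and count how often each one comes out on top.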


Experiment Statuses

Every running experiment displays a status that reflects how confident you can be in the current results. These statuses update automatically as data is collected.

Collecting Data

The experiment has started but hasn't yet reached the minimum data thresholds. Results shown during this phase are preliminary and should not be acted on.

Before any statistical evaluation begins, Elevate requires:

  • At least 3 days of runtime — this protects against the novelty effect, where visitors may react strongly to something simply because it's new, not because it's actually better.

  • A minimum number of visitors and conversions per variation — the exact thresholds depend on your store's monthly order volume (see Data Thresholds below).

Trending Positive

Early data suggests that a variation is outperforming the control, but the probability hasn't crossed the 75% mark yet. Results are directional — they hint at a potential winner, but there isn't enough evidence to confirm it.

Keep the experiment running.

Trending Negative

Early data suggests the control is outperforming the variation(s). Like Trending Positive, this is directional and not yet confirmed. The experiment needs more data before you can draw conclusions.

Near Significance

The probability of the leading variation being the true winner is between 75% and 90%. The experiment is getting close to a definitive result. At this stage, the data is showing a consistent pattern, but Elevate holds off on calling it significant until the threshold is crossed.

This is a good sign — don't end the experiment early.

Significant

The probability has crossed 90%, meaning there's strong statistical evidence that the winning variation genuinely outperforms the others. This is the point where you can confidently act on the results — whether that means applying the winning variation to your store or using the insight to inform your next experiment.

Not Significant

The experiment was completed without reaching the 90% probability threshold. The data collected wasn't strong enough to declare a clear winner. This doesn't mean the experiment failed — it means the difference between variations was too small to measure reliably, or the experiment didn't run long enough.

Unlikely to be Significant

Elevate has determined that, based on the current trajectory, reaching statistical significance would take an unreasonably long time. This usually means the true difference between your control and variation is very small. Consider ending the experiment and testing a bolder change.
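
A simplified decision rule tying these statuses together might look like the sketch below. The 3-day, 75%, and 90% cutoffs come from this page; the function shape, argument names, and tie-breaking are assumptions for illustration only.

```python
def experiment_status(days_running, visitors_ok, conversions_ok,
                      prob_best_variation, control_leading):
    """Map runtime, data thresholds, and probability-to-win to a status label.

    prob_best_variation: probability that the leading variation is the true winner.
    control_leading: True when the control is currently ahead.
    """
    if days_running < 3 or not (visitors_ok and conversions_ok):
        return "Collecting Data"
    if prob_best_variation >= 0.90:
        return "Significant"
    if prob_best_variation >= 0.75:
        return "Near Significance"
    return "Trending Negative" if control_leading else "Trending Positive"
```

The real evaluation also tracks whether significance is still reachable in a reasonable time, which is how an experiment ends up as Unlikely to be Significant.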


Data Thresholds

Elevate adjusts its minimum requirements based on your store's size to balance speed with reliability. Stores with higher traffic can afford stricter thresholds, while smaller stores get lower minimums so experiments don't stall.

                             Small Stores    Large Stores
Visitors per variation       500             2,500
Conversions per variation    20              100
Minimum runtime              3 days          3 days

Checkout experiments always use the small store thresholds, regardless of store size, because checkout traffic is naturally lower than page-level traffic.

The "large store" threshold applies to stores processing $100K+ in monthly revenue. If your store falls below this, the small store thresholds apply.

These are the minimum requirements for statistical evaluation to begin. In general, more data leads to more reliable results. Let your experiments run until they reach a definitive status.
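
The thresholds above can be expressed as a small lookup, sketched here. The $100K monthly-revenue cutoff, the threshold values, and the checkout exception come from this page; the function and field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Thresholds:
    visitors_per_variation: int
    conversions_per_variation: int
    min_runtime_days: int = 3

def data_thresholds(monthly_revenue_usd: float, is_checkout_experiment: bool) -> Thresholds:
    """Return the minimum data requirements before statistical evaluation begins.

    Checkout experiments always use the small-store thresholds because
    checkout traffic is naturally lower than page-level traffic.
    """
    if monthly_revenue_usd >= 100_000 and not is_checkout_experiment:
        return Thresholds(visitors_per_variation=2_500, conversions_per_variation=100)
    return Thresholds(visitors_per_variation=500, conversions_per_variation=20)
```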


Key Metrics on Your Experiment Report

When viewing an experiment, Elevate displays several key indicators at the top of your report.

Probability to Win

This is the core output of the Bayesian model. It represents the estimated likelihood that each variation is the best performer for your selected goal metric. The probabilities across all variations always add up to 100%.

As the experiment collects more data, the probabilities stabilize. Early on, they may swing — that's normal. Over time, a clear leader usually emerges.

If no variation has a probability above 50%, there is no clear leader yet.

Projected Revenue

A forward-looking estimate of how much additional revenue the winning variation could generate over the next 30 days, based on the current rate of visitors and the performance difference between variations. This metric appears once the experiment has moved past the Collecting Data phase.

The projection is calculated using your store's actual visitor velocity — how many unique visitors the experiment page receives, extrapolated over a 30-day window.
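
As a rough illustration, a 30-day projection of this kind can be assembled by multiplying the observed per-visitor revenue lift by the page's visitor velocity. The function name and inputs below are assumptions; Elevate's exact formula may differ.

```python
def projected_revenue_30d(rpv_winner: float, rpv_control: float,
                          visitors_per_day: float) -> float:
    """Estimate extra revenue from the winning variation over the next 30 days.

    rpv_*           : observed revenue per visitor for each experience
    visitors_per_day: unique visitors the experiment page receives per day
    """
    lift_per_visitor = rpv_winner - rpv_control
    return lift_per_visitor * visitors_per_day * 30

# Hypothetical example: $0.12 more per visitor at 800 visitors per day
print(round(projected_revenue_30d(2.62, 2.50, 800), 2))  # -> 2880.0
```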

Experiment Goal

The primary metric your experiment is optimizing for. Elevate supports the following goals:

Goal                   What It Measures
Conversion Rate        Percentage of visitors who complete a purchase
Add to Cart Rate       Percentage of visitors who add an item to their cart
Checkout Start Rate    Percentage of visitors who begin the checkout process
Revenue Per Visitor    Average revenue generated per unique visitor
Average Order Value    Average dollar amount per completed order
Profit Per Visitor     Average profit generated per unique visitor (requires cost data)

The goal you select determines how the Bayesian model evaluates winner probabilities.


Advanced Statistical Details

For those who want to go deeper, Elevate provides an advanced details view on each experiment report with additional statistical metrics.

Statistical Power

The probability that the experiment will correctly detect a real difference between variations, given the current sample size and effect size. Higher power means less risk of missing a real improvement. Elevate calculates power based on a target of detecting a 10% relative improvement in your goal metric.

Power below 80% means the experiment may not have enough data to detect smaller but meaningful differences.
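
For intuition, power for a conversion-rate comparison can be approximated with a two-proportion normal approximation, as in the sketch below. Targeting a 10% relative improvement follows this page; the significance level and the approximation itself are illustrative assumptions, not Elevate's exact calculation.

```python
from math import sqrt
from scipy.stats import norm

def approximate_power(baseline_rate: float, n_per_variation: int,
                      relative_lift: float = 0.10, alpha: float = 0.10) -> float:
    """Approximate power to detect a relative lift in conversion rate
    with a two-sided two-proportion z-test (normal approximation)."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    se = sqrt(p1 * (1 - p1) / n_per_variation + p2 * (1 - p2) / n_per_variation)
    z_alpha = norm.ppf(1 - alpha / 2)
    return norm.cdf(abs(p2 - p1) / se - z_alpha)

# Hypothetical example: 3% baseline conversion, 2,500 visitors per variation
print(round(approximate_power(0.03, 2_500), 2))
```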

Minimum Detectable Effect (MDE)

The smallest difference between your control and variation that your experiment is capable of reliably detecting at the current sample size. A smaller MDE means your experiment can catch subtler differences. If your MDE is larger than the actual difference, the experiment may not reach significance — it's not that the variation doesn't work, it's that the effect is too small for the current data to measure.
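
The MDE is essentially the inverse question to power: given the current sample size, what is the smallest lift that would be detected with, say, 80% power? The sketch below uses the same normal approximation as above; the 80% power target and significance level are illustrative assumptions.

```python
from math import sqrt
from scipy.stats import norm

def approximate_mde(baseline_rate: float, n_per_variation: int,
                    power: float = 0.80, alpha: float = 0.10) -> float:
    """Smallest absolute difference in conversion rate detectable at the given
    power, using a two-sided two-proportion normal approximation."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    se = sqrt(2 * baseline_rate * (1 - baseline_rate) / n_per_variation)
    return (z_alpha + z_power) * se

# Hypothetical example: 3% baseline, 500 visitors per variation
mde = approximate_mde(0.03, 500)
print(f"absolute MDE: {mde:.4f} (relative: {mde / 0.03:.0%})")
```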

Credible Interval

A 90% credible interval for the control's conversion rate, calculated using the Wilson score method. This tells you the range within which the true conversion rate is very likely to fall. A narrow interval means high confidence in the observed rate; a wide interval means more data is needed.
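
The Wilson score interval has a closed form, sketched here for a 90% interval. The formula itself is standard; the function name and example counts are assumptions.

```python
from math import sqrt
from scipy.stats import norm

def wilson_interval(conversions: int, visitors: int, confidence: float = 0.90):
    """Wilson score interval for an observed conversion rate."""
    z = norm.ppf(1 - (1 - confidence) / 2)
    p = conversions / visitors
    denom = 1 + z**2 / visitors
    center = (p + z**2 / (2 * visitors)) / denom
    half_width = (z / denom) * sqrt(p * (1 - p) / visitors + z**2 / (4 * visitors**2))
    return center - half_width, center + half_width

# Hypothetical example: 100 conversions out of 2,500 visitors
low, high = wilson_interval(100, 2_500)
print(f"{low:.3%} to {high:.3%}")
```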

Expected Loss / Expected Gain

Estimates the potential revenue impact per 100 visitors of choosing one variation over another. If the control is winning, this appears as "Expected Loss" — the revenue you'd lose by implementing the variation. If a variation is winning, it appears as "Expected Gain."
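
Expected loss and expected gain can be estimated from the same kind of posterior samples used for probability to win. The sketch below computes the expected revenue per 100 visitors given up by shipping one experience when the other is actually better; the function name and the sampling setup are illustrative assumptions.

```python
import numpy as np

def expected_loss_per_100(samples_chosen, samples_alternative):
    """Expected revenue per 100 visitors given up by shipping the chosen
    experience when the alternative turns out to be better, based on
    posterior samples of revenue per visitor."""
    shortfall = np.maximum(samples_alternative - samples_chosen, 0.0)
    return float(shortfall.mean() * 100)

# Hypothetical posterior draws of revenue per visitor
rng = np.random.default_rng(0)
control = rng.normal(2.50, 0.10, size=50_000)
variation = rng.normal(2.55, 0.10, size=50_000)

# Expected loss of implementing the variation if the control is truly better
print(round(expected_loss_per_100(samples_chosen=variation, samples_alternative=control), 2))
```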


Common Questions

My probability to win is high, but the status isn't Significant yet. Why?

The probability to win is one piece of the puzzle. Elevate also requires minimum time and data thresholds to be met before declaring significance. Early probabilities can be volatile — a high number on day two doesn't carry the same weight as a high number after two weeks of consistent data. Let the experiment run until the status updates.

How long should I run my experiment?

There's no fixed answer — it depends on your traffic volume and the size of the effect. A bold change on a high-traffic page might reach significance in under a week. A subtle change on a lower-traffic page could take several weeks. The status labels are your best guide. If you're seeing "Collecting Data" for an extended period, it may mean your page doesn't get enough traffic for that particular experiment.

What if my experiment ends as Not Significant?

This means the difference between your control and variation was too small to measure with confidence — or the experiment didn't collect enough data. It's not a failure. You've learned that the tested change likely doesn't have a meaningful impact, which is valuable information. Consider testing a bigger, bolder change next time.

Should I stop an experiment early if one variation is clearly winning?

No. Early leads can be misleading, especially in the first few days. The minimum runtime and sample requirements exist specifically to prevent premature conclusions. Wait for the status to reach Significant before making decisions.

What does it mean when the control is winning?

It means your original experience is outperforming the variation(s). This is a valid and useful result — it confirms that your current setup is strong for that particular element. You can end the experiment and move on to testing something else.
