
A/B testing plan template: A step-by-step framework for product teams

Most A/B tests fail because of poor planning, not poor execution. This template provides a step-by-step framework for structuring experiments that produce statistically valid, actionable results.

CleverX Team

Most A/B tests fail before they start.

Not because the variant was wrong or the sample was too small, but because the test was never properly planned. Teams skip the hypothesis, pick metrics that do not align with the business question, launch without calculating sample size, and then peek at results daily until something looks significant.

A written A/B testing plan prevents these mistakes. It forces clarity on what you are testing, why you are testing it, how you will measure success, and what decision you will make based on each possible outcome. Without this structure, experiments become opinion-laundering exercises where teams cherry-pick data to confirm what they already believed.

This template provides a complete framework for planning, documenting, and analyzing A/B tests that produce statistically valid, actionable results. It works for product teams, growth teams, and UX teams running experiments on any digital product.

Key takeaways

  • Every A/B test needs a written plan created before the test launches, not after results come in
  • Hypotheses must be specific and grounded in evidence from user research, analytics, or support data
  • Define primary, secondary, and guardrail metrics upfront to prevent post-hoc metric shopping
  • Calculate sample size before launching so you know how long the test needs to run
  • Pre-commit to decision criteria so results interpretation is not influenced by which variant is winning
  • Document learnings from every test, including failures, to build institutional experiment knowledge

What should an A/B testing plan include?

A complete A/B testing plan covers nine sections: test overview, background, hypothesis, variants, metrics, sample size and duration, implementation, analysis, and results documentation. Each section serves a specific purpose in preventing common experiment failures.

The template below is ready to copy and adapt for your team.

A/B testing plan template

Section 1: Test overview

| Field | Value |
| --- | --- |
| Test name | [Descriptive name identifying what is being tested] |
| Test ID | [Internal tracking ID] |
| Status | Planning / Running / Analysis / Complete |
| Owner | [Name, team] |
| Stakeholders | [Teams or individuals who need to be informed] |
| Planned start date | [Date] |
| Planned end date | [Date based on sample size calculation] |

Section 2: Background and research basis

Context: [Brief description of the product area being tested and relevant history. Include previous test results in this area, known performance issues, or user research findings that motivated this experiment.]

Problem or opportunity: [What specific problem are you solving? Be precise. “Increase conversion” is too vague. “Reduce 23% drop-off at payment method selection, where usability testing revealed users struggle with option clarity” is specific enough to design a test around.]

Evidence supporting this test: [Cite specific data. Examples:]

  • [User research finding]: “[Quote or insight from research session]”
  • [Analytics data]: “[Metric showing the problem exists and its magnitude]”
  • [Support tickets]: “[Pattern from customer complaints or questions]”
  • [Heuristic review]: “[Issue identified through expert evaluation]”

Alternative explanations: [What else could explain the problem? If users are dropping off at payment selection, is it confusion about options, unexpected total price, security concerns, or something else? A good test plan acknowledges competing explanations and designs the test to distinguish between them.]

Section 3: Hypothesis

Write your hypothesis using this structure:

“We believe that [specific change] will [expected outcome] for [target users] because [evidence-based reason].”

Example hypotheses:

“We believe that showing shipping cost on the product page (rather than only at checkout) will reduce checkout abandonment by 10-15% for users who add items to cart, because user interviews show that unexpected shipping costs are the primary stated reason for cart abandonment.”

“We believe that replacing the 5-field registration form with a single email field plus progressive profiling will increase sign-up completion by 20% for new visitors, because session recordings show 40% of users abandon at the third form field.”

A good hypothesis is:

  • Specific about the change, the expected outcome, and the affected users
  • Measurable with a clear metric and expected direction
  • Falsifiable, meaning the test can prove it wrong
  • Grounded in evidence, not gut feeling

Section 4: Test variants

Control (A): [Describe the current experience exactly as it exists today] [Include screenshot or visual reference]

Variant (B): [Describe what changes in the variant] [Include screenshot or visual reference]

Changes between control and variant:

| Element | Control (A) | Variant (B) |
| --- | --- | --- |
| [Element 1] | [Current state] | [Changed state] |
| [Element 2] | [Current state] | [Changed state] |

Important: Isolate a single change per test whenever possible. When multiple elements change simultaneously, you cannot attribute results to any specific change. If you must test a larger redesign, treat it as a holistic test and plan follow-up tests to isolate individual variables.

For early-stage validation before committing to a full A/B test, concept testing and preference testing can help narrow down which variant directions are worth testing at scale.

Section 5: Success metrics

Primary metric (one only):

| Field | Value |
| --- | --- |
| Metric name | [e.g., checkout completion rate] |
| Definition | [Exactly how it is calculated: numerator / denominator] |
| Expected direction | Increase / Decrease |
| Minimum detectable effect | [Smallest meaningful change, e.g., +5% relative improvement] |
| Current baseline | [Current value of this metric] |

The primary metric is the single number that determines whether this test wins or loses. Pick one. If you cannot decide on one metric, you do not have a clear enough hypothesis.

Secondary metrics (2-3 maximum):

| Metric | Expected direction | Why monitor it |
| --- | --- | --- |
| [e.g., Average order value] | [Increase / No change] | [Ensures shipping display does not reduce order size] |
| [e.g., Add-to-cart rate] | [No change] | [Confirms the change does not affect upstream behavior] |

Secondary metrics provide context but do not determine the test outcome. They help you understand why the primary metric moved (or did not).

Guardrail metrics:

| Metric | Acceptable range | Why this is a guardrail |
| --- | --- | --- |
| [e.g., Revenue per session] | Must not decrease by more than 2% | [Protects overall revenue even if conversion improves] |
| [e.g., Support ticket volume] | Must not increase by more than 10% | [Ensures the change does not create confusion] |
| [e.g., Page load time] | Must not increase by more than 200ms | [Protects performance] |

Guardrail metrics protect against winning on your primary metric while damaging something else. A test that improves sign-up rate but doubles support tickets is not a win.

For a broader framework on selecting the right UX metrics for your experiments, see our complete metrics guide.

Section 6: Sample size and duration

Sample size calculation:

| Input | Value |
| --- | --- |
| Baseline conversion rate | [Current control rate, e.g., 3.2%] |
| Minimum detectable effect | [Relative %, e.g., a 15% relative lift takes a 3.2% baseline to a 3.68% target rate] |
| Statistical significance level | 95% (alpha = 0.05) |
| Statistical power | 80% (beta = 0.20) |
| Required sample per variant | [Calculated number] |
| Total required (all variants) | [Sum across variants] |

Use an online calculator (Evan Miller, Statsig, or Optimizely’s calculator) or your experimentation platform’s built-in tool. For guidance on sample sizing, see our research sample size guide.
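As a cross-check on those tools, the per-variant sample size can be sketched with the standard two-proportion normal approximation. This is a simplified sketch, not a replacement for your platform's calculator (different tools make slightly different assumptions); the example numbers mirror the table above.

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_variant(baseline, mde_relative, alpha=0.05, power=0.80):
    """Approximate per-variant sample size for a two-proportion test
    (two-sided alpha, normal approximation)."""
    p1 = baseline
    p2 = baseline * (1 + mde_relative)  # target rate implied by the relative MDE
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# Example from the table: 3.2% baseline, 15% relative MDE
print(sample_size_per_variant(0.032, 0.15))  # roughly 22,600 per variant
```

Note how sensitive the result is to the MDE: halving the detectable effect roughly quadruples the required sample.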

Duration estimate:

| Field | Value |
| --- | --- |
| Daily eligible traffic | [Users/day reaching the tested area] |
| Traffic allocation | [% of traffic in the experiment, e.g., 100%] |
| Traffic split | [e.g., 50/50 between control and variant] |
| Estimated days to reach sample | [Calculated based on traffic and required sample] |
| Maximum run duration | [Cap at 4 weeks to avoid seasonality bias] |

Traffic allocation:

  • Control (A): [50%]
  • Variant (B): [50%]

Rules:

  • Do not peek at results before reaching the required sample size
  • Do not stop the test early because one variant “looks like it is winning”
  • If sample size cannot be reached within 4 weeks, reconsider the minimum detectable effect or traffic allocation
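The duration estimate itself is simple arithmetic: total required sample divided by the daily traffic entering the experiment. A minimal sketch, assuming steady daily traffic and an even split across variants (both assumptions; the input numbers are hypothetical):

```python
from math import ceil

def estimated_days(required_per_variant, daily_traffic, allocation=1.0, variants=2):
    """Days to reach the required sample size, assuming steady daily
    traffic and an even traffic split across all variants."""
    total_needed = required_per_variant * variants
    daily_in_test = daily_traffic * allocation
    return ceil(total_needed / daily_in_test)

# Hypothetical inputs: 22,631 users per variant, 5,000 eligible users/day, 100% allocation
print(estimated_days(22_631, 5_000))  # 10 days
```

If the result exceeds your 4-week cap, the levers are the same as in the rules above: raise the minimum detectable effect or increase traffic allocation.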

Section 7: Implementation and QA

Where the test runs:

| Field | Value |
| --- | --- |
| Page/flow | [Specific URL or product flow] |
| Device scope | [All devices / Desktop only / Mobile only] |
| Geographic scope | [All regions / Specific markets] |
| Platform | [Web / iOS / Android / All] |

User eligibility: [Who is included: all users, new users only, logged-in users, specific segments]

Exclusion criteria: [Who is excluded: internal users, users in other active tests, specific holdout groups]

QA checklist:

  • Control experience verified as current production state
  • Variant experience matches design specification exactly
  • Analytics events firing correctly for both variants
  • Primary, secondary, and guardrail metrics all tracking
  • Exclusion logic implemented and tested
  • No conflict with other active tests on the same flow
  • Test visible on all targeted devices and browsers
  • Performance impact measured (page load, API response time)

Section 8: Analysis plan

Pre-committed analysis details:

| Field | Value |
| --- | --- |
| Who analyzes | [Name/team responsible] |
| Analysis date | [Date based on reaching sample size, not on results] |
| Statistical method | [Frequentist / Bayesian / Sequential testing] |
| Correction for multiple comparisons | [Yes/No, method if yes] |

Decision matrix:

| Result | Action |
| --- | --- |
| Variant significantly improves primary metric AND guardrails are within range | Ship variant |
| Variant improves primary metric BUT guardrail is outside range | Do not ship. Investigate guardrail impact |
| No statistically significant difference | Do not ship. Variant is not better than control |
| Variant significantly worsens primary metric | Do not ship. Document learning |
| Test does not reach sample size within maximum duration | Declare inconclusive. Consider redesigning the test |

Pre-committing to these decisions prevents the most common analysis mistake: moving the goalposts after seeing results.
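If the pre-committed method is a frequentist comparison of two conversion rates, the p-value can be computed with a standard two-proportion z-test. A minimal sketch (the conversion counts below are hypothetical, sized to match the earlier example):

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates,
    using a pooled standard error (normal approximation)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_b - p_a, p_value

# Hypothetical results: 724 conversions of 22,631 (control) vs 833 of 22,640 (variant)
diff, p = two_proportion_z_test(724, 22_631, 833, 22_640)
print(f"absolute lift {diff:.4f}, p = {p:.4f}")
```

Whatever method you choose, the point of Section 8 is that the method, not the result, decides when and how this calculation runs.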

Section 9: Results documentation

Complete this section after the test concludes.

Test duration: [Start date] to [End date]
Samples collected: Control: [N] / Variant: [N]

Primary metric results:

| Metric | Control | Variant | Relative change | p-value | Significant? |
| --- | --- | --- | --- | --- | --- |
| [Primary metric] | [value] | [value] | [+/- %] | [p] | Yes / No |

Secondary metric results:

| Metric | Control | Variant | Relative change | Significant? |
| --- | --- | --- | --- | --- |
| [Metric 1] | [value] | [value] | [+/- %] | Yes / No |
| [Metric 2] | [value] | [value] | [+/- %] | Yes / No |

Guardrail metric results:

| Metric | Control | Variant | Change | Within range? |
| --- | --- | --- | --- | --- |
| [Guardrail 1] | [value] | [value] | [+/- %] | Yes / No |

Decision: Ship variant / Do not ship / Extend test / Iterate

Key learnings: [What does this test reveal about user behavior or product performance, regardless of whether the variant won? Every test produces learning, even failures.]

Follow-up actions: [Next steps based on results: ship and monitor, design a follow-up test, conduct qualitative research to understand why results occurred]

How do you write a good A/B test hypothesis?

A strong hypothesis has four components: the change, the expected outcome, the target users, and the evidence-based reasoning.

Weak hypothesis: “Changing the button color will increase conversions.”

This is weak because it does not specify which button, what color, how much increase, for which users, or why you believe color matters.

Strong hypothesis: “We believe that changing the ‘Add to Cart’ button from gray to green on product detail pages will increase add-to-cart rate by 8-12% for mobile users, because heatmap data shows the current gray button has low visual contrast on mobile screens and receives 35% fewer taps than desktop.”

Ground hypotheses in real evidence: user research findings, analytics data, user feedback, or heuristic evaluation insights.

What are the most common A/B testing mistakes?

Peeking at results before reaching sample size

Checking results daily and stopping when they “look significant” inflates your false positive rate dramatically. A result that appears significant at day 3 may disappear by day 10. Pre-calculate sample size and commit to the analysis date.
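The inflation is easy to demonstrate with a simulation: run A/A tests (both arms identical, so any "significant" result is a false positive) and peek at the z-score at the end of every day, stopping at the first crossing of 1.96. A minimal sketch; all parameters are illustrative:

```python
import random
from math import sqrt

def peeking_false_positive_rate(n_sims=400, daily=250, days=12,
                                z_crit=1.96, true_rate=0.05, seed=42):
    """Simulate A/A tests where the team checks significance daily and
    stops at the first |z| > z_crit. Returns the false positive rate."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_sims):
        conv_a = conv_b = n_a = n_b = 0
        for _day in range(days):
            # Both arms draw from the SAME conversion rate: no true effect
            for _ in range(daily):
                n_a += 1
                conv_a += rng.random() < true_rate
                n_b += 1
                conv_b += rng.random() < true_rate
            p_pool = (conv_a + conv_b) / (n_a + n_b)
            se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
            if se > 0 and abs(conv_b / n_b - conv_a / n_a) / se > z_crit:
                false_positives += 1  # stopped early on noise
                break
    return false_positives / n_sims

# With 12 daily peeks, the rate lands well above the nominal 5%
print(peeking_false_positive_rate())
```

Each daily check is another chance for noise to cross the threshold, which is exactly why the plan fixes a single analysis date in advance.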

Testing without a hypothesis

Running tests to “see what happens” produces noise, not insights. Without a hypothesis, you cannot distinguish signal from random variation, and you have no framework for interpreting unexpected results.

Changing multiple variables simultaneously

When you change the headline, the image, the button text, and the layout all at once, you cannot determine which change drove the result. Test one variable at a time unless you are deliberately running a holistic redesign test.

Choosing the wrong primary metric

Your primary metric must directly measure the outcome your hypothesis predicts. If your hypothesis is about reducing checkout abandonment, your primary metric should be checkout completion rate, not page views or time on page.

Ignoring guardrail metrics

A test that improves sign-up rate by 15% but increases 7-day churn by 20% is a net loss. Always define guardrail metrics that protect against unintended negative consequences.

When should you use A/B testing vs. other research methods?

A/B testing answers “which option performs better” but not “why.” Use it alongside other methods:

| Question | Best method |
| --- | --- |
| Which design performs better at scale? | A/B testing |
| Why are users struggling with this flow? | Usability testing |
| Which concept direction should we pursue? | Concept testing |
| Which design do users prefer visually? | Preference testing |
| What are users’ mental models and needs? | User interviews |
| How do users interact with the prototype? | Prototype testing |
| What patterns exist in current behavior? | Product analytics |

The strongest experiment programs use qualitative research to generate hypotheses, A/B testing to validate them, and post-test analysis to understand the results.

A/B testing plan checklist

Before launch

  • Hypothesis is specific, measurable, and grounded in evidence
  • Single variable isolated between control and variant
  • Primary metric defined with clear calculation method
  • Guardrail metrics defined with acceptable ranges
  • Sample size calculated and test duration estimated
  • Decision criteria pre-committed for all possible outcomes
  • QA completed on both variants across all targeted devices
  • No conflict with other active experiments

During the test

  • Do not peek at results before the planned analysis date
  • Monitor for technical issues (broken tracking, variant errors) without checking metric results
  • Document any external events that could affect results (marketing campaigns, outages, holidays)

After the test

  • Analyze at the pre-committed date, not when results look favorable
  • Apply the pre-committed decision criteria
  • Document results, learnings, and follow-up actions
  • Share findings with stakeholders regardless of outcome
  • Archive the completed plan for future reference

Frequently asked questions

How long should an A/B test run?

Until it reaches the pre-calculated sample size, with a maximum cap of 4 weeks. Running shorter risks false positives. Running longer introduces seasonality and external variable bias. If your traffic cannot reach the required sample in 4 weeks, increase the minimum detectable effect or allocate more traffic to the experiment.

What is a good minimum detectable effect?

It depends on the business impact. A 2% improvement on a high-traffic checkout page may represent millions in revenue. A 20% improvement on a low-traffic settings page may be negligible. Choose an MDE that represents the smallest change worth acting on for your specific context.

Can I run multiple A/B tests simultaneously?

Yes, as long as they target different parts of the product and user populations do not overlap. Running two tests on the same checkout flow simultaneously creates interaction effects that make both results unreliable. If tests must overlap, use a mutual exclusion framework.

What do I do when a test is inconclusive?

Document the learning (the change did not produce a detectable effect), consider whether the MDE was realistic, and decide whether to iterate on the variant or move to a different hypothesis. Inconclusive results are not failures. They tell you the change does not matter enough to detect, which is valuable information.

Should I A/B test everything?

No. A/B testing requires sufficient traffic and a measurable outcome. Use it for changes with clear success metrics and enough volume to reach statistical significance. For low-traffic pages, qualitative research methods like usability testing provide faster, richer insights.