
AB tests validate UI changes with real user behavior: show variants, measure outcomes, and pick the design that improves key metrics and reduces risk.
A/B testing, also called split testing, compares two versions of a user interface by showing each to different user groups and measuring performance against specific goals. This guide explains how to use A/B testing to optimize user interfaces. It is written for designers, product managers, and UX researchers who want to make data-driven decisions that improve user experience and engagement. By the end, you’ll know how to set up, run, and analyze A/B tests to improve your UI and achieve better outcomes for your users and business.
At its core, A/B testing compares two versions of a design element, such as a button or layout, to determine which one leads to higher engagement, conversions, or other desired outcomes. It changes a single variable while keeping all other factors constant, isolating the effect of that change so teams can make informed decisions about which design changes will have the most positive impact on user experience. Applied to UI, it optimizes elements like layout, colors, and buttons to enhance usability, and it enables data-driven UX decisions by analyzing real user behavior and interactions with design variations.
Version A (control) is the current design; Version B (variant) is the proposed change. Users are split evenly between the two, and key metrics like conversion or engagement rates are measured.
Unlike prototype or lab tests, AB testing collects quantitative data from live user behavior, providing actionable insights for UI/UX improvements. For example, Dropbox tested signup button colors—green outperformed blue by 12%, leading to a data-driven design choice. AB testing is also widely used in digital marketing to boost engagement and conversions across various channels.
Research teams use UI AB tests as a key part of UX research to:
Validate design changes before full rollout
Choose between competing design directions
Optimize high-impact interfaces like checkout or signup
Settle debates with evidence instead of opinions
Measure cumulative impact of small improvements
Integrate A/B testing into the UX design process for data-driven improvements
A/B testing helps organizations improve their offerings and deliver a better user experience. It also provides clear, quantifiable results that are easy to communicate to stakeholders and team members.
A/B testing is most effective for mature products with sufficient traffic and clear goals. It helps optimize UI elements by leveraging user data to improve engagement and adapt to changing preferences. Use it as part of a broader research strategy for continuous improvement and competitive benchmarking.
A/B testing works best when you have:
High-traffic pages or features: Enough users to reach statistical significance quickly.
Example: Stripe continuously tests their checkout flow, where small improvements have big impacts.
Clear, measurable goals: Define specific KPIs like conversion rate or click-through rate.
Small, focused changes: Test individual elements like button colors or copy rather than complete redesigns.
Stable baselines: Mature products with consistent metrics.
A/B testing is a poor fit for:
Major redesigns: Too many simultaneous changes obscure which caused results.
Early-stage exploratory work: Better suited for qualitative feedback.
Low-traffic features: Insufficient data for meaningful results.
Branding or long-term perception changes: Short-term tests miss lasting effects.
Features requiring user learning: Short tests may penalize unfamiliar interfaces.
Example: Notion uses prototypes and gradual rollouts for major changes instead of AB testing.
Before starting the testing process, it is essential to define clear objectives to ensure your A/B testing UI efforts are focused and measurable.
In addition to A/B testing, consider multivariate testing as an alternative method. Multivariate testing allows you to compare multiple design variations simultaneously, helping you optimize user interface elements and validate design hypotheses.
Good tests start with clear setup.
Choosing the right testing tool is vital for conducting effective A/B tests. Make sure your A/B testing solution provides analytics that can track multiple metric types and connect to your data warehouse for deeper insights.
Identify the specific UI element or workflow you want to optimize.
Set clear, measurable goals for what you want to achieve (e.g., increase signups, reduce errors).
Built-in platform tools offer comprehensive features, visual editors, and statistical engines, making them suitable for regular testing programs; however, they can be expensive and require implementation.
Product analytics with experiments are integrated with analytics platforms, making them good for product teams already using these platforms, though they are less flexible than dedicated A/B testing tools.
Feature flag platforms are developer-friendly, support gradual rollouts, and provide good targeting capabilities; they require engineering resources and are best suited for engineering-heavy teams.
Examples of tools include:
Google Optimize (free; discontinued by Google in 2023), Optimizely, VWO, Adobe Target
Amplitude Experiment, Mixpanel Experiments, PostHog Experiments
LaunchDarkly, Split.io, Statsig
State what you’re testing and why you think it’ll improve things.
Bad hypothesis: “New button will be better”
Good hypothesis: “Changing the signup button from ‘Submit’ to ‘Create My Account’ will increase signup completion by making the outcome clearer, reducing anxiety about commitment”
The good version explains what’s changing, which metric should improve, and why. That clarity guides you toward designing variations that actually address your hypothesis.
Pick one primary metric that indicates success. You can track secondary metrics too, but one should drive decisions.
Primary metrics examples:
Conversion rate (% who complete desired action)
Click-through rate (% who click on element)
Task completion rate (% who finish workflow)
Time to complete (how long task takes)
Error rate (% who make mistakes)
Sign ups (number of users who register or create an account)
Example: Calendly’s primary metric for signup flow tests is the percentage of visitors who complete signup. Secondary metrics: time to signup, fields left blank. Testing the signup button itself (placement, color, or copy) can directly move this metric.
You need enough users to confidently detect real differences from random noise and to ensure your results are statistically significant.
Use online calculators (Optimizely, VWO, Evan Miller’s calculator) to determine sample size based on:
Baseline conversion rate (current performance)
Minimum detectable effect (smallest improvement worth detecting)
Statistical significance level (typically 95%)
Statistical power (typically 80%)
A sufficiently large sample size is necessary to achieve statistically significant results and draw reliable conclusions from your AB testing UI experiments.
Example: If your signup converts at 10% and you want to detect a 2% improvement (to 12%), you need about 3,900 users per variation. That’s 7,800 total users.
Don’t start tests without calculating this. Running tests with insufficient sample size means you cannot obtain statistically significant findings, which wastes time and produces unreliable results.
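As a rough illustration of how these calculators work (not a substitute for them), here is a minimal sketch of the standard two-proportion sample size formula; the function name and defaults are illustrative, and the numbers mirror the example above.

```python
# Sketch of a two-proportion sample size calculation; results are close to,
# but may not exactly match, what online calculators report.
from math import ceil, sqrt
from scipy.stats import norm

def sample_size_per_variation(baseline, mde, alpha=0.05, power=0.80):
    """Users needed per variation to detect an absolute lift of `mde`."""
    p1, p2 = baseline, baseline + mde
    p_bar = (p1 + p2) / 2
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance level
    z_beta = norm.ppf(power)            # statistical power
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# 10% baseline, detect a 2-point lift to 12%: roughly 3,800-3,900 per variation
print(sample_size_per_variation(0.10, 0.02))
```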
Test one thing at a time when possible. Changing button color and copy simultaneously makes it unclear which drove results. AB testing UI typically involves comparing two or more variations, including a control version (the original design) and one or more variants, to see which performs best.
Exception: Sometimes you need to test multiple related changes together. A form redesign might change field order, labels, and help text simultaneously. That’s fine if you’re testing the complete form approach.
Keep context consistent. Don’t test button colors while also running major promotional campaigns that affect traffic. External factors confound results.
Document everything. Screenshot both versions. Note exactly what differs. Store your test data for later analysis and accurate comparison. You’ll need this for analysis and implementation.
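Whether you keep it in a wiki or alongside your code, capturing each test as a structured record makes later comparison easier. A minimal sketch, with hypothetical field names rather than any tool's required schema:

```python
# Illustrative experiment log entry; adapt the fields to your own process.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ExperimentRecord:
    name: str
    hypothesis: str
    primary_metric: str
    variations: dict            # what exactly differs between versions
    start_date: date
    planned_sample_per_arm: int
    screenshots: list = field(default_factory=list)   # paths to saved images

record = ExperimentRecord(
    name="signup-button-copy",
    hypothesis="'Create My Account' clarifies the outcome and lifts completion",
    primary_metric="signup_completion_rate",
    variations={"control": "Submit", "variant": "Create My Account"},
    start_date=date(2025, 1, 6),
    planned_sample_per_arm=3841,
)
```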
Examples of what to test include:
Call to action buttons
Different images
Navigation layouts
Microcopy
Icon colors
Example: Linear tested two navigation layouts. Version A kept their sidebar navigation. Version B moved it to a top bar. They documented every UI change between versions for later reference.
Once setup is complete, running tests requires discipline. To ensure unbiased results, randomly assign users to the different variations, with each group experiencing only one version. Random assignment is what makes the results statistically reliable.
Continuous testing allows teams to iteratively validate and refine user interfaces, leading to ongoing optimization of engagement and conversion rates.
50/50 splits are standard. Half see control, half see variant. Equal splits give fastest results. Tests can involve two or more versions of a design element, allowing you to compare multiple variations at once.
Exceptions: Sometimes you want conservative rollouts. 90/10 splits (90% control, 10% variant) test changes with less risk but take longer to reach significance.
Randomization matters. Users should be randomly assigned to variations. Most AB testing tools handle this automatically.
Consistency matters more. Once a user sees version A, they should keep seeing version A throughout the test. Switching mid-test confuses results.
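One common way to get both randomization and consistency is deterministic bucketing: hash the user ID together with the experiment name so the same user always lands in the same variation. Most A/B testing tools do this for you; the sketch below is only illustrative, and the names are hypothetical.

```python
# Deterministic 50/50 assignment: stable per user, roughly uniform overall.
import hashlib

def assign_variation(user_id: str, experiment: str, split: float = 0.5) -> str:
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000   # map hash prefix to [0, 1)
    return "control" if bucket < split else "variant"

# The same user gets the same variation on every visit
assert assign_variation("user-42", "signup-button-copy") == \
       assign_variation("user-42", "signup-button-copy")
```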
Run tests until reaching statistical significance based on your sample size calculation. Don’t stop early because results look good.
Minimum duration: 1-2 weeks even if you hit sample size faster. This accounts for day-of-week effects. Monday users might behave differently than Saturday users.
Watch for novelty effects. Sometimes new designs perform better initially just because they’re different. This fades after users acclimate. Run tests long enough to see sustained effects.
Example: Figma runs UI tests for minimum two weeks regardless of statistical significance. They’ve seen too many “wins” that reversed after novelty wore off.
A/B testing UI is part of an iterative process—each test informs the next round of design and experimentation, driving continuous improvement.
Check for technical problems but don’t obsess over results.
Daily checks:
Both variations displaying correctly
Tracking firing properly
No error spikes
Traffic split maintaining 50/50
Monitoring real user behavior to ensure accurate test results
Don’t check statistical significance daily. This leads to stopping tests prematurely. Set it and forget it until the planned end date.
Do watch for disasters. If variant crashes, breaks functionality, or causes obvious problems, stop the test immediately.
Once your test completes, analyze systematically. Use statistical analysis to interpret the results, ensuring that your findings are significant and reliable. This approach supports evidence-based decision making, helping you move beyond assumptions and subjective opinions.
A/B testing provides data-driven insights and valuable insights that help optimize UI design, improve user experience, and drive better engagement.
Your AB testing tool calculates this, but understand what it means.
A p-value below 0.05 means there is less than a 5% chance of seeing a difference this large if the variations actually performed the same. This is the standard significance threshold.
Confidence interval shows the range where the true difference likely falls. "Variant B increased conversion by 8-15% with 95% confidence" is more informative than "B won."
Don't declare winners prematurely. Reaching significance on day 3 of a planned 14-day test doesn't mean stopping early. Run the full duration.
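To make the p-value and confidence interval concrete, here is a minimal sketch of a two-proportion z-test written out with plain formulas; the conversion counts are hypothetical, and your testing tool reports the same quantities for you.

```python
# Two-proportion z-test plus a Wald confidence interval for the difference.
from math import sqrt
from scipy.stats import norm

def compare_proportions(conv_a, n_a, conv_b, n_b, confidence=0.95):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled z-test: is there any difference at all?
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    p_value = 2 * (1 - norm.cdf(abs((p_b - p_a) / se_pool)))
    # Unpooled interval: how big is the difference likely to be?
    se_diff = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = norm.ppf(1 - (1 - confidence) / 2)
    ci = (p_b - p_a - z_crit * se_diff, p_b - p_a + z_crit * se_diff)
    return p_value, ci

# Hypothetical counts: 400/3,900 control conversions vs. 470/3,900 variant
print(compare_proportions(400, 3900, 470, 3900))
```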
Your variant might improve the primary metric but hurt others.
Example: Dropbox tested a more prominent upgrade button. It increased upgrade clicks (the primary metric) but also increased confusion and support tickets (secondary metrics). That secondary effect made them reconsider the change.
Check for:
Unintended consequences (higher bounce rates, more errors)
Segment differences (works for new users but confuses existing ones)
Long-term effects (increased signups but lower retention)
Overall results might hide important patterns. Analyzing results for different user segments can uncover hidden trends that are missed in aggregate data.
Segment by:
New vs. returning users
Device type (mobile vs. desktop)
Traffic source (paid vs. organic)
User characteristics (plan type, company size)
Example: Notion tested a new onboarding flow. Overall, it decreased completion rates slightly. But for users from paid marketing campaigns (their most valuable traffic), it increased completion by 20%. They shipped it for paid traffic only.
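If you can export raw test data, a per-segment readout can be as simple as a grouped aggregation. The sketch below assumes a table with variation, segment, and converted columns; the column names and rows are made up for illustration.

```python
# Conversion rate per variation within each segment, using pandas.
import pandas as pd

df = pd.DataFrame({
    "variation": ["control", "variant", "control", "variant"],
    "segment":   ["mobile",  "mobile",  "desktop", "desktop"],
    "converted": [0, 1, 1, 0],
})
summary = (df.groupby(["segment", "variation"])["converted"]
             .agg(users="count", conversion_rate="mean"))
print(summary)
```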
Multiple comparisons problem: Testing 20 metrics increases chances of finding false positives. Stick to your predetermined primary metric.
Peeking at results: Checking significance repeatedly and stopping when you hit it creates false positives. Wait for planned completion.
Sample ratio mismatch: If your 50/50 split ends up 48/52, something's wrong with randomization. Investigate before trusting results; a quick check is sketched after this list.
Novelty effects: Initial lifts that fade over time. Week 1 shows improvement, week 3 shows regression to baseline.
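A sample ratio mismatch is easy to check with a chi-square goodness-of-fit test against the intended split; the assignment counts below are hypothetical.

```python
# SRM check: compare observed assignment counts to the planned 50/50 split.
from scipy.stats import chisquare

control_users, variant_users = 10_450, 9_550   # observed counts per variation
total = control_users + variant_users
stat, p_value = chisquare([control_users, variant_users],
                          f_exp=[total / 2, total / 2])
if p_value < 0.001:   # a very small p-value suggests broken randomization
    print(f"Possible sample ratio mismatch (p={p_value:.2g}); investigate first")
```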
Different UI elements benefit from different testing approaches. A/B testing design elements such as font sizes, colors, and layout choices lets teams analyze user behavior and engagement, and optimize how users experience and respond to the interface, leading to better outcomes.
Buttons and calls to action
What to test:
Copy ("Sign Up" vs. "Get Started" vs. "Try Free")
Color (primary brand color vs. high-contrast alternatives)
Size (prominent vs. subtle)
Placement (above fold vs. below, left vs. right)
Example: Superhuman tested email archive button placement. Moving it from a dropdown menu to a prominent button increased archiving by 35%.
Testing tips:
Button tests usually reach significance quickly because every visitor sees buttons.
Good for learning AB testing mechanics.
Forms
What to test:
Number of fields (long vs. short forms)
Field labels (above vs. inline)
Required vs. optional fields
Multi-step vs. single-page
Placeholder text and help copy
Example: Stripe tested requiring billing address at signup vs. making it optional. Optional fields increased signups 15% but created downstream problems with fraud. They kept required fields despite the conversion hit.
Testing tips:
Form tests need higher sample sizes because only a subset of visitors interact with forms.
Copy and messaging
What to test:
Benefit-focused vs. feature-focused
Length (concise vs. detailed)
Tone (formal vs. casual)
Personalization (generic vs. tailored)
Example: Calendly tested homepage headlines. "Scheduling made easy" converted 8% worse than "Easy scheduling for professionals." The specificity about target audience mattered.
Testing tips:
Copy tests are quick wins.
Easy to implement, fast to test, meaningful impact.
Layout
What to test:
Single column vs. multi-column
White space and density
Image placement and size
Content order and hierarchy
Example: Linear tested spacing in their issue list. Tighter spacing (showing more issues per screen) decreased click-through rate because users couldn't scan effectively. More white space won.
Testing tips:
Layout tests often need longer duration because effects are subtle.
Navigation
What to test:
Menu labels and grouping
Visibility (always visible vs. hidden)
Mega menus vs. simple dropdowns
Number of top-level items
To optimize these factors based on actual user behavior and preferences, pair navigation tests with broader user research.
Testing caution:
Navigation changes affect site-wide behavior.
Isolate specific pages when possible or accept longer test durations for site-wide changes.
Not all tests produce clear winners. Ambiguous outcomes still offer valuable insights that can inform future design improvements and guide the next steps in your UX optimization.
Sometimes neither version wins. This means:
Variations weren't different enough to matter
Your hypothesis was wrong
Sample size was too small
Metric wasn't sensitive to the change
What to do:
Ship whichever version is easier to maintain, or stick with control. Don't keep testing forever hoping for significance.
Example: Notion tested two onboarding copy variations. After 30,000 users, no significant difference appeared. They kept the control and moved on.
Your new design performed worse than the original. This happens often and it's fine.
What to do:
Keep the control. Learn why variant lost. Sometimes failed tests reveal insights more valuable than wins.
Example: Dropbox tested a minimalist signup form removing all explanatory text. It decreased signups 18%. They learned users needed context to understand value before committing.
Variant wins overall but loses for important segments, or vice versa.
What to do:
Consider targeted rollouts. Show different versions to different user types.
Variant shows massive improvement (50%+ lift) that seems unlikely.
What to do:
Double-check for bugs, technical issues, or external factors. Massive wins are rare. Usually something's wrong with the test.
After finding a winner, implement carefully.
Rolling out the winning UI variation can have a positive impact on key metrics and overall business outcomes, as A/B testing helps ensure that changes are data-driven and beneficial.
Don't immediately switch 100% of users. Roll out gradually:
Week 1: 25% of users
Week 2: 50% of users
Week 3: 100% of users
Monitor metrics during rollout. Sometimes test results don't replicate at scale.
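Feature flag tools handle gradual rollouts for you, but the underlying idea is the same hashing trick used for assignment: gate each user on a rollout percentage that you raise week by week. A minimal sketch, with hypothetical names:

```python
# Percentage-based rollout gate; raise rollout_pct as the weeks progress.
import hashlib

def in_rollout(user_id: str, feature: str, rollout_pct: float) -> bool:
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000   # map hash prefix to [0, 1)
    return bucket < rollout_pct

# Week 1: 25% of users see the winning variation
print(in_rollout("user-42", "new-signup-button", 0.25))
```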
Record what you tested, why, results, and decisions made.
Example: Linear maintains a testing wiki documenting every UI test. When debating similar changes later, they reference past tests avoiding repeated mistakes.
Winning variations often inspire followup tests. A better button might prompt testing better placement, better copy, or better surrounding design.
Example: Stripe's checkout optimization isn't one test. It's 100+ incremental tests building on each other.
Effective A/B testing is an ongoing practice, not a series of occasional experiments. It applies to any digital product, from websites and apps to e-commerce platforms, and improves outcomes by steadily refining user experience and driving engagement.
To build a successful AB testing workflow, it’s important to consistently collect data at every stage of the process. This ensures you gather actionable insights from user interactions with different design variations, enabling informed decisions that lead to better results.
Don't randomly test whatever. Plan tests based on:
High-impact pages (checkout, signup, core features)
Strategic priorities (increasing trial conversions, reducing churn)
Example: Calendly plans quarterly testing roadmaps. Each quarter targets 10-12 tests on their highest-leverage pages.
A/B testing shows what happens. User research explains why.
Workflow:
User research identifies friction points.
Design solutions to address issues.
A/B test validates which solution works best.
Follow-up research explains why winner succeeded.
Example: Figma doesn't AB test blindly. They first do usability testing identifying problems, design solutions, then AB test to pick winners.
Share:
Regular testing newsletters
Slack/email updates on completed tests
Public dashboards tracking win/loss ratios
Learnings that inform future design work
Not every test wins. Good teams win 30-40% of tests. That's healthy. Higher win rates suggest testing isn't ambitious enough.
There are several types of tools available for A/B testing, each suited to different needs. Built-in platform tools offer comprehensive features, including visual editors and statistical engines, making them suitable for regular testing programs. However, these tools can be expensive and may require significant implementation efforts. Product analytics platforms with experiment capabilities are integrated with existing analytics and are good for product teams who want to run experiments alongside their current analytics setup. These tools tend to be less flexible than dedicated A/B testing tools but are convenient for teams already using these platforms. Feature flag platforms are developer-friendly and support gradual rollouts and precise targeting capabilities. They often require engineering resources to implement and manage, making them ideal for engineering-heavy teams.
Example: Notion uses LaunchDarkly for gradual feature rollouts combined with their own analytics for measuring impact.
If you've never AB tested:
Test 1: Pick a high-traffic page (homepage, signup). Test button copy. This teaches mechanics with fast results.
Test 2: Test something based on user feedback or support tickets. This teaches using qualitative input for test ideas.
Test 3: Test a design change you're considering anyway. This teaches making product decisions with data.
After 3-5 tests, you'll understand the rhythm and can scale up.
Example: Webflow started with homepage button tests. After seeing data-driven decisions work, they built a full testing program running 2-3 tests monthly.
A/B testing isn't magic. It's a tool for validating design decisions with real behavior.
It won't tell you what to build. It won't replace design judgment. It won't solve bad product-market fit.
It will help you optimize within an established direction, settle design debates with evidence, and incrementally improve user experience.
Used well, AB testing makes products measurably better. Used poorly, it creates false confidence in bad decisions or endless optimization of irrelevant details.
Test things that matter. Accept that many tests fail. Learn from everything. That's how you build great products.