AI product usability testing guide: hallucination handling, trust signals, and error recovery

Standard usability testing breaks when you apply it to AI products. The core assumption of traditional testing, that the software behaves consistently and the user is the variable, does not hold. AI products produce different outputs for the same input, generate confident responses that are factually wrong, and behave in ways that change over time as models are updated.

This guide is the tactical UXR playbook for running usability tests on AI products. It focuses on three testing areas that standard usability testing does not cover: how users handle hallucinations, how trust signals affect behavior, and whether users can recover from AI errors. These three areas determine whether an AI product succeeds or fails in production, and they require specific test designs, metrics, and analysis approaches.

For the broader research framework (trust calibration, mental model evolution, longitudinal research design), see our user research for AI products guide. For testing individual AI features within a product, see our AI feature testing guide.

Key takeaways

  • Every AI usability test session must include seeded errors. Testing only correct AI outputs produces misleading results because it does not reveal whether users can detect when the AI is wrong
  • Hallucination handling is not just a model problem. It is a UX problem. The interface must help users identify, question, and verify AI outputs, and testing must measure whether it succeeds
  • Trust signals (confidence scores, source citations, uncertainty indicators) change user behavior measurably. A/B test different signal types to find the combination that produces appropriate trust calibration
  • Error recovery testing is the highest-value component of AI usability testing because AI errors are expected, not exceptional. The question is not whether the AI will be wrong, but whether users can recover when it is
  • AI usability metrics differ from standard metrics. Add hallucination detection rate, trust calibration accuracy, and recovery success rate alongside traditional task completion and satisfaction measures

How to design AI usability test sessions

Session structure

AI usability sessions require a different structure than standard usability tests because you need to test user behavior with both correct and incorrect outputs.

Recommended session flow (45-60 minutes):

| Phase | Duration | Purpose | What to capture |
| --- | --- | --- | --- |
| Warm-up and baseline | 5 min | Understand participant’s AI experience, set expectations | Prior AI experience, baseline trust level (1-7 scale) |
| Correct output tasks | 15 min | Establish baseline usability with functioning AI | Task completion, time, satisfaction, initial trust formation |
| Error-seeded tasks | 15 min | Test hallucination detection, error recovery, trust impact | Detection rate, detection speed, recovery behavior, trust change |
| Trust signal comparison | 10 min | A/B test confidence indicators, citations, uncertainty markers | Preference, comprehension, behavioral change |
| Debrief interview | 10 min | Explore trust reasoning, error detection strategies, overall experience | Qualitative insights on trust formation and error handling |

Seeding errors: the critical design decision

You must seed 25-35% of tasks with incorrect AI outputs. This ratio is important:

  • Too few errors (<15%): Users never encounter failures and your test produces artificially positive results
  • Too many errors (>40%): Users lose trust in the test environment itself and stop engaging naturally
  • The sweet spot (25-35%): Users encounter enough errors to reveal their detection strategies without losing trust in the session
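The seeding ratio can be baked into a simple session-planning helper. A minimal sketch: the function name, the fixed RNG seed, and the task IDs are all illustrative, not part of the guide.

```python
import random

def seed_errors(tasks, error_rate=0.30, rng=None):
    """Assign seeded (incorrect) AI outputs to a random subset of tasks.

    An error_rate of 0.25-0.35 gives enough failures to reveal detection
    strategies without breaking trust in the session.
    """
    rng = rng or random.Random(7)  # fixed seed: same plan for every participant
    n_errors = max(2, round(len(tasks) * error_rate))  # >= 2 data points per session
    seeded = set(rng.sample(tasks, n_errors))
    return [(task, task in seeded) for task in tasks]

plan = seed_errors([f"task-{i}" for i in range(1, 11)])  # 10 tasks, 3 seeded
```

Using a fixed seed keeps the error placement identical across participants, so detection rates are comparable between sessions.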

Types of errors to seed:

| Error type | Example | What it tests | Difficulty to detect |
| --- | --- | --- | --- |
| Factual hallucination | AI states a false statistic or incorrect date | Whether users verify factual claims | Medium: requires domain knowledge |
| Plausible but wrong recommendation | AI suggests an action that sounds reasonable but is incorrect for the context | Whether users evaluate recommendations critically | Hard: requires contextual judgment |
| Subtle data error | AI analysis has a calculation error embedded in an otherwise correct report | Whether users check AI-generated data | Hard: requires careful reading |
| Confident wrong answer | AI responds with high confidence to a question it cannot know the answer to | Whether users calibrate trust based on confidence level | Medium: tests trust calibration |
| Outdated information | AI provides information that was correct 6 months ago but is now wrong | Whether users consider timeliness | Easy if they know the domain |
| Fabricated source | AI cites a source that does not exist | Whether users verify citations | Easy if they check, hard if they trust |

How to test hallucination handling

Hallucination handling testing measures whether your product’s interface helps users identify, question, and correct AI-generated content that is wrong.

What to test

Detection capability. Can users tell when the AI output is wrong? This depends on:

  • The user’s domain expertise (experts catch more errors)
  • The interface’s visual cues (confidence indicators, source links, uncertainty markers)
  • The output’s plausibility (plausible errors are harder to catch than obvious ones)

Verification behavior. When users suspect an error, what do they do?

  • Check the cited source (if sources are provided)
  • Cross-reference with another tool or their own knowledge
  • Ask the AI to re-generate or explain
  • Accept the output anyway because checking is too much effort

Correction workflow. When users confirm an error, can they:

  • Flag or report the hallucination
  • Edit the AI output directly
  • Request a new output
  • Revert to a manual workflow

Hallucination handling test protocol

Task 1: Baseline accuracy perception. Give users 5 AI outputs (4 correct, 1 with a factual error). Ask them to review each and rate their confidence. Do not tell them errors are present. Measure: detection rate, time to detection, confidence rating for the wrong output.

Task 2: Prompted verification. Give users 3 AI outputs and explicitly ask: “Are all of these correct? How would you verify?” Measure: verification strategy, tools used, time spent checking.

Task 3: Confidence-output mismatch. Show an AI output marked as “95% confident” that is actually wrong. Show another marked “60% confident” that is correct. Measure: does the confidence indicator help or mislead? Do users adjust behavior based on confidence levels?

Task 4: Citation verification. Show AI outputs with source citations. Include one output that cites a fabricated source. Measure: how many users click the citation link? Do they notice it does not exist or does not support the claim?

Hallucination handling metrics

| Metric | Definition | Target | How to calculate |
| --- | --- | --- | --- |
| Detection rate | % of seeded errors that users identify | >70% for domain experts, >40% for general users | Detected errors / total seeded errors |
| Detection latency | Time between seeing a wrong output and recognizing it | <60 seconds for inline content | Timestamp of error display to first corrective action |
| False alarm rate | % of correct outputs that users incorrectly flag as wrong | <10% | False flags / total correct outputs |
| Verification rate | % of AI outputs users actively verify (check source, cross-reference) | Varies by criticality | Verified outputs / total outputs |
| Correction success | % of detected errors that users successfully correct or route around | >85% | Successful corrections / detected errors |
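The rate metrics fall directly out of per-output session logs. A minimal sketch in Python; the log field names (`seeded`, `flagged`) are our own convention, not prescribed by the guide.

```python
def hallucination_metrics(outputs):
    """Detection rate and false alarm rate from per-output session logs.

    Each entry has `seeded` (True if the output was a seeded error) and
    `flagged` (True if the participant marked it as wrong).
    """
    seeded = [o for o in outputs if o["seeded"]]
    correct = [o for o in outputs if not o["seeded"]]
    return {
        # Detected errors / total seeded errors (target: >70% for experts)
        "detection_rate": sum(o["flagged"] for o in seeded) / len(seeded),
        # False flags / total correct outputs (target: <10%)
        "false_alarm_rate": sum(o["flagged"] for o in correct) / len(correct),
    }

log = (
    [{"seeded": True, "flagged": True}] * 3    # caught errors
    + [{"seeded": True, "flagged": False}]     # missed error
    + [{"seeded": False, "flagged": False}] * 7
    + [{"seeded": False, "flagged": True}]     # false alarm
)
metrics = hallucination_metrics(log)  # detection 0.75, false alarms 0.125
```

Tracking false alarms alongside detections matters: a participant who flags everything will score a perfect detection rate while being just as poorly calibrated as one who flags nothing.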

How to test trust signals

Trust signals are interface elements that communicate the AI’s reliability: confidence scores, source citations, uncertainty indicators, and transparency about the AI’s limitations. Testing reveals which signals help users calibrate trust appropriately.

Trust signal types to test

| Signal type | Example | Hypothesis | A/B test design |
| --- | --- | --- | --- |
| Confidence score | "85% confident" badge on AI output | Numeric confidence helps users calibrate trust | Show vs. hide confidence score. Measure acceptance rate and error detection for low-confidence outputs |
| Source citation | "Based on: [linked document]" | Citations increase verification and appropriate trust | With citations vs. without. Measure click-through rate and error detection |
| Uncertainty language | "I’m not certain, but…" vs. "The answer is…" | Hedging language reduces over-trust | Confident phrasing vs. uncertain phrasing for the same outputs. Measure acceptance rate |
| Model transparency | "Powered by GPT-4" or "AI-generated" labels | Knowing it is AI changes trust behavior | Labeled vs. unlabeled. Measure initial trust rating and verification behavior |
| Limitation disclosure | "I can help with X but not Y" up front | Setting expectations reduces disappointment | With vs. without limitation disclosure. Measure satisfaction after failure |
| Visual differentiation | AI content in a different color, font, or container | Visual distinction prompts evaluation | Differentiated vs. blended AI content. Measure error detection rate |
| Feedback mechanism | Thumbs up/down on AI outputs | Feedback controls give users agency and build trust | With vs. without feedback buttons. Measure trust score and engagement |

Trust signal testing protocol

Phase 1: No signals (baseline). Present AI outputs with no trust indicators. Measure baseline acceptance rate, verification behavior, and error detection.

Phase 2: Single signal. Add one trust signal (e.g., confidence score). Re-run the same task types. Measure change in acceptance rate and error detection.

Phase 3: Combined signals. Add multiple signals (confidence + citation + feedback). Measure whether combined signals improve trust calibration or create information overload.

Phase 4: Signal failure. Show a high confidence score on a wrong output. Measure: does the confidence score cause users to miss the error (over-trust) or do they catch it anyway?
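Phases 2-4 all reduce to comparing an acceptance or detection rate across signal conditions. At usability-test sample sizes a descriptive comparison is usually enough; this helper (the name and the sample numbers are ours) just reports the two rates and their difference.

```python
def signal_lift(baseline, with_signal):
    """Compare a rate (e.g. error detection) with vs. without a trust signal.

    Each argument is a (hits, trials) pair per condition, e.g.
    (errors detected, errors seeded). With typical usability-test sample
    sizes, report the direction and size of the difference, not significance.
    """
    p0 = baseline[0] / baseline[1]
    p1 = with_signal[0] / with_signal[1]
    return {"baseline": p0, "with_signal": p1, "lift": p1 - p0}

# 2 of 8 seeded errors caught without the signal, 6 of 8 with it
res = signal_lift(baseline=(2, 8), with_signal=(6, 8))
```

A negative `lift` in the Phase 4 scenario is the over-trust signature: the confidence badge actively suppressed detection.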

Trust calibration measurement

The goal is not high trust. It is appropriate trust. Measure the gap between user trust and AI accuracy:

  • User trust: Post-task trust rating (1-7 scale) for each AI output
  • AI accuracy: Ground truth accuracy for each output (known from your seeded design)
  • Calibration score: Correlation between user trust ratings and actual accuracy. Perfect calibration = users give high trust to correct outputs and low trust to incorrect ones

A product where users trust correct outputs at 6/7 and incorrect outputs at 2/7 has excellent calibration. A product where users rate everything at 5/7 regardless of accuracy has a calibration problem.

How to test error recovery

Error recovery testing measures whether users can complete their task after the AI fails. This is the highest-value component of AI usability testing because AI errors are not bugs to be eliminated. They are a permanent feature of probabilistic systems.

Error recovery test scenarios

| Scenario | What breaks | What recovery requires | What to measure |
| --- | --- | --- | --- |
| Wrong recommendation acted upon | User followed AI advice that was incorrect | Undo the action, find the correct answer, re-do the task | Time to recover, task completion after error, frustration level |
| AI cannot complete the task | AI output is insufficient or refuses to answer | Fall back to manual workflow or alternative tool | Is the fallback path clear? How much time does it add? |
| AI output degrades mid-task | AI works for first 3 steps, then produces garbage on step 4 | Salvage the partial work, complete the remaining task manually | Can users identify where the AI went wrong? Do they lose their work? |
| Error compounds | AI makes a subtle error early that causes cascading problems | Detect the root error, undo the cascade, restart from the correct point | Detection of root cause, ability to trace back, rework effort |
| Repeated errors | AI makes the same mistake after the user corrects it once | Find a permanent workaround or disable the AI feature | Does the user give up on the AI? How many retries before abandonment? |

Error recovery metrics

| Metric | Definition | Target |
| --- | --- | --- |
| Recovery success rate | % of error scenarios where users complete the task despite the AI error | >75% |
| Recovery time | Additional time needed to complete the task after encountering an error, compared to no-error baseline | <2x the no-error time |
| Post-error trust | Trust rating immediately after an error compared to pre-error baseline | Should drop moderately (healthy skepticism), not collapse (abandon the product) |
| Post-error CSAT | Satisfaction rating for the overall experience after encountering errors | 4+/5 if recovery was smooth, <3/5 if recovery failed |
| Re-engagement rate | Does the user continue using the AI feature after the error or switch to manual? | >60% continue using AI (with appropriate caution) |
| Error attribution | Does the user blame the AI, themselves, or the product? | Healthy: “The AI got it wrong.” Unhealthy: “I must have done something wrong” or “This product is broken” |
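The quantitative metrics roll up from per-participant error-scenario logs. A sketch of that roll-up; the dictionary keys and sample timings are illustrative assumptions, not a prescribed schema.

```python
def recovery_metrics(sessions):
    """Roll up per-participant error-recovery logs into session metrics.

    Each session dict holds `recovered` (task completed despite the error),
    `error_time` and `baseline_time` in seconds, and `kept_using_ai`
    (re-engaged with the AI feature afterwards).
    """
    n = len(sessions)
    return {
        "recovery_success_rate": sum(s["recovered"] for s in sessions) / n,   # target >75%
        "mean_recovery_time_ratio": sum(
            s["error_time"] / s["baseline_time"] for s in sessions
        ) / n,                                                                # target <2x
        "re_engagement_rate": sum(s["kept_using_ai"] for s in sessions) / n,  # target >60%
    }

sessions = [
    {"recovered": True,  "error_time": 180, "baseline_time": 120, "kept_using_ai": True},
    {"recovered": True,  "error_time": 240, "baseline_time": 120, "kept_using_ai": True},
    {"recovered": True,  "error_time": 120, "baseline_time": 120, "kept_using_ai": False},
    {"recovered": False, "error_time": 420, "baseline_time": 120, "kept_using_ai": False},
]
summary = recovery_metrics(sessions)  # success 0.75, time ratio 2.0, re-engagement 0.5
```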

The post-error debrief

After error scenarios, conduct a focused debrief:

  1. “Did you notice anything wrong with the AI’s output?” (tests unprompted detection)
  2. “At what point did you realize something was off?” (maps detection timeline)
  3. “What did you do when you found the error?” (maps recovery strategy)
  4. “How did the error change your trust in the AI?” (measures trust impact)
  5. “Would you continue using this AI feature after this experience?” (measures retention risk)
  6. “What would have helped you catch the error sooner?” (design improvement input)

How to analyze AI usability test results

The three-layer analysis framework

Layer 1: Standard usability analysis. Task completion, time-on-task, errors, satisfaction. Report these as you would for any usability test. They remain the baseline.

Layer 2: AI-specific behavioral analysis. Acceptance patterns, verification behavior, trust trajectory, error detection. Analyze these per-participant and look for behavioral segments: over-trusters (accept everything), healthy skeptics (verify appropriately), and under-trusters (reject everything).

Layer 3: Trust calibration analysis. Map user trust ratings against actual AI accuracy for each output. Calculate calibration scores. Identify which trust signals improved calibration and which created over-trust or under-trust.
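The Layer 2 behavioral segments can be operationalized from each participant's acceptance rates on correct versus seeded-wrong outputs. A minimal sketch; the 0.7 / 0.3 thresholds are illustrative cut-offs, not values prescribed by this guide.

```python
def segment_participant(acceptance_correct, acceptance_wrong):
    """Classify a participant from acceptance rates on correct vs.
    seeded-wrong outputs (illustrative 0.7 / 0.3 thresholds).
    """
    if acceptance_wrong > 0.3:
        return "over-truster"        # accepts wrong outputs too readily
    if acceptance_correct < 0.7:
        return "under-truster"       # rejects even correct outputs
    return "healthy skeptic"         # accepts correct, rejects wrong

segments = [
    segment_participant(0.9, 0.6),   # accepts almost everything
    segment_participant(0.4, 0.1),   # rejects almost everything
    segment_participant(0.9, 0.1),   # accepts correct, catches errors
]
```

Reporting the segment mix (e.g. "5 of 12 participants were over-trusters") is usually more actionable for design teams than the raw rates alone.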

Reporting AI usability findings

Include these sections in your report beyond standard usability findings:

  • Hallucination detection summary. What % of errors were caught? Which error types were hardest to detect? What interface elements helped or hindered detection?
  • Trust signal effectiveness. Which signals improved calibration? Which created over-trust? Recommend the signal combination that produced the best calibration
  • Error recovery assessment. Can users complete tasks after errors? Where does recovery break down? What design changes would improve recovery?
  • Behavioral segments. What % of participants were over-trusters, healthy skeptics, and under-trusters? How do segments differ by AI experience level?
  • Risk assessment. Based on error detection rates and over-trust patterns, what is the risk of users acting on incorrect AI outputs in production?

Frequently asked questions

How is this different from the AI feature testing guide?

The AI feature testing guide covers testing specific AI features (search, suggestions, categorization, chatbots) from a PM perspective with a focus on acceptance rates and task flows. This guide covers the usability testing methodology itself from a UXR perspective, going deep on the three areas that make AI usability testing unique: hallucination handling, trust signals, and error recovery. Use the feature guide to decide what to test. Use this guide to design how to test it.

How many seeded errors should you include per session?

Two to four seeded errors across 8-12 tasks (25-35% error rate). Fewer than two does not give enough data points per participant. More than four risks overwhelming the session and breaking the participant’s engagement with the test environment. Vary error types across factual, plausible-but-wrong, and confident-but-incorrect to test different detection skills.

Do you need AI/ML expertise to run AI usability tests?

No, but you need a pre-study briefing with the data science team. Understand: what are the model’s known failure modes? What is the actual accuracy rate? What confidence thresholds does the model use? This information lets you design realistic error scenarios and interpret findings correctly. You do not need to understand how the model works technically, only where it tends to fail and how confident it is when it does.

How do you prevent participants from gaming the error detection tasks?

Do not tell participants that errors are seeded. Frame the test as standard usability testing: “We are testing this AI-powered feature. Please complete these tasks as you normally would.” If participants know you are testing error detection, they will scrutinize every output artificially, which does not reflect real usage. During the debrief, you can reveal the error seeding and discuss their detection experience.

How often should you re-run AI usability testing?

After every model update that changes output behavior, and quarterly at minimum for live products. AI products degrade differently than traditional software: accuracy drifts, user expectations evolve, and trust patterns shift. A quarterly “trust health check” using a subset of your original test scenarios and metrics keeps you ahead of calibration drift.