AI product usability testing guide: hallucination handling, trust signals, and error recovery
How to run usability testing for AI products. Covers hallucination detection testing, trust signal evaluation, error recovery testing, session design with seeded errors, and AI-specific usability metrics for UX researchers.
Standard usability testing breaks when you apply it to AI products. The core assumption of traditional testing, that the software behaves consistently and the user is the variable, does not hold. AI products produce different outputs for the same input, generate confident responses that are factually wrong, and behave in ways that change over time as models are updated.
This guide is the tactical UXR playbook for running usability tests on AI products. It focuses on three testing areas that standard usability testing does not cover: how users handle hallucinations, how trust signals affect behavior, and whether users can recover from AI errors. These three areas determine whether an AI product succeeds or fails in production, and they require specific test designs, metrics, and analysis approaches.
For the broader research framework (trust calibration, mental model evolution, longitudinal research design), see our user research for AI products guide. For testing individual AI features within a product, see our AI feature testing guide.
Key takeaways
- Every AI usability test session must include seeded errors. Testing only correct AI outputs produces misleading results because it does not reveal whether users can detect when the AI is wrong
- Hallucination handling is not just a model problem. It is a UX problem. The interface must help users identify, question, and verify AI outputs, and testing must measure whether it succeeds
- Trust signals (confidence scores, source citations, uncertainty indicators) change user behavior measurably. A/B test different signal types to find the combination that produces appropriate trust calibration
- Error recovery testing is the highest-value component of AI usability testing because AI errors are expected, not exceptional. The question is not whether the AI will be wrong, but whether users can recover when it is
- AI usability metrics differ from standard metrics. Add hallucination detection rate, trust calibration accuracy, and recovery success rate alongside traditional task completion and satisfaction measures
How to design AI usability test sessions
Session structure
AI usability sessions require a different structure than standard usability tests because you need to test user behavior with both correct and incorrect outputs.
Recommended session flow (45-60 minutes):
| Phase | Duration | Purpose | What to capture |
|---|---|---|---|
| Warm-up and baseline | 5 min | Understand participant’s AI experience, set expectations | Prior AI experience, baseline trust level (1-7 scale) |
| Correct output tasks | 15 min | Establish baseline usability with functioning AI | Task completion, time, satisfaction, initial trust formation |
| Error-seeded tasks | 15 min | Test hallucination detection, error recovery, trust impact | Detection rate, detection speed, recovery behavior, trust change |
| Trust signal comparison | 10 min | A/B test confidence indicators, citations, uncertainty markers | Preference, comprehension, behavioral change |
| Debrief interview | 10 min | Explore trust reasoning, error detection strategies, overall experience | Qualitative insights on trust formation and error handling |
Seeding errors: the critical design decision
You must seed 25-35% of tasks with incorrect AI outputs. This ratio is important:
- Too few errors (<15%): Users never encounter failures and your test produces artificially positive results
- Too many errors (>40%): Users lose trust in the test environment itself and stop engaging naturally
- The sweet spot (25-35%): Users encounter enough errors to reveal their detection strategies without losing trust in the session
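The seeding decision can be scripted when assembling a session plan, which keeps the ratio consistent across participants. A minimal sketch, where the task IDs, error-type names, and the `seed_session` helper are all illustrative, not a real tool:

```python
import random

# Illustrative error types drawn from the table below; names are assumptions.
ERROR_TYPES = [
    "factual_hallucination",
    "plausible_wrong_recommendation",
    "confident_wrong_answer",
]

def seed_session(tasks, error_ratio=0.3, rng=None):
    """Map each task to a seeded error type, or None if it stays correct."""
    rng = rng or random.Random()
    # Hold the ratio inside the 25-35% band, with at least 2 errors
    # so each participant yields more than one detection data point.
    n_errors = max(2, round(len(tasks) * error_ratio))
    seeded = set(rng.sample(range(len(tasks)), n_errors))
    plan = {}
    for i, task in enumerate(tasks):
        # Rotate error types so one participant sees varied failure modes.
        plan[task] = ERROR_TYPES[i % len(ERROR_TYPES)] if i in seeded else None
    return plan

tasks = [f"task_{n}" for n in range(1, 11)]  # a 10-task session
plan = seed_session(tasks, error_ratio=0.3, rng=random.Random(7))
seeded_count = sum(1 for v in plan.values() if v)  # 3 of 10 tasks seeded
```

Randomizing which tasks carry errors per participant also prevents order effects: no single task position is always the "broken" one.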
Types of errors to seed:
| Error type | Example | What it tests | Difficulty to detect |
|---|---|---|---|
| Factual hallucination | AI states a false statistic or incorrect date | Whether users verify factual claims | Medium: requires domain knowledge |
| Plausible but wrong recommendation | AI suggests an action that sounds reasonable but is incorrect for the context | Whether users evaluate recommendations critically | Hard: requires contextual judgment |
| Subtle data error | AI analysis has a calculation error embedded in an otherwise correct report | Whether users check AI-generated data | Hard: requires careful reading |
| Confident wrong answer | AI responds with high confidence to a question it cannot know the answer to | Whether users calibrate trust based on confidence level | Medium: tests trust calibration |
| Outdated information | AI provides information that was correct 6 months ago but is now wrong | Whether users consider timeliness | Easy if they know the domain |
| Fabricated source | AI cites a source that does not exist | Whether users verify citations | Easy if they check, hard if they trust |
How to test hallucination handling
Hallucination handling testing measures whether your product’s interface helps users identify, question, and correct AI-generated content that is wrong.
What to test
Detection capability. Can users tell when the AI output is wrong? This depends on:
- The user’s domain expertise (experts catch more errors)
- The interface’s visual cues (confidence indicators, source links, uncertainty markers)
- The output’s plausibility (plausible errors are harder to catch than obvious ones)
Verification behavior. When users suspect an error, what do they do?
- Check the cited source (if sources are provided)
- Cross-reference with another tool or their own knowledge
- Ask the AI to re-generate or explain
- Accept the output anyway because checking is too much effort
Correction workflow. When users confirm an error, can they:
- Flag or report the hallucination
- Edit the AI output directly
- Request a new output
- Revert to a manual workflow
Hallucination handling test protocol
Task 1: Baseline accuracy perception. Give users 5 AI outputs (4 correct, 1 with a factual error). Ask them to review each and rate their confidence. Do not tell them errors are present. Measure: detection rate, time to detection, confidence rating for the wrong output.
Task 2: Prompted verification. Give users 3 AI outputs and explicitly ask: “Are all of these correct? How would you verify?” Measure: verification strategy, tools used, time spent checking.
Task 3: Confidence-output mismatch. Show an AI output marked as “95% confident” that is actually wrong. Show another marked “60% confident” that is correct. Measure: does the confidence indicator help or mislead? Do users adjust behavior based on confidence levels?
Task 4: Citation verification. Show AI outputs with source citations. Include one output that cites a fabricated source. Measure: how many users click the citation link? Do they notice it does not exist or does not support the claim?
Hallucination handling metrics
| Metric | Definition | Target | How to calculate |
|---|---|---|---|
| Detection rate | % of seeded errors that users identify | >70% for domain experts, >40% for general users | Detected errors / total seeded errors |
| Detection latency | Time between seeing a wrong output and recognizing it | <60 seconds for inline content | Timestamp of error display to first corrective action |
| False alarm rate | % of correct outputs that users incorrectly flag as wrong | <10% | False flags / total correct outputs |
| Verification rate | % of AI outputs users actively verify (check source, cross-reference) | Varies by criticality | Verified outputs / total outputs |
| Correction success | % of detected errors that users successfully correct or route around | >85% | Successful corrections / detected errors |
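The first three metrics in the table fall directly out of per-output session logs. A minimal sketch, assuming hypothetical record fields (`seeded_error`, `flagged`, `latency_s`) and illustrative data:

```python
# Each record is one AI output shown during the session:
#   seeded_error - was this output a seeded error?
#   flagged      - did the participant identify it as wrong?
#   latency_s    - seconds from display to first corrective action
records = [
    {"seeded_error": True,  "flagged": True,  "latency_s": 42},
    {"seeded_error": True,  "flagged": False, "latency_s": None},  # missed error
    {"seeded_error": True,  "flagged": True,  "latency_s": 75},
    {"seeded_error": False, "flagged": False, "latency_s": None},
    {"seeded_error": False, "flagged": True,  "latency_s": None},  # false alarm
    {"seeded_error": False, "flagged": False, "latency_s": None},
]

errors  = [r for r in records if r["seeded_error"]]
correct = [r for r in records if not r["seeded_error"]]

detection_rate   = sum(r["flagged"] for r in errors) / len(errors)    # target >0.70 (experts)
false_alarm_rate = sum(r["flagged"] for r in correct) / len(correct)  # target <0.10
latencies = [r["latency_s"] for r in errors if r["flagged"]]
mean_latency_s = sum(latencies) / len(latencies)                      # target <60s inline
```

Report detection rate alongside the false alarm rate: a participant who flags everything hits 100% detection but is not actually detecting anything.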
How to test trust signals
Trust signals are interface elements that communicate the AI’s reliability: confidence scores, source citations, uncertainty indicators, and transparency about the AI’s limitations. Testing reveals which signals help users calibrate trust appropriately.
Trust signal types to test
| Signal type | Example | Hypothesis | A/B test design |
|---|---|---|---|
| Confidence score | “85% confident” badge on AI output | Numeric confidence helps users calibrate trust | Show vs. hide confidence score. Measure acceptance rate and error detection for low-confidence outputs |
| Source citation | “Based on: [linked document]” | Citations increase verification and appropriate trust | With citations vs. without. Measure click-through rate and error detection |
| Uncertainty language | “I’m not certain, but…” vs. “The answer is…” | Hedging language reduces over-trust | Confident phrasing vs. uncertain phrasing for the same outputs. Measure acceptance rate |
| Model transparency | “Powered by GPT-4” or “AI-generated” labels | Knowing it is AI changes trust behavior | Labeled vs. unlabeled. Measure initial trust rating and verification behavior |
| Limitation disclosure | “I can help with X but not Y” up front | Setting expectations reduces disappointment | With vs. without limitation disclosure. Measure satisfaction after failure |

| Visual differentiation | AI content in a different color, font, or container | Visual distinction prompts evaluation | Differentiated vs. blended AI content. Measure error detection rate |
| Feedback mechanism | Thumbs up/down on AI outputs | Feedback controls give users agency and build trust | With vs. without feedback buttons. Measure trust score and engagement |
Trust signal testing protocol
Phase 1: No signals (baseline). Present AI outputs with no trust indicators. Measure baseline acceptance rate, verification behavior, and error detection.
Phase 2: Single signal. Add one trust signal (e.g., confidence score). Re-run the same task types. Measure change in acceptance rate and error detection.
Phase 3: Combined signals. Add multiple signals (confidence + citation + feedback). Measure whether combined signals improve trust calibration or create information overload.
Phase 4: Signal failure. Show a high confidence score on a wrong output. Measure: does the confidence score cause users to miss the error (over-trust) or do they catch it anyway?
Trust calibration measurement
The goal is not high trust. It is appropriate trust. Measure the gap between user trust and AI accuracy:
- User trust: Post-task trust rating (1-7 scale) for each AI output
- AI accuracy: Ground truth accuracy for each output (known from your seeded design)
- Calibration score: Correlation between user trust ratings and actual accuracy. Perfect calibration = users give high trust to correct outputs and low trust to incorrect ones
A product where users trust correct outputs at 6/7 and incorrect outputs at 2/7 has excellent calibration. A product where users rate everything at 5/7 regardless of accuracy has a calibration problem.
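The calibration score described above is simply a correlation between per-output trust ratings and ground-truth accuracy. A self-contained sketch with illustrative ratings (the participant data is invented for demonstration):

```python
def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Ground truth from the seeded design: 1 = correct output, 0 = seeded error.
accuracy = [1, 1, 1, 0, 1, 0, 1, 0]

# Well-calibrated participant: high trust on correct outputs, low on errors.
trust = [6, 6, 7, 2, 6, 2, 5, 3]
calibration = pearson(trust, accuracy)          # close to +1.0

# Poorly calibrated participant: rates everything ~4-5 regardless of accuracy.
flat_trust = [5, 5, 4, 5, 5, 5, 4, 4]
flat_calibration = pearson(flat_trust, accuracy)  # near 0
```

Because accuracy is binary, this is the point-biserial correlation; with the small per-participant sample sizes typical of usability testing, treat the score as a descriptive indicator rather than an inferential statistic.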
How to test error recovery
Error recovery testing measures whether users can complete their task after the AI fails. This is the highest-value component of AI usability testing because AI errors are not bugs to be eliminated. They are a permanent feature of probabilistic systems.
Error recovery test scenarios
| Scenario | What breaks | What recovery requires | What to measure |
|---|---|---|---|
| Wrong recommendation acted upon | User followed AI advice that was incorrect | Undo the action, find the correct answer, re-do the task | Time to recover, task completion after error, frustration level |
| AI cannot complete the task | AI output is insufficient or refuses to answer | Fall back to manual workflow or alternative tool | Is the fallback path clear? How much time does it add? |
| AI output degrades mid-task | AI works for first 3 steps, then produces garbage on step 4 | Salvage the partial work, complete the remaining task manually | Can users identify where the AI went wrong? Do they lose their work? |
| Error compounds | AI makes a subtle error early that causes cascading problems | Detect the root error, undo the cascade, restart from the correct point | Detection of root cause, ability to trace back, rework effort |
| Repeated errors | AI makes the same mistake after the user corrects it once | Find a permanent workaround or disable the AI feature | Does the user give up on the AI? How many retries before abandonment? |
Error recovery metrics
| Metric | Definition | Target |
|---|---|---|
| Recovery success rate | % of error scenarios where users complete the task despite the AI error | >75% |
| Recovery time | Additional time needed to complete the task after encountering an error, compared to no-error baseline | <2x the no-error time |
| Post-error trust | Trust rating immediately after an error compared to pre-error baseline | Should drop moderately (healthy skepticism), not collapse (abandon the product) |
| Post-error CSAT | Satisfaction rating for the overall experience after encountering errors | 4+/5 if recovery was smooth, <3/5 if recovery failed |
| Re-engagement rate | Does the user continue using the AI feature after the error or switch to manual? | >60% continue using AI (with appropriate caution) |
| Error attribution | Does the user blame the AI, themselves, or the product? | Healthy: “The AI got it wrong.” Unhealthy: “I must have done something wrong” or “This product is broken” |
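The first two metrics in the table can be computed per participant from task timings. An illustrative sketch, assuming each error-seeded task has a matched no-error baseline time (all numbers invented):

```python
# One participant's error-seeded tasks:
#   completed  - did they finish the task despite the AI error?
#   time_s     - total time on the error-seeded task
#   baseline_s - their time on a matched task with no seeded error
error_tasks = [
    {"completed": True,  "time_s": 210, "baseline_s": 120},
    {"completed": True,  "time_s": 300, "baseline_s": 180},
    {"completed": False, "time_s": 420, "baseline_s": 150},  # gave up
]

# Target: >0.75
recovery_success_rate = sum(t["completed"] for t in error_tasks) / len(error_tasks)

# Recovery time ratio is only meaningful for tasks actually completed.
ratios = [t["time_s"] / t["baseline_s"] for t in error_tasks if t["completed"]]
mean_recovery_ratio = sum(ratios) / len(ratios)  # target: < 2.0x no-error time
```

Excluding abandoned tasks from the time ratio matters: including a give-up timestamp would understate how costly the failed recovery actually was, so report abandonments separately through the success rate.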
The post-error debrief
After error scenarios, conduct a focused debrief:
- “Did you notice anything wrong with the AI’s output?” (tests unprompted detection)
- “At what point did you realize something was off?” (maps detection timeline)
- “What did you do when you found the error?” (maps recovery strategy)
- “How did the error change your trust in the AI?” (measures trust impact)
- “Would you continue using this AI feature after this experience?” (measures retention risk)
- “What would have helped you catch the error sooner?” (design improvement input)
How to analyze AI usability test results
The three-layer analysis framework
Layer 1: Standard usability analysis. Task completion, time-on-task, errors, satisfaction. Report these as you would for any usability test. They remain the baseline.
Layer 2: AI-specific behavioral analysis. Acceptance patterns, verification behavior, trust trajectory, error detection. Analyze these per-participant and look for behavioral segments: over-trusters (accept everything), healthy skeptics (verify appropriately), and under-trusters (reject everything).
Layer 3: Trust calibration analysis. Map user trust ratings against actual AI accuracy for each output. Calculate calibration scores. Identify which trust signals improved calibration and which created over-trust or under-trust.
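The Layer 2 behavioral segmentation can be operationalized with simple acceptance-rate cutoffs. The 0.5 thresholds below are illustrative assumptions, not prescribed by this guide; tune them against your own data:

```python
def segment(accept_correct_rate, accept_error_rate):
    """Classify one participant from their acceptance rates.

    accept_correct_rate: share of correct outputs accepted without question
    accept_error_rate:   share of seeded errors accepted without question
    """
    if accept_error_rate > 0.5:
        return "over-truster"      # accepts even wrong outputs
    if accept_correct_rate < 0.5:
        return "under-truster"     # rejects even correct outputs
    return "healthy skeptic"       # accepts correct outputs, catches errors

# Hypothetical participants: (accept_correct_rate, accept_error_rate)
participants = {
    "P1": (0.90, 0.80),  # accepts nearly everything
    "P2": (0.85, 0.20),  # accepts correct outputs, flags most errors
    "P3": (0.30, 0.10),  # distrusts everything
}
segments = {pid: segment(*rates) for pid, rates in participants.items()}
```

Cross-tabulating these segments with prior AI experience (captured in the warm-up phase) answers the report question of how segments differ by experience level.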
Reporting AI usability findings
Include these sections in your report beyond standard usability findings:
- Hallucination detection summary. What % of errors were caught? Which error types were hardest to detect? What interface elements helped or hindered detection?
- Trust signal effectiveness. Which signals improved calibration? Which created over-trust? Recommend the signal combination that produced the best calibration
- Error recovery assessment. Can users complete tasks after errors? Where does recovery break down? What design changes would improve recovery?
- Behavioral segments. What % of participants were over-trusters, healthy skeptics, and under-trusters? How do segments differ by AI experience level?
- Risk assessment. Based on error detection rates and over-trust patterns, what is the risk of users acting on incorrect AI outputs in production?
Frequently asked questions
How is this different from the AI feature testing guide?
The AI feature testing guide covers testing specific AI features (search, suggestions, categorization, chatbots) from a PM perspective with a focus on acceptance rates and task flows. This guide covers the usability testing methodology itself from a UXR perspective, going deep on the three areas that make AI usability testing unique: hallucination handling, trust signals, and error recovery. Use the feature guide to decide what to test. Use this guide to design how to test it.
How many seeded errors should you include per session?
Two to four seeded errors across 8-12 tasks, keeping the ratio in the 25-35% band (e.g., two errors in eight tasks, three in ten, four in twelve). Fewer than two does not give enough data points per participant. More than four risks overwhelming the session and breaking the participant's engagement with the test environment. Vary error types across factual, plausible-but-wrong, and confident-but-incorrect to test different detection skills.
Do you need AI/ML expertise to run AI usability tests?
No, but you need a pre-study briefing with the data science team. Understand: what are the model’s known failure modes? What is the actual accuracy rate? What confidence thresholds does the model use? This information lets you design realistic error scenarios and interpret findings correctly. You do not need to understand how the model works technically, only where it tends to fail and how confident it is when it does.
How do you prevent participants from gaming the error detection tasks?
Do not tell participants that errors are seeded. Frame the test as standard usability testing: “We are testing this AI-powered feature. Please complete these tasks as you normally would.” If participants know you are testing error detection, they will scrutinize every output artificially, which does not reflect real usage. During the debrief, you can reveal the error seeding and discuss their detection experience.
How often should you re-run AI usability testing?
After every model update that changes output behavior, and quarterly at minimum for live products. AI products degrade differently than traditional software: accuracy drifts, user expectations evolve, and trust patterns shift. A quarterly “trust health check” using a subset of your original test scenarios and metrics keeps you ahead of calibration drift.