AI explainability research: methods that work
How UX researchers test whether users actually understand why an AI made a decision, and which methods surface the gaps that matter.
AI explainability research tests whether users genuinely understand why an AI system behaves the way it does. It is not about whether they can complete tasks. It is about whether their mental model of the AI is accurate enough to trust it appropriately, override it when needed, and recover when it fails.
This guide covers the methods that consistently surface comprehension failures in AI products, who each method works best for, and how to sequence them into a practical research plan.
Why explainability research is different from usability testing
When you run a standard usability test, you are looking for friction: broken flows, confusing labels, missed affordances. Users either complete the task or they do not.
AI explainability research adds a second question: do users understand why the AI produced that result? Two users can complete the same task with very different mental models. One user correctly understands that the AI ranked the candidates by inferred seniority signals. The other believes it ranked by application date. Both users clicked “accept” on the top candidate. Only one of them will catch the model’s systematic bias in week three.
This gap between task completion and genuine comprehension is what explainability research is built to find.
The field draws from three bodies of work: interpretable machine learning research from academics like Christoph Molnar, XAI (explainable AI) design guidelines from groups like the Partnership on AI, and usability research methods adapted for AI-specific failure modes. As a UX researcher, you are operating at the intersection of all three.
Six methods for AI explainability research
1. Think-aloud with counterfactual probes
Think-aloud protocols are the backbone of usability research, but they need modification for explainability work. Standard think-aloud captures what users notice. Counterfactual probes capture what users believe would change the AI’s output.
After the user interacts with an AI feature, ask:
- “What would have to be different for the AI to give you a different answer?”
- “If you changed X in the input, what do you think would happen to the result?”
- “What information do you think the AI used to make this decision?”
These questions force users to externalize their mental model. You will quickly see whether they understand the relevant input features, whether they attribute causality correctly, and whether they have absorbed the explanation UI (if any exists) or are ignoring it.
This method works well with 8 to 10 participants in recorded sessions of 45 to 60 minutes each. It pairs naturally with prototype testing and works at any stage of product development.
2. Mental model interviews (before and after exposure)
The goal of this method is to capture what users believe before interacting with the AI, expose them to the product, then capture how their mental model changed (or did not change).
Before exposure: Ask users to describe how they imagine the AI works. Probe for their theory of what data it uses, how it weights factors, and how confident it is. Do not correct misconceptions.
After exposure: Return to the same questions. What changed? What did the explanation UI communicate? What did users still get wrong?
The delta between the two interviews is the signal. If users’ mental models did not shift toward accuracy after using the product, the explanation design is failing. If they shifted in the wrong direction, something in the UI is actively misleading them.
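One way to make the delta concrete is to code each interview against a short checklist of statements about how the AI works, then compare the share of accurate beliefs before and after exposure. The sketch below assumes that kind of coding; the `BeliefCoding` structure, the checklist keys, and the sample participant are hypothetical, not part of any standard instrument.

```python
from dataclasses import dataclass


@dataclass
class BeliefCoding:
    """Coded beliefs for one participant (hypothetical structure).

    Each key is a statement on the researcher's checklist; True means the
    participant's belief matched how the AI actually works.
    """
    participant: str
    before: dict[str, bool]
    after: dict[str, bool]


def accuracy(beliefs: dict[str, bool]) -> float:
    """Share of checklist beliefs coded as accurate."""
    return sum(beliefs.values()) / len(beliefs)


def mental_model_delta(coding: BeliefCoding) -> float:
    """Positive: shifted toward accuracy. Negative: shifted away from it."""
    return accuracy(coding.after) - accuracy(coding.before)


def classify_shift(delta: float) -> str:
    if delta > 0:
        return "toward accuracy"
    if delta < 0:
        return "away from accuracy"
    return "no shift"


# Made-up example: one belief was corrected after using the product.
p1 = BeliefCoding(
    participant="P1",
    before={"uses_purchase_history": False, "weights_recency": False, "shows_confidence": True},
    after={"uses_purchase_history": True, "weights_recency": False, "shows_confidence": True},
)
print(p1.participant, classify_shift(mental_model_delta(p1)))
```

A score like this only summarizes the coding. The interview transcripts still carry the detail about which part of the UI moved a belief in either direction.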
This method is particularly valuable for AI features that involve visible confidence scores, probability estimates, or recommendation explanations, all common in healthcare, fintech, and enterprise analytics products.
3. Attribution elicitation tasks
Attribution elicitation inverts the usual research dynamic. Instead of asking users to complete a task with the AI, you show them an AI output and ask them to explain it back to you.
Give participants a realistic scenario: “The AI flagged this loan application as high risk. What do you think led to that?”
Then compare their explanation against the actual model inputs (or the explanation the product provided). Track:
- Which factors users correctly identified
- Which factors they invented (confabulation)
- Whether they cited the explanation UI or generated their own theory
- Whether their attributed factors would lead to appropriate trust or miscalibrated trust
Attribution tasks are fast (15 to 20 minutes each), can be run remotely with screen share, and produce directly actionable design findings. They are especially effective for AI features that display any form of explanation: feature importance charts, “why this recommendation” panels, or confidence indicators.
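The first three items on the tracking list above reduce to set comparisons once each response has been coded into factor names. The following is a minimal sketch under that assumption; `score_attribution` and the loan factor names are illustrative inventions, and the fourth item (whether the attributed factors support appropriate trust) remains a researcher judgment that this code does not attempt.

```python
def score_attribution(cited, actual, shown_in_ui):
    """Compare the factors a participant cited against the real model inputs.

    cited:        factors the participant named when explaining the output
    actual:       factors the model actually used
    shown_in_ui:  factors the product's explanation UI displayed
    """
    cited, actual, shown = set(cited), set(actual), set(shown_in_ui)
    return {
        "correct": sorted(cited & actual),          # correctly identified
        "confabulated": sorted(cited - actual),     # invented by the participant
        "missed": sorted(actual - cited),           # real factors never mentioned
        "from_explanation_ui": sorted(cited & shown),
    }


# Made-up coding for one participant in the loan scenario.
result = score_attribution(
    cited=["debt_to_income", "zip_code"],
    actual=["debt_to_income", "missed_payments", "credit_history_length"],
    shown_in_ui=["debt_to_income", "missed_payments"],
)
for label, factors in result.items():
    print(f"{label}: {factors}")
```

Here the participant picked up one factor from the explanation UI, invented one, and missed two, which suggests the explanation was only partly absorbed.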
4. Trust calibration studies
Trust calibration research answers a specific question: does user confidence in the AI match the AI’s actual accuracy? It is the most quantitatively rigorous method in explainability research and the most likely to surface overconfidence bias.
Protocol: Show participants a series of AI outputs. After each one, ask them to rate how confident they are that the output is correct (on a scale of 0 to 100%). Then reveal whether the AI was actually right or wrong.
Track the gap between user confidence and AI accuracy across conditions:
| Condition | Avg user confidence | AI accuracy | Calibration gap (confidence minus accuracy) |
|---|---|---|---|
| With explanation UI | 78% | 65% | +13 pts (overconfident) |
| Without explanation UI | 61% | 65% | -4 pts (well-calibrated) |
| Expert users | 71% | 65% | +6 pts |
| Non-expert users | 83% | 65% | +18 pts |
A positive calibration gap (users more confident than the AI deserves) is the most dangerous failure mode. It predicts automation bias: users will stop checking the AI’s work because they believe it is more reliable than it is.
For statistical validity, trust calibration studies typically require 30 to 50 participants per condition. If your timeline only allows qualitative work, you can still run a lightweight version with 8 to 10 participants and treat the findings as directional.
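The calibration gap itself is simple arithmetic: average stated confidence minus the share of outputs the AI actually got right, both on a 0 to 100 scale. A minimal sketch follows, assuming each rating is logged as a (condition, confidence, correctness) row; the function names and the sample rows are hypothetical and made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def calibration_gap(ratings):
    """Return average confidence minus AI accuracy, in percentage points.

    ratings: list of (confidence_pct, ai_was_correct) pairs, one per rated output.
    A positive result means participants were overconfident.
    """
    avg_confidence = mean(conf for conf, _ in ratings)
    ai_accuracy = 100 * mean(1 if correct else 0 for _, correct in ratings)
    return avg_confidence - ai_accuracy


def gaps_by_condition(rows):
    """rows: list of (condition, confidence_pct, ai_was_correct) tuples."""
    grouped = defaultdict(list)
    for condition, conf, correct in rows:
        grouped[condition].append((conf, correct))
    return {condition: calibration_gap(r) for condition, r in grouped.items()}


# Made-up ratings, not study data.
rows = [
    ("with_explanation", 90, True),
    ("with_explanation", 80, False),
    ("with_explanation", 70, True),
    ("with_explanation", 80, False),
    ("without_explanation", 60, True),
    ("without_explanation", 40, False),
]
gaps = gaps_by_condition(rows)
for condition, gap in gaps.items():
    print(f"{condition}: {gap:+.0f} pts")
```

With the 8 to 10 participant version, a per-condition gap like this is best read as a direction to investigate rather than a measured effect.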
5. Comprehension error taxonomy
After running multiple sessions, you will see comprehension failures cluster into recurring types. Building a taxonomy of these errors helps you prioritize which ones to fix first.
Common categories:
Scope errors: Users believe the AI accounts for factors it does not use. (“I assumed it was checking my calendar availability, but it doesn’t have access to that.”)
Confidence errors: Users treat probabilistic outputs as binary. (“The AI said there’s a 60% chance of churn, so I marked those accounts as churning.”)
Stability errors: Users believe the AI is more consistent than it is. (“I thought if I ran the same analysis twice I’d get the same result.”)
Attribution errors: Users credit or blame the wrong input feature. (“I thought it rejected the application because of the industry. Turns out it was the company size.”)
Each error type maps to a different design intervention: scope errors point to better input transparency; confidence errors point to improved probability communication; stability errors point to variance disclosure; and attribution errors point to clearer feature importance displays.
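Once session notes are tagged with these four error types, a simple tally can show which category dominates and which intervention it points to. The sketch below assumes one tag per observed error and ranks by frequency alone, which is an assumption made here for illustration; severity may matter more than count for a given product. The `prioritize` function and the sample tags are hypothetical.

```python
from collections import Counter

# Error type -> design intervention, as described in the taxonomy above.
INTERVENTIONS = {
    "scope": "better input transparency",
    "confidence": "improved probability communication",
    "stability": "variance disclosure",
    "attribution": "clearer feature importance displays",
}


def prioritize(coded_errors):
    """Rank error types by how often they were observed.

    coded_errors: list of error-type tags, one per observed comprehension failure.
    Raises KeyError if a tag is not one of the four taxonomy categories.
    """
    counts = Counter(coded_errors)
    return [(etype, n, INTERVENTIONS[etype]) for etype, n in counts.most_common()]


# Made-up tags from a handful of sessions.
tags = ["scope", "confidence", "scope", "attribution", "confidence", "scope"]
ranking = prioritize(tags)
for etype, n, fix in ranking:
    print(f"{etype}: {n} observations -> {fix}")
```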
6. Longitudinal mental model tracking
Single-session research can only show you the initial mental model. Longitudinal studies track how users’ understanding of the AI evolves over weeks of real use.
This method is resource-intensive but reveals patterns that single-session research misses: do users eventually calibrate correctly through trial and error? Do they develop accurate heuristics? Or do they drift toward dangerously confident misconceptions the longer they use the product?
For most teams, a lightweight version works well: a diary study with 10 to 15 participants over 3 to 4 weeks, with a brief weekly probe on a specific AI decision each participant encountered. At the end, run a 30-minute debrief interview.
Longitudinal tracking is most valuable for AI products that users interact with daily, where automated decision support could silently degrade outcomes if users stop questioning it.
Choosing the right method for your stage
| Stage | Best method | Why |
|---|---|---|
| Early prototype | Think-aloud with counterfactual probes | Fast, flexible, no accuracy data needed |
| Explanation UI design | Attribution elicitation tasks | Directly tests whether the explanation works |
| Pre-launch validation | Trust calibration study | Catches automation bias before it ships |
| Post-launch monitoring | Longitudinal mental model tracking | Surfaces drift and real-world comprehension failure |
| Any stage | Mental model interviews | Establishes baseline, works with any fidelity |
Recruiting the right participants
Explainability research is highly sensitive to participant domain expertise. A power user with deep domain knowledge will develop workarounds and explanations that novice users never will. You need to recruit across the expertise spectrum intentionally.
For most AI products, recruit three distinct profiles:
- Domain experts with low AI familiarity: These users have the knowledge to evaluate AI outputs but may not have strong mental models of how AI works. They are likely to confabulate sophisticated-sounding explanations.
- AI-familiar users with low domain expertise: These users understand AI systems conceptually but may over-generalize from other AI products they have used.
- Novices across both dimensions: These users show you the full extent of miscalibration possible in your user base.
Platforms like CleverX, with an 8M+ verified panel spanning B2B professionals and consumer audiences across 150+ countries, make it possible to screen tightly for both dimensions simultaneously. For a healthcare AI product, for example, you might need clinicians who have never used AI diagnostic tools as well as data scientists who use AI daily but have no clinical background.
For more on recruiting the right mix for AI product research, see our guide to user research for AI products in 2026.
What to do with explainability research findings
Explainability research findings fall into two categories: comprehension failures (users do not understand the AI correctly) and calibration failures (users trust the AI at the wrong level).
Comprehension failures usually require explanation UI redesign: clearer labels, better input transparency, more concrete counterfactual examples. Attribution elicitation tasks and think-aloud sessions will tell you exactly what users think is driving the output, which shows you where the explanation currently misleads.
Calibration failures often require more systemic changes: better uncertainty communication, friction before high-stakes automated decisions, or confidence score redesign. Trust calibration studies will give you the data to make the case for these changes to product and engineering teams.
For teams working on features where AI outputs have high stakes (hiring, lending, clinical decision support), explainability research findings also feed directly into compliance and accountability documentation. The NIST AI Risk Management Framework explicitly lists “explainable and interpretable” among its characteristics of trustworthy AI, and UX research evidence is increasingly cited in AI governance reviews.
For a broader view of how to study AI-specific failure modes including hallucination perception, see researching AI hallucination perception with end users and how to validate AI-generated research insights.
If you are testing an AI feature end-to-end, the methods here work well alongside the five-step playbook in how to test AI features in your product.
Frequently asked questions
What is AI explainability research?
AI explainability research is a UX discipline that tests whether users can accurately understand, predict, and calibrate trust in an AI system’s outputs. It goes beyond usability testing to examine the mental models users form around why the AI behaves the way it does, not just whether they can complete tasks.
How is AI explainability research different from standard usability testing?
Standard usability testing focuses on task success and efficiency. AI explainability research specifically probes comprehension of AI reasoning: do users understand why a recommendation was made, what data drove a decision, and when to override the system? It often requires custom think-aloud probes and attribution tasks that do not appear in generic usability protocols.
Which methods work best for testing AI explainability?
The most effective methods are think-aloud protocols with counterfactual probes, mental model interviews before and after AI exposure, attribution elicitation tasks (asking users to explain the decision back to you), and trust calibration studies that compare user confidence against actual AI accuracy. Each method surfaces a different layer of comprehension failure.
How many participants do you need for AI explainability research?
For qualitative explainability studies, 8 to 12 participants per audience segment is sufficient to saturate comprehension failure patterns. If you are running trust calibration studies that require statistical comparison of user confidence versus accuracy, plan for at least 30 participants per condition.
What are the biggest pitfalls in AI explainability research?
The most common pitfalls are testing only expert users (who develop workarounds), ignoring the gap between the domain knowledge the AI draws on and what the user knows, conflating task completion with genuine understanding, and treating post-task surveys as a proxy for mental model depth. Overconfidence bias is particularly hard to detect without calibration tasks.
How can CleverX help with AI explainability research?
CleverX provides access to an 8M+ verified panel across B2B and B2C audiences in 150+ countries, letting you recruit participants with specific domain expertise or AI product experience. For explainability studies that require nuanced probing, CleverX’s AI-moderated interview capability lets you run more sessions in parallel without sacrificing probe depth.