User research for chatbot design: a complete guide for UX researchers
How to conduct user research for chatbot and conversational AI design. Includes a comparison table of chatbot testing methods, conversation flow testing, Wizard of Oz protocols, fallback analysis, and metrics for measuring chatbot UX quality.
Chatbots fail in ways that other software does not. A broken button produces the same error every time. A chatbot produces a different failure for every user, because every conversation takes a unique path through an unpredictable space of intents, phrasings, context switches, and edge cases.
That unpredictability makes chatbot user research fundamentally different from standard product research. You cannot define a fixed task flow and measure whether users complete it. You must observe how conversations unfold naturally, where they break down, how users recover (or do not), and whether the chatbot’s personality, tone, and response patterns match what users expect.
This guide covers how UX researchers study chatbot and conversational AI experiences, from early-stage Wizard of Oz testing through post-launch conversation log analysis. The centerpiece is a comparison of chatbot testing methods, because choosing the right method at each stage is the difference between a chatbot that handles real conversations and one that only works in demos.
For broader context on researching AI products (trust, explainability, mental models), see our user research for AI products guide. For testing individual AI features including chatbots, see our AI feature testing guide.
Key takeaways
- Chatbot research requires testing conversations, not tasks. Users do not follow linear paths through chatbot interactions, so your research must account for branching, context switching, and conversational dead ends
- Wizard of Oz testing is the highest-value method in early stages because it lets you test conversation design before building the NLU model, saving months of development on the wrong conversational flows
- The comparison table of methods below maps each testing approach to the chatbot development stage where it produces the most value, so you invest research effort at the right time
- Fallback and error recovery testing is more important than happy-path testing. Users forgive a chatbot that gracefully says “I don’t understand” far more than one that confidently gives a wrong answer
- Post-launch conversation log analysis is the only way to discover the conversation patterns, phrasings, and intents that real users bring that your design never anticipated
Comparison table of chatbot testing methods
This is the core reference. Each method maps to a development stage, participant requirement, and primary research question.
| Method | Development stage | Participants needed | What it tests | Primary metric | Time investment | Best for |
|---|---|---|---|---|---|---|
| Wizard of Oz | Pre-development, early concept | 8-10 per round | Conversation flow design, user expectations, phrasing patterns | Conversation completion rate, user phrasing diversity | 2-3 weeks including setup | Validating conversation design before building NLU |
| Scripted prototype testing | Early development (basic flows built) | 5-8 per round | Happy-path usability, intent mapping accuracy, UI interaction | Task success rate, time to resolution | 1-2 weeks | Testing structured flows (menus, buttons, guided paths) |
| Think-aloud usability sessions | Mid-development (functional prototype) | 5-8 per round | Real-time conversation quality, user reasoning, confusion points | Think-aloud friction moments, conversation abandonment points | 1-2 weeks | Understanding why users phrase things the way they do |
| Scenario-based testing | Mid to late development | 8-12 per round | Edge cases, multi-turn handling, context retention, topic switching | Error recovery success, context retention accuracy | 2 weeks | Testing robustness beyond happy paths |
| A/B conversation testing | Late development, live beta | 100+ per variant | Tone, personality, response length, explanation style preferences | Completion rate, CSAT, response acceptance rate | 2-4 weeks | Optimizing conversation style at scale |
| Diary studies | Post-launch or extended beta | 10-15 over 2-4 weeks | Long-term engagement, repeat visit patterns, trust evolution | Return frequency, conversation depth over time | 4-6 weeks | Understanding how chatbot relationships develop |
| Conversation log analysis | Post-launch | No participants (uses production logs) | Real user intents, unanticipated phrasings, drop-off patterns, NLU failures | Fallback rate, intent coverage, conversation completion | Ongoing | Discovering what users actually ask vs. what you designed for |
| Unmoderated testing | Late development, live product | 20-50 per round | Task completion at scale, basic usability, first-impression reactions | Task success rate, SUS score, time to first message | 1 week | Quick quantitative benchmarks between releases |
How to choose the right method
Before you build anything: Start with Wizard of Oz. It is the single most cost-effective chatbot research method because it prevents you from building NLU models for conversation flows that users do not need.
When you have a basic prototype: Run think-aloud usability sessions to understand how users naturally phrase their requests. This directly informs your NLU training data.
When the chatbot handles happy paths: Switch to scenario-based testing that deliberately pushes the chatbot into edge cases, context switches, and failure modes.
When you are live: Conversation log analysis becomes your primary research tool. It typically reveals that a large share of user intents and phrasings, often 40-60% in practice, are ones your design never anticipated.
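As a sketch of what that triage can look like in code: the snippet below computes a fallback rate from exported logs and surfaces the most frequent unrecognized phrasings as candidates for new intents or missing training utterances. The `(user_text, matched_intent)` log schema is an assumption; adapt it to whatever your chatbot platform actually exports.

```python
from collections import Counter

def analyze_logs(messages):
    """Summarize production chat logs.

    `messages` is a list of (user_text, matched_intent) pairs, where
    matched_intent is None when the NLU fell back. This schema is
    illustrative, not tied to any specific platform.
    """
    total = len(messages)
    fallbacks = [text for text, intent in messages if intent is None]
    fallback_rate = len(fallbacks) / total if total else 0.0
    # The most frequent unrecognized phrasings are candidate new intents
    # or missing training utterances for existing ones.
    top_unrecognized = Counter(t.lower().strip() for t in fallbacks).most_common(5)
    return fallback_rate, top_unrecognized

log = [
    ("track my order", "order_status"),
    ("where is my stuff", None),
    ("where is my stuff", None),
    ("cancel subscription", None),
    ("what's your return policy", "returns"),
]
rate, unrecognized = analyze_logs(log)
```

Even this crude count answers the two questions that matter post-launch: how often the bot fails to understand, and what it fails to understand most often.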
How to run Wizard of Oz testing for chatbots
Wizard of Oz (WoZ) testing is the most powerful early-stage chatbot research method. A human “wizard” sits behind the interface and responds to users in real time, simulating the chatbot’s behavior. Users believe they are talking to a bot.
Setup
The wizard needs:
- Access to the chat interface where they can type responses
- A response guide with tone, personality, vocabulary constraints, and key answers
- A decision tree for common intents (but freedom to handle unexpected ones)
- A way to introduce realistic delays (instant responses break the illusion; 1-3 second delays feel natural)
The participant sees:
- A standard chat interface (text input, message bubbles, bot avatar)
- No indication that a human is responding
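One detail worth automating is the response delay, since a rushed wizard tends to reply instantly and break the illusion. A minimal sketch, assuming your wizard tooling can call a small helper before sending: the delay grows with response length and is capped at the 3-second ceiling mentioned above. The constants are illustrative; tune them in pilot sessions.

```python
import time

def wizard_delay(response: str, base=1.0, per_char=0.03, cap=3.0) -> float:
    """Return a human-feeling delay (seconds) before sending a response.

    Base interval plus a typing-speed term, capped so long answers
    do not leave the participant staring at a silent screen.
    The constants here are assumptions to calibrate in pilots.
    """
    return min(cap, base + per_char * len(response))

def send_as_bot(response: str, send_fn, sleep_fn=time.sleep):
    """Wait a realistic interval, then deliver the wizard's response."""
    sleep_fn(wizard_delay(response))
    send_fn(response)
```

`send_fn` is a stand-in for whatever sends a message through your chat interface; injecting `sleep_fn` keeps the helper testable.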
What to capture
| Data point | How to capture | Why it matters |
|---|---|---|
| User’s exact phrasing for each intent | Record all messages | Becomes NLU training data. Real phrasings differ from what designers predict |
| Conversation branching points | Map each conversation as a flow diagram | Reveals the actual conversation structure users create vs. the one you designed |
| Wizard’s improvised responses | Log when the wizard goes off-script | Identifies intents and scenarios your design missed |
| Conversation breakdowns | Note when the wizard struggles to respond within the bot’s constraints | Reveals where the chatbot design has gaps |
| User satisfaction moments | Post-session interview: “When did the bot feel helpful vs. frustrating?” | Maps emotional experience to specific conversation moments |
WoZ protocol tips
- Run 8-10 sessions minimum. Chatbot conversations are highly variable, so you need more sessions than standard usability testing to see patterns
- Brief the wizard on personality constraints: response length, formality level, emoji usage, and what the bot should NOT attempt to do
- After each session, debrief with the wizard: “Where did you have to improvise? What did users ask that surprised you?”
- Record the wizard’s screen alongside the user’s screen so you can analyze response timing and decision-making
How to test conversation flows
Conversation flow testing goes beyond task completion to evaluate whether the chatbot maintains coherent, helpful conversations across multiple turns.
Multi-turn conversation testing
Design test scenarios that require 4-8 turns of conversation, not just a single question and answer. Multi-turn testing reveals:
- Context retention. Does the chatbot remember what the user said 3 turns ago? “I want to book a flight to London” followed by “Make it next Tuesday” followed by “Actually, Paris instead.” Can the chatbot track all three modifications?
- Topic switching. Can the user change topics mid-conversation and return to the original topic? “I was asking about my order. Actually, what are your return policies? OK, back to my order.”
- Clarification handling. When the chatbot asks a clarifying question, do users answer in the format it expects? If the bot asks “Which size?” and the user says “The big one,” does it handle the ambiguity?
- Conversation repair. When the conversation goes off track, can the user (or the bot) get it back on track? Test what happens when users say “That’s not what I meant” or “Start over.”
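Once the bot has any kind of programmatic interface, checks like these can be scripted and rerun between builds. The harness below is a sketch using the London/Tuesday/Paris example above: it assumes a bot client exposing a `send()` method and a `state` dict of tracked slots (both assumptions about your stack), and reports the first turn where context was lost.

```python
def run_scenario(bot, turns):
    """Replay a scripted multi-turn scenario against a bot client.

    `turns` is a list of (user_text, expected_state) pairs. Returns the
    index of the first turn where context was lost, or None if all passed.
    The bot interface (send method, state dict) is a hypothetical sketch.
    """
    for i, (text, expected) in enumerate(turns):
        bot.send(text)
        # Context retention check: every expected slot must still hold.
        if any(bot.state.get(k) != v for k, v in expected.items()):
            return i
    return None

class FakeFlightBot:
    """Toy stand-in that tracks destination and date slots."""
    def __init__(self):
        self.state = {}
    def send(self, text):
        lowered = text.lower()
        if "london" in lowered:
            self.state["destination"] = "London"
        if "paris" in lowered:
            self.state["destination"] = "Paris"
        if "tuesday" in lowered:
            self.state["date"] = "Tuesday"

scenario = [
    ("I want to book a flight to London", {"destination": "London"}),
    ("Make it next Tuesday", {"destination": "London", "date": "Tuesday"}),
    ("Actually, Paris instead", {"destination": "Paris", "date": "Tuesday"}),
]
failed_at = run_scenario(FakeFlightBot(), scenario)
```

The value is the scenario script, not the toy bot: the same `turns` list works against the real chatbot once it has a test client.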
Fallback and error testing
Fallback testing is more important than happy-path testing. Design specific scenarios that trigger failures:
| Scenario | What it tests | What to measure |
|---|---|---|
| Query outside the bot’s scope | Does the bot acknowledge its limitation or hallucinate? | Hallucination rate, escalation clarity |
| Misspelled or slang input | NLU robustness for non-standard language | Intent recognition accuracy for informal input |
| Emotional or frustrated user | Does tone adaptation work? Does the bot recognize frustration? | Tone appropriateness, escalation trigger accuracy |
| Repeated question (user asks same thing twice) | Does the bot recognize repetition as confusion or provide the same answer? | Response variation, confusion detection |
| Ambiguous intent | How does the bot disambiguate? Does it ask the right clarifying question? | Disambiguation quality, user satisfaction with follow-up |
| Conversation dead end | What happens when the bot cannot help at all? | Escalation path clarity, human handoff quality |
Testing the human handoff
The transition from chatbot to human agent is one of the most critical and most neglected UX moments. Test:
- Trigger clarity. Can users request a human when they want one? Is the option visible?
- Context transfer. Does the human agent receive the conversation history? Does the user have to repeat themselves?
- Transition experience. How long does the handoff take? What does the user see during the wait? Is there a confirmation that a human is now connected?
- Post-handoff return. If the user comes back later, does the chatbot know they spoke with a human? Does it reference the resolution?
How to test chatbot personality and tone
Chatbot personality directly affects user trust, engagement, and willingness to return. Research must evaluate whether the designed personality matches user expectations.
Personality testing approach
A/B tone testing. Create 2-3 versions of the same conversation with different personality parameters:
| Personality dimension | Variant A | Variant B | What to measure |
|---|---|---|---|
| Formality | “I’d be happy to help you with that.” | “Sure thing! Let me look that up.” | Preference by context (support vs. sales vs. onboarding) |
| Verbosity | Brief responses (1-2 sentences) | Detailed responses (3-4 sentences with explanation) | Completion rate, satisfaction, perceived helpfulness |
| Empathy | Acknowledges frustration: “I understand that’s frustrating.” | Skips acknowledgment, goes straight to solution | CSAT difference, perceived warmth |
| Confidence | “Based on your account, the answer is X.” | “I think the answer might be X, but let me verify.” | Trust calibration, error tolerance |
| Humor | Light humor in responses | No humor, purely functional | Engagement for low-stakes vs. high-stakes conversations |
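With 100+ participants per variant, differences in completion rate can be checked with a standard two-proportion z-test. Below is a stdlib-only sketch using the pooled-variance normal approximation; the counts are invented, and for small samples or many simultaneous variants a proper stats package is the safer choice.

```python
from math import sqrt, erf

def completion_rate_z_test(success_a, n_a, success_b, n_b):
    """Two-proportion z-test on conversation completion rates.

    Returns (rate difference, two-sided p-value) under the
    pooled-variance normal approximation. A rough sketch, not a
    replacement for a real statistics library.
    """
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Normal CDF via erf: Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return p_a - p_b, p_value

# Hypothetical counts: 120/150 completions for the formal variant
# vs. 99/150 for the casual variant.
diff, p = completion_rate_z_test(120, 150, 99, 150)
```

If you run several personality dimensions at once, remember to correct for multiple comparisons before declaring a winner.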
Post-conversation interview questions:
- “How would you describe the bot’s personality?”
- “Did the tone feel appropriate for what you were trying to do?”
- “Was there a moment where the bot’s response felt off, either too casual or too formal?”
- “Would you use this chatbot again? Why or why not?”
What metrics to track for chatbot UX
Conversation quality metrics
| Metric | Target | How it’s measured | What it reveals |
|---|---|---|---|
| Conversation completion rate | >80% | Percentage of conversations where user’s intent was resolved | Overall chatbot effectiveness |
| Fallback rate | <15% | Percentage of user messages that trigger a fallback (“I don’t understand”) | NLU coverage gaps |
| Escalation rate | <20% (depends on domain) | Percentage of conversations handed to human agents | Where the chatbot reaches its limits |
| Mean turns to resolution | <5 for simple intents | Average number of conversation turns to resolve an intent | Conversation efficiency |
| First response relevance | >85% | Percentage of first bot responses that match user intent | NLU accuracy and response mapping |
| Context retention accuracy | >90% | Percentage of multi-turn conversations where context is maintained | Technical conversation management |
| User satisfaction (CSAT) | 4+/5 | Post-conversation survey | Overall experience quality |
| Return rate | Increasing over time | Percentage of users who return to the chatbot voluntarily | Long-term trust and value perception |
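Several of these metrics fall straight out of structured conversation records. A minimal sketch, assuming each conversation is exported as a dict with `resolved`, `escalated`, and `turns` fields (the record shape is an assumption; adapt it to your analytics export):

```python
from statistics import mean

def conversation_metrics(conversations):
    """Compute headline chatbot UX metrics from conversation records.

    Each record is assumed to carry `resolved` (bool), `escalated`
    (bool), and `turns` (int); this shape is illustrative.
    """
    n = len(conversations)
    completion_rate = sum(c["resolved"] for c in conversations) / n
    escalation_rate = sum(c["escalated"] for c in conversations) / n
    # Mean turns to resolution only makes sense over resolved conversations.
    resolved_turns = [c["turns"] for c in conversations if c["resolved"]]
    mean_turns = mean(resolved_turns) if resolved_turns else None
    return {
        "completion_rate": completion_rate,
        "escalation_rate": escalation_rate,
        "mean_turns_to_resolution": mean_turns,
    }

sample = [
    {"resolved": True, "escalated": False, "turns": 4},
    {"resolved": True, "escalated": False, "turns": 6},
    {"resolved": False, "escalated": True, "turns": 9},
    {"resolved": True, "escalated": False, "turns": 2},
]
m = conversation_metrics(sample)
```

Tracking these as a weekly time series matters more than any single snapshot: the targets in the table are thresholds, but the trend tells you whether NLU retraining is working.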
Behavioral metrics (from conversation logs)
- Drop-off points. Where in the conversation do users abandon? Map the most common abandonment turns
- Rephrase rate. How often do users rephrase after a bot misunderstanding? High rephrase rates indicate NLU failures
- Input diversity. How many different ways do users express the same intent? This directly feeds NLU training data
- Profanity and frustration markers. Track aggressive language as a signal of conversation breakdown
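Rephrase rate can be approximated directly from logs by comparing consecutive user messages. The sketch below uses lexical similarity via `difflib`; the 0.6 threshold is a heuristic assumption and should be calibrated against a hand-labeled sample of your own conversations (production pipelines often use embedding similarity instead).

```python
from difflib import SequenceMatcher

def rephrase_rate(user_messages, threshold=0.6):
    """Estimate how often users rephrase the same request.

    Flags a turn as a rephrase when it is lexically similar to the
    previous user message; identical repeats also count, since asking
    the same thing twice is itself a breakdown signal. The threshold
    is a heuristic assumption to calibrate on labeled data.
    """
    if len(user_messages) < 2:
        return 0.0
    rephrases = sum(
        1 for prev, curr in zip(user_messages, user_messages[1:])
        if SequenceMatcher(None, prev.lower(), curr.lower()).ratio() >= threshold
    )
    return rephrases / (len(user_messages) - 1)

msgs = [
    "track my order",
    "where is my order",   # rephrase of the previous request
    "what are your hours", # genuine topic change
]
rate = rephrase_rate(msgs)
```

Run this per conversation, then aggregate: a handful of conversations with very high rephrase rates usually points at a specific broken intent rather than a general NLU problem.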
How to recruit for chatbot research
Chatbot research participants need to be segmented by chatbot familiarity and use case context.
Segmentation
| Segment | Characteristics | Research value |
|---|---|---|
| Chatbot-naive users | Rarely or never use chatbots, prefer phone/email | Test baseline expectations, onboarding, and first-impression reactions |
| Chatbot-familiar users | Regularly use chatbots for support, shopping, or information | Test efficiency expectations, comparison to other bots, and power-user patterns |
| Domain experts (for vertical chatbots) | Deep knowledge in the chatbot’s domain (medical, legal, financial) | Test whether the chatbot’s domain knowledge meets expert expectations |
| Target persona users | Match the chatbot’s intended audience (customers, employees, patients) | Test real-world relevance and workflow fit |
Incentive benchmarks
| Session type | Rate | Notes |
|---|---|---|
| 30-minute moderated session | $75-125 | Standard for consumer chatbot testing |
| Wizard of Oz session (45 min) | $100-150 | Longer due to open-ended conversation |
| Diary study (2-4 weeks) | $150-250 total | Track repeated chatbot interactions over time |
| Domain expert session | $150-300/hr | Varies by domain (see our industry-specific guides) |
For general participant recruitment strategies, see our recruitment guide.
Frequently asked questions
How is chatbot research different from general AI product research?
Chatbot research focuses specifically on conversation quality: turn-by-turn flow, tone, personality, clarification handling, and human handoff. General AI product research covers broader trust, explainability, and mental model challenges across all AI product types. Chatbot research requires unique methods (Wizard of Oz, conversation log analysis, multi-turn scenario testing) that do not apply to non-conversational AI features.
How many test sessions do you need for chatbot research?
Eight to twelve for qualitative methods (WoZ, think-aloud usability) because conversation variability is high and you need more sessions to see patterns. For A/B tone testing, 100+ per variant for statistical significance. For conversation log analysis, no participant sessions are needed, but you need at least 1,000 conversations in your logs to identify meaningful patterns.
Should you test with the chatbot’s target use case or general tasks?
Target use case first, then expand. If your chatbot handles customer support, test with actual support scenarios before testing general conversation ability. Users who arrive with a real problem (order tracking, billing question, technical issue) interact with the chatbot differently than users given a generic prompt. Real scenarios produce more valid data.
When should you switch from Wizard of Oz to testing the actual chatbot?
When your NLU model can handle the top 20 intents that WoZ testing identified. Run a comparison: have the same scenarios tested with WoZ and with the actual chatbot. If the actual chatbot’s conversation completion rate is within 15% of the WoZ rate for those top intents, it is ready for direct testing. If the gap is larger, the NLU needs more training data before user testing adds value.
How do you test chatbots that use large language models (LLMs)?
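The two LLM-specific measurements, response consistency across repeated runs and hallucination rate on seeded questions, can be sketched as below. Lexical overlap and substring matching are crude proxies here (production evaluations typically use semantic similarity or human/LLM judges), and all the example data is invented.

```python
from difflib import SequenceMatcher
from itertools import combinations

def response_consistency(responses):
    """Mean pairwise lexical similarity across repeated runs of one prompt.

    A crude proxy for output stability; enough to flag wildly unstable
    responses, not a substitute for semantic evaluation.
    """
    pairs = list(combinations(responses, 2))
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

def hallucination_rate(answers, gold_facts):
    """Share of seeded factual questions answered without the known fact.

    `gold_facts` maps question -> substring a correct answer must
    contain; substring matching is a simplification of real fact checking.
    """
    misses = sum(
        1 for q, fact in gold_facts.items()
        if fact.lower() not in answers[q].lower()
    )
    return misses / len(gold_facts)

# Three runs of the same prompt (invented data).
runs = [
    "Your order ships Monday.",
    "Your order ships Monday.",
    "It ships on Monday.",
]
consistency = response_consistency(runs)

# One seeded question with a known answer; the bot gets the year wrong.
gold = {"When was the company founded?": "2014"}
answers = {"When was the company founded?": "We were founded in 2012."}
h_rate = hallucination_rate(answers, gold)
```

Even rough numbers like these give you a regression baseline: rerun the same seeded set after every prompt or model change and watch whether either metric moves.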
LLM-based chatbots add two challenges: output variability (different responses to the same input) and hallucination risk. Test with the same approaches in this guide, plus: (1) run the same scenario 3-5 times to measure response consistency, (2) seed factual questions where you know the correct answer and measure hallucination rate, (3) test whether users can distinguish confident-and-correct from confident-and-wrong responses. See our AI feature testing guide for detailed error detection methods.