Interview craft

How Many Customer Interviews Are Enough?

June 29, 2026

TL;DR: Research puts data saturation at 9–24 interviews, depending on sample homogeneity and how deep you probe. Guest, Bunce & Johnson (2006) found 12 interviews enough for homogeneous samples. Hennink, Kaiser & Marconi (2017) found code saturation at around 9 and meaning saturation between 16 and 24. But these numbers assume disciplined, behavior-focused interviews. Eight disciplined conversations will beat two hundred leading ones.

The number everyone quotes

Two studies anchor most practitioner thinking on qualitative sample size.

Guest, Bunce, and Johnson (2006) analyzed 60 in-depth interviews conducted with women in Ghana and Nigeria. They found that thematic saturation was reached by interview 12 in homogeneous samples, and that most core themes were visible by interview 6.

Hennink, Kaiser, and Marconi (2017) went further by separating two distinct thresholds. Code saturation, the point where no new codes or categories emerge, arrived around interview 9. Meaning saturation, the point where no new depth of understanding emerges, took 16 to 24 interviews.

Nielsen Norman Group's guidance on qualitative sample sizes broadly aligns. Five interviews can surface most issues for a single, tightly defined segment. Research spanning multiple distinct user groups needs proportionally more.

These numbers are real. They are also only half the picture.

Why the number is a trap

Both studies describe what happens when you run rigorous, non-leading, behavior-focused interviews with a consistent protocol across a homogeneous sample.

Most interviews don't do that.

A leading question doesn't generate new data. It confirms what you already believe. Ask someone "Do you find the onboarding process frustrating?" and you've invited agreement, not discovery. You can run that question 50 times and never reach genuine saturation, because you were never collecting real signal. You were collecting mirrors.

The same problem shows up in softer forms. An interviewer who paraphrases a vague answer ("So it sounds like you want something faster?") has just planted the word "faster" where the respondent didn't say it. The next question builds on that planted frame. By the end of the interview, you have a transcript full of the interviewer's hypotheses wearing the respondent's voice.

Count-based saturation assumes the data is real. If it isn't, more interviews don't help. They compound the error.

Saturation assumes disciplined interviews

What makes an interview generate real data?

The Mom Test rules are the shortest answer. Ask about past behavior, not future intent. Ask about the last time something happened, not what someone would do hypothetically. Never put the answer inside the question. Treat enthusiasm as noise until a concrete behavior backs it.

These rules are harder to follow than they look. Not because the ideas are complex, but because the pull toward confirmation is constant. A respondent's excitement feels like validation. A vague positive answer feels like progress. The trained move (staying in behavior, following the flinch, probing without leading) runs against the grain of a normal conversation.

This is where most research breaks down: not at the count, but at the craft.

A study reaches saturation when new interviews stop producing new codes. That only happens if each interview is generating codes in the first place. Codes come from behavior data, not from opinions, enthusiasm, or hypotheticals.
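One way to make this concrete is to count how many previously unseen codes each interview contributes. The sketch below assumes you have already coded each transcript into a set of labels; the function name and the example codes are hypothetical, invented for illustration, and are not taken from either study.

```python
from __future__ import annotations

from typing import Iterable


def new_codes_per_interview(coded_interviews: Iterable[set[str]]) -> list[int]:
    """Return how many previously unseen codes each interview contributed."""
    seen: set[str] = set()
    counts: list[int] = []
    for codes in coded_interviews:
        fresh = set(codes) - seen  # codes not observed in any earlier interview
        counts.append(len(fresh))
        seen |= fresh
    return counts


# Hypothetical codes, in interview order (illustration only).
interviews = [
    {"manual export", "spreadsheet workaround"},
    {"manual export", "approval delays"},
    {"spreadsheet workaround", "approval delays"},
    {"manual export"},
]
print(new_codes_per_interview(interviews))  # [2, 1, 0, 0]
```

A run of zeros at the end is what code saturation looks like on paper. The count says nothing about whether the codes came from behavior or from the interviewer's own framing, which is why the craft has to come first.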

If your interviews are disciplined (behavior-focused, non-leading, probing vague answers without planting new ones), then Guest et al. and Hennink et al. give you a reasonable target. For a homogeneous sample, plan for at least 12. For meaning saturation across a less uniform group, 16 to 24. If you only need the list of themes rather than their full depth, code saturation may arrive closer to 9.

If your interviews aren't disciplined, no sample size rescues the findings.

When fewer-but-deeper wins

Agencies often face this question early in a project: how many interviews can we fit in the budget?

The right answer is rarely "as few as possible." It's also not "as many as possible." It's enough disciplined interviews to reach saturation for this sample.

That framing changes the prioritization. If you have time and budget for 8 interviews, the craft of each one matters enormously. An 8-interview project with a rigorous, behavior-focused protocol will consistently outperform a 20-interview project with a leading guide and tired interviewers. The first project reaches saturation. The second circles the same hypothesis 20 times.

This is the case for fewer-but-deeper: if your protocol isn't disciplined, running more interviews doesn't fix the research. It produces more confident-sounding noise. That's worse than a smaller set of honest findings, because the larger dataset makes the bad data look authoritative.

This matters most for agencies billing discovery as a deliverable. A leading question is invisible in a transcript. It looks like research. Clients see interview counts, not interview quality. The discipline is what separates evidence from expensive theater, and it's not visible in the final report unless you know what to look for.

How volume changes the calculus

There is a legitimate case for running more interviews. Not to compensate for bad craft, but to cover genuinely distinct segments.

If your research spans three meaningfully different customer groups, you're running three saturation curves, not one. Guest et al.'s 12-interview threshold applies per homogeneous group. A project covering early adopters, mid-market buyers, and enterprise accounts may need 30 to 40 interviews to reach saturation across all three.
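As a rough planning sketch, the per-segment arithmetic looks like this. The function name is hypothetical, and applying a flat 12 interviews to every segment is a planning assumption based on Guest et al.'s figure for homogeneous samples; a given segment may saturate sooner or later.

```python
def planned_interviews(segments: list[str], per_segment: int = 12) -> int:
    """Estimate total interviews as one saturation target per distinct segment."""
    return len(segments) * per_segment


segments = ["early adopters", "mid-market buyers", "enterprise accounts"]
print(planned_interviews(segments))  # 36
```

Three segments at 12 each gives 36, which sits inside the 30 to 40 range above. Treat it as a budget estimate, not a stopping rule.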

Volume also matters for validation. Once you've identified core jobs and forces from a small initial set, a second wave can test whether the patterns hold. This is how discovery becomes evidence stakeholders act on.

But the arithmetic only works if each interview is generating real data. Scale disciplined research and you reach saturation faster with more confidence. Scale sloppy research and you produce a large dataset full of the interviewer's assumptions.

The economics of this are real for agencies running discovery at volume. More interviews mean more hours, which means either higher project costs or thinner margins. The sample-size question is inseparable from the business question: how do you hold craft at scale when humans get tired, junior researchers start leading, and interview hours cap margin?

A compliment is not data. Neither is a leading question, a paraphrase, or a hypothetical. The count is just the number of times you ran the interview. Whether the data is real is a separate question.

Frequently asked questions

How many customer interviews do I need?

Research puts the range at 9 to 24, depending on sample homogeneity and depth. Guest, Bunce & Johnson (2006) found saturation at around 12 interviews for homogeneous samples. Hennink, Kaiser & Marconi (2017) found code saturation near 9 interviews and meaning saturation between 16 and 24. Both studies assume disciplined, non-leading, behavior-focused interviews. Leading or shallow interviews don't reach genuine saturation regardless of count. They confirm hypotheses rather than discover behavior.

What is data saturation?

Data saturation is the point where new interviews stop producing new codes or themes. Hennink et al. (2017) separate two levels: code saturation, where no new categories emerge (around 9 interviews), and meaning saturation, where no new depth of understanding emerges (16 to 24 interviews). Both depend on each interview generating real behavior data. A leading interview produces codes that were already in the guide, not new signal from the respondent.

When should I stop interviewing?

Stop when new interviews stop producing new findings. Track what you're learning between sessions. If the last two or three interviews have confirmed existing patterns without adding anything, you've likely reached saturation for this sample. The catch: this only works if your interviews are behavior-focused and non-leading. Leading questions produce false saturation early. The data stops changing, but only because it was never real signal to begin with.
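If you log the number of new findings after each session, the stopping check can be written down in a few lines. This is a minimal sketch: the function name and the default window of three interviews are assumptions chosen to match the "last two or three interviews" rule of thumb, not a published threshold.

```python
def should_stop(new_findings_per_interview: list[int], window: int = 3) -> bool:
    """True when each of the last `window` interviews added nothing new."""
    if len(new_findings_per_interview) < window:
        return False  # too few interviews to judge
    return all(n == 0 for n in new_findings_per_interview[-window:])


print(should_stop([4, 2, 1, 0, 0, 0]))  # True
print(should_stop([4, 2, 1, 0, 1, 0]))  # False
```

The check is only as honest as the log behind it. A leading guide will produce zeros early, and the function cannot tell that kind of false saturation from the real thing.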

If you run discovery for clients and want interviews that actually reach saturation, Lùc is in closed beta.