> WAI Web site redesign project
Note: This Web page contains rough notes for discussion and should not be referenced or quoted under any circumstances. This Web page is under development by the Education and Outreach Working Group (EOWG).
Last updated $Date: 2003/09/23 17:56:45 $ by Shawn Henry <shawn @w3.org>
The following information is provided by Shawn Henry in response to concerns over using a small number of usability test participants. Note that there are many more resources that cover this topic, including conference papers. I included only those easily found online.
We often hear initial concerns from clients about the appropriate number of participants for a given round of testing. As with most things, the answer is "it depends." Technically, the discussion you're having is about the relative merits of formative (sometimes called "diagnostic") vs. summative (sometimes called "verification") testing. Formative testing is a quicker, broader test done to uncover a system's largest issues. Summative testing addresses subtler issues (typically after formative testing has been done and changes have been made) and generally involves a larger number of participants.
One deciding factor between them is the type of data you want out of the test. If you want to say "43.2% of users were able to locate the "Contact Us" information within 1.3 minutes," then summative testing is more appropriate. A typical finding from a formative test is more anecdotal, but no less valuable: "Participants had difficulty locating the organization's contact information. Many suggested the inclusion of a "contact us" link on the homepage." The simplicity of this example might suggest otherwise, but formative testing can provide detailed, specific, subtle, and valuable findings.
We often employ summative techniques for comparative tests ("Is Windows XP easier to use than Windows ME?") or to make legally defensible statistical, marketing, or regulatory claims ("This product is safe to use"). Summative evaluations generally involve larger numbers of participants; 30 is a typical minimum. These tests do not have a moderator in the room with the participant, and the focus is much more on timing the tasks and specifically tracking everything the participant does to complete each task. Based on our discussions, this would be overkill for what you're hoping to get out of this round of testing: an initial assessment of the website's core strengths and weaknesses, and a greater understanding of the information needs of your core audiences.
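A rough back-of-the-envelope sketch (my own illustration, not part of the original notes) of why summative claims call for larger samples: the margin of error on a measured task-success rate shrinks slowly with the number of participants. The 70% success rate below is a hypothetical figure, and the normal-approximation interval is a simplification.

```python
import math

def ci_halfwidth(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation (Wald) 95% confidence
    interval for an observed success proportion p with n participants."""
    return z * math.sqrt(p * (1 - p) / n)

# Hypothetical 70% observed task-success rate:
print(f"n=8:  +/- {ci_halfwidth(0.7, 8):.0%}")   # formative-sized sample
print(f"n=30: +/- {ci_halfwidth(0.7, 30):.0%}")  # typical summative minimum
```

Under these assumptions, 8 participants leave a margin of roughly plus or minus 32 percentage points, while 30 participants tighten it to about 16, which is why quantitative "43.2% of users" claims need the larger, summative-style sample.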
We can identify both desirable and undesirable website characteristics by watching people interact with your existing website. A sample of fewer than a dozen or so participants (our experience suggests about 8 people) is normally sufficient to produce useful findings, although the ideal number depends on several factors and is a topic of extensive (and vocal) debate among testing professionals. At the moment there is a lot of posturing going on out there, but I take a very practical, seasoned approach to the issue. Experimental data, and AIR's experience running hundreds of usability studies annually, suggest that a test involving 8-10 participants will effectively spotlight a substantial share of the website's strengths and weaknesses.
Based on the development questions and core audiences we have discussed, that's my recommendation. For an initial, formative benchmarking test, 8-10 people who represent one, perhaps two, core audiences, should be sufficient to spotlight core strengths and weaknesses of your website's information architecture, visual aesthetic, ease of use, and use satisfaction. Once the initial enhancements suggested by the formative study have been made, then you may wish to conduct a more formal, broader, summative study.
Typically we encourage observers not to draw actionable conclusions from the first few test sessions, because participant performance and opinions vary. However, clear patterns tend to emerge after a modest number of sessions, usually 5-8, which is why formative tests can involve just a small number of participants.
One factor we do consider when deciding how many participants to include in the study is how many of each type of user we'll involve to represent the core audiences. For example, if we want to include three audiences (say, Web developers, Web team managers, and CEOs), then we would want to ensure that enough people in each category are included to adequately represent each user population. For this initial assessment, however, I recommend keeping the focus on one, perhaps two, core audiences so that we can get a sense of how your primary audience interacts with and perceives the website.
Resources:
Usability.gov has a very helpful website that provides a lot of information about usability testing and the merits of including usability in the development process. http://www.usability.gov/
They have a great section on deciding what type of test to conduct: http://www.usability.gov/methods/type_of_test.html
I find the Useit.com article you mentioned (http://useit.com/alertbox/20000319.html) to be a fun and informative explanation of the issue, and by and large I agree, with one caveat: for practical purposes, and based on our extensive experience conducting usability tests (literally thousands of individual sessions annually), we find that 8 is a good middle ground. By participant 5 you've seen almost everything, but invariably someone among participants 5-8 surfaces something we were glad to see. That isn't to say participant 14 wouldn't be similarly helpful, but from a practical standpoint, given that testing costs time and money, we find that 8 participants are sufficient to catch 80-90% of a given product's most pressing issues. Iterative testing then enables us to make changes and, in a second round, evaluate the efficacy of those changes and catch the new "most pressing" set of issues.
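The diminishing returns described in that article can be sketched with the cumulative problem-discovery model behind it: the expected share of problems found after n participants is 1 - (1 - rate)^n. The 31% average per-participant detection rate is the figure commonly cited from Nielsen and Landauer's work; the real rate varies by product and task, so treat this as an illustration rather than a guarantee.

```python
def share_found(n: int, rate: float = 0.31) -> float:
    """Expected share of usability problems found after n participants,
    per the cumulative-discovery model 1 - (1 - rate)**n, where `rate`
    is the average chance that one participant hits a given problem."""
    return 1 - (1 - rate) ** n

for n in (1, 5, 8, 12):
    print(f"{n:2d} participants -> ~{share_found(n):.0%} of problems")
```

Under these assumptions, 5 participants catch roughly 84% of problems and 8 catch roughly 95%, consistent with the 80-90% ballpark above; each additional participant past that point adds little.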
UCD and UT Resources includes a list of general resources on usability testing, provided by Justin Thorpe.