Important note: This Wiki page is edited by participants of the RDWG. It does not necessarily represent consensus and it may have incorrect information or information that is not supported by other Working Group participants, WAI, or W3C. It may also have some very useful information.

Benchmarking Web Accessibility Evaluation Tools

Although the existing plethora of web accessibility evaluation tools [1] is evidence of intense activity in the area, what is lacking, and what prevents actual progress, is a quality evaluation framework that would allow effective benchmarking of those tools.


Page author(s): Simon Harper

Other contact(s): Chaals, Rui Lopes, Giorgio Brajnik, Markel Vigo, Yeliz Yesilada


Evaluation, Benchmarking, Metrics, Samples.


Such a framework would then

  1. support tool users in selecting the tools that are most suitable for them;
  2. help tool users interpret tools’ outputs and estimate error rates; and
  3. help tool manufacturers improve the quality of what they deliver.

Because deciding what “quality” means in this context is itself likely to be open to debate, we suggest an incremental, open, and practical approach. Such an approach has been followed over the past two decades in Information Retrieval, with enormous success. The research community organized the TREC competitions [2], in which information retrieval systems are tested against a common corpus of resources (documents, websites, etc.) in order to find the algorithm that performs best according to certain rules that are operational, clearly set, and open to evolve year after year.

To proceed analogously with accessibility evaluation tools, we suggest holding “Web Accessibility Evaluation Tools Competitions” (WETCOM), open to research teams and to manufacturers, run each year, and published as experimental research papers. Within each competition, appropriate test collections (of pages, sites, guidelines, etc.) are defined, as well as operational criteria to evaluate tools. To begin with, we could use measures of false positives and false negatives, and the number of WCAG 2.0 criteria/techniques covered, such as those in [3]. At the end of each competition, feedback from participants will be collected regarding the test collections and evaluation criteria, and considered for the next year’s competition.
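As a minimal sketch of how the starting measures mentioned above might be computed, assume a test collection supplies a ground-truth set of (page, criterion) failures and each tool reports its own set in the same form. Every name here (`ToolScore`, `score_tool`, `criteria_coverage`) and the data layout are hypothetical illustrations, not part of any existing competition or tool API. Coverage is assumed to come from the criteria a tool declares that it tests, which is only one possible definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolScore:
    """Counts for one tool on one test collection (hypothetical structure)."""
    true_positives: int
    false_positives: int
    false_negatives: int

    @property
    def precision(self) -> float:
        # Share of reported failures that are real failures.
        reported = self.true_positives + self.false_positives
        return self.true_positives / reported if reported else 0.0

    @property
    def recall(self) -> float:
        # Share of real failures that the tool reported.
        actual = self.true_positives + self.false_negatives
        return self.true_positives / actual if actual else 0.0


def score_tool(ground_truth: set, reported: set) -> ToolScore:
    """Compare a tool's reported (page, criterion) failures with the ground truth."""
    return ToolScore(
        true_positives=len(ground_truth & reported),
        false_positives=len(reported - ground_truth),
        false_negatives=len(ground_truth - reported),
    )


def criteria_coverage(declared: set, in_scope: set) -> float:
    """Fraction of in-scope criteria that the tool declares it tests (assumption)."""
    return len(declared & in_scope) / len(in_scope) if in_scope else 0.0


# Illustrative data only: pages and criterion numbers are made up for the example.
truth = {("page1.html", "1.1.1"), ("page1.html", "2.4.2"), ("page2.html", "1.1.1")}
reported = {("page1.html", "1.1.1"), ("page2.html", "1.1.1"), ("page2.html", "3.1.1")}
score = score_tool(truth, reported)
print(score, round(score.precision, 3), round(score.recall, 3))
```

Treating results as sets of (page, criterion) pairs keeps the comparison simple, but it sidesteps the harder question of how to match individual issue instances within a page, which a real competition would need to settle.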


  • How do we build test-collections?
  • Which criteria should be used to define them?
  • Can we make use of existing test collections?
  • How do we operationalize the notions of false positives and false negatives?
  • How do we rank the evaluation tools?
  • How do we balance the ranking between criteria coverage and false results?
  • How do we measure (the lack of) automation in criteria evaluation?
  • Which guidelines/standards must/should/could be included in the benchmarking framework (WCAG 2.0, WAI-ARIA, 508, ...)? (Bonus question: should we have a generic framework that would be instantiated into specific guidelines?)
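To make the ranking questions more concrete, here is one possible sketch, not a proposed answer: each tool gets a single score that blends criteria coverage with an F-measure over precision and recall. The weighting scheme, the `coverage_weight` parameter, and the function names are all assumptions introduced for illustration; choosing them is exactly what the questions above leave open.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta score; returns 0.0 when both inputs are 0."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def combined_score(precision: float, recall: float, coverage: float,
                   coverage_weight: float = 0.5, beta: float = 1.0) -> float:
    """Weighted blend of coverage and correctness (hypothetical scoring rule)."""
    if not 0.0 <= coverage_weight <= 1.0:
        raise ValueError("coverage_weight must be between 0 and 1")
    correctness = f_measure(precision, recall, beta)
    return coverage_weight * coverage + (1 - coverage_weight) * correctness


def rank_tools(results: dict, coverage_weight: float = 0.5) -> list:
    """Return tool names, best first; ties are broken alphabetically.

    `results` maps a tool name to a (precision, recall, coverage) tuple.
    """
    scored = {
        name: combined_score(p, r, c, coverage_weight=coverage_weight)
        for name, (p, r, c) in results.items()
    }
    return sorted(scored, key=lambda name: (-scored[name], name))


# Illustrative numbers only: a precise but narrow tool and a broader, noisier one.
results = {
    "tool_a": (0.9, 0.5, 0.3),
    "tool_b": (0.6, 0.6, 0.8),
}
print(rank_tools(results, coverage_weight=0.5))
print(rank_tools(results, coverage_weight=0.0))
```

With these made-up numbers the order flips depending on `coverage_weight`, which illustrates why the balance between coverage and false results has to be agreed on before any ranking is published.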


  1. Complete list of accessibility evaluation tools
  2. Text REtrieval Conference (TREC)
  3. WCAG 2.0 BenToWeb Test Suite
  4. [DRAFT WCAG 2.0 Test Samples Repository]
  5. WAI Evaluation and Testing Activities
  6. [DRAFT WCAG 2.0 Evaluation Methodology Task Force (Eval TF) Work Statement]
  7. W3C Test Infrastructure framework
  8. G. Brajnik. The troubled path of accessibility engineering: an overview of traps to avoid and hurdles to overcome. Newsletter ACM SIGACCESS Accessibility and Computing, Issue 100, June 2011.
