This extended abstract is a contribution to the Online Symposium on Website Accessibility Metrics. Its contents have not been developed by the W3C Web Accessibility Initiative (WAI) and do not necessarily represent the consensus view of its membership.
Real-life websites often show less-than-perfect accessibility, even those that strive to be accessible. Some alt texts may be deficient, some enumerations may not use list elements, or some headings in an otherwise sensible hierarchy may not be marked up correctly.
For some WCAG success criteria (SC), binary pass/fail ratings make sense. For most SC, however, there is no discrete flip-over point at which a "pass" turns into a "fail". A graded rating scale can address this problem.
The German BITV-Test, a web-based accessibility evaluation tool, demonstrates this graded rating approach. We focus mainly on page-level rating and only briefly address the aggregation of page-level ratings into the overall test score.
Difficulties exist on a number of levels:
A relevant rating approach must reflect the complexity of real-life web content. We argue that pass/fail ratings make evaluations less valid and less reliable. If a single failure instance can fail an SC, hardly any website would ever conform to WCAG; moreover, a reasonably accessible web page with a few flaws would rate no better than a glaringly inaccessible one.
With a pass/fail rating, the evaluator must often choose between being too strict and being too lenient. When rating a good page with some flaws, the choice is between failing the whole page (too strict) and ignoring the flaws (too lenient).
Different evaluators are likely to draw the line between pass and fail differently. With only two extremes to choose from, no amount of precision in the test procedure can ensure that individual evaluators will rate less-than-perfect content the same way.
To address this problem, BITV-Test applies a graded rating system. With more accurate page-level ratings, the aggregated score better reflects the overall accessibility of the site tested.
BITV-Test's 50 checkpoints map to WCAG level AA. Each checkpoint has a weight of 1, 2 or 3 points, depending on criticality. The checkpoint for SC 2.1.1 Keyboard, for example, contributes 3 points to the total score, that for SC 3.1.3 Language just 1 point. A page passing all 50 checkpoints fully would receive a total score of 100 points.
When testing a page against a checkpoint, evaluators assess the overall pattern or the set of instances and apply a graded Likert-type scale with five rating levels, from a full "pass" (100%) down to "fail" (0%).
Ratings reflect both the frequency and the criticality of flaws. For ratings other than a full "pass", a percentage of the checkpoint's weight is recorded. When evaluating alt texts, for instance, detecting a crucial image-based control without a text alternative would lead to a "fail" rating (0%), while a single teaser image with an inadequate alt text among many good ones would be rated "marginally acceptable" (75%). Since this checkpoint carries a weight of 3 points, its contribution is reduced to 75%, i.e., 2.25 points. Page-level ratings are aggregated over the entire page sample. At a total score of 90 points or more, the site is considered accessible.
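The weighted, graded scoring described above can be sketched as follows. Checkpoint names, the label for the 25% level, and the aggregation method over the page sample (a mean is assumed here) are illustrative assumptions, not taken from the BITV-Test documentation.

```python
# Sketch of a weighted, graded scoring scheme in the style of BITV-Test.
# The five rating levels (100/75/50/25/0 %) and the 90-point threshold
# follow the text; everything else is an assumption for illustration.

# Each rating level keeps a fraction of the checkpoint's weight.
RATING = {
    "pass": 1.00,
    "marginally acceptable": 0.75,
    "partly acceptable": 0.50,
    "barely acceptable": 0.25,   # label assumed for the 25% level
    "fail": 0.00,
}

def page_score(ratings, weights):
    """Score for one page: each checkpoint contributes a fraction of its weight."""
    return sum(weights[cp] * RATING[grade] for cp, grade in ratings.items())

def site_score(page_ratings, weights, threshold=90):
    """Aggregate page scores over the sample (mean assumed here) and
    apply the 90-point accessibility threshold."""
    score = sum(page_score(p, weights) for p in page_ratings) / len(page_ratings)
    return score, score >= threshold

# Example: an alt-text checkpoint with weight 3 rated "marginally
# acceptable" keeps 75% of its weight, i.e. 2.25 points.
weights = {"alt text": 3}
print(page_score({"alt text": "marginally acceptable"}, weights))  # 2.25
```

With full weights for all 50 checkpoints, a page passing everything would reach the 100-point maximum described above.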
What is the advantage of a 5-point scale? In its initial form, BITV-Test used a 3-point scale: "pass" (100%), "partly acceptable" (50%), and "fail" (0%). This meant that minor flaws often led to the overly harsh rating of "partly acceptable". Quite accessible sites with a number of such ratings often received a worse final score than sites with fewer, but very significant, barriers. Introducing the 5-point scale made it easier to account for marginal flaws: they no longer 'sink' a site.
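The effect of moving from a 3-point to a 5-point scale can be illustrated numerically. The weights and flaw counts below are hypothetical; only the rating percentages and the 100-point maximum come from the text.

```python
# Why the 5-point scale matters: hypothetical comparison of two sites.

TOTAL = 100  # full score: all 50 checkpoints pass completely

def deduction(n_checkpoints, weight, kept_fraction):
    """Points lost when n checkpoints keep only a fraction of their weight."""
    return n_checkpoints * weight * (1 - kept_fraction)

# Site B: two weight-3 checkpoints fail outright (severe barriers).
site_b = TOTAL - deduction(2, 3, 0.00)        # 94.0

# Old 3-point scale: minor flaws in seven weight-2 checkpoints could
# only be rated "partly acceptable" (50%), so the quite accessible
# Site A ends up BELOW the severely flawed Site B.
site_a_old = TOTAL - deduction(7, 2, 0.50)    # 93.0

# 5-point scale: the same minor flaws rate "marginally acceptable" (75%),
# and Site A now scores above Site B.
site_a_new = TOTAL - deduction(7, 2, 0.75)    # 96.5

print(site_a_old, site_b, site_a_new)
```

Under the old scale, the site with only minor flaws ranks below the site with severe barriers; the finer scale restores the intended ordering.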
The reliability of an evaluation procedure can also be expressed as its degree of replicability: would another evaluator arrive at the same result? The BITV conformance test is conducted as a tandem test: two evaluators independently complete the test based on the same page sample. In the following arbitration phase, they run through all checkpoints rated differently and agree on a final rating. This both quality-assures the result and surfaces differences in interpretation between evaluators.
Our experience shows that the 5-point graded rating scale is quite reliable. Recently, a statistics function was added that records evaluators' individual ratings against the final, quality-assured ratings. The offset provides a benchmark for each evaluator's level of qualification, and individual evaluators learn from their own rating errors. We plan to analyse and publish these statistics to demonstrate the inter-evaluator reliability of BITV-Test.
The BITV-Test rating scheme is based on page-level assessments. Faced with sites that cram many different states or entire processes onto one page, the question arises of how the page-based rating scheme can be modified to adequately reflect these dynamic states.