This extended abstract is a contribution to the Online Symposium on Website Accessibility Metrics. The contents of this paper have not been developed by the W3C Web Accessibility Initiative (WAI) and do not necessarily represent the consensus view of its membership.
Monitoring web accessibility through regular benchmarks raises awareness and thus encourages improvements to web sites. The eGovernment Monitoring (eGovMon) project has collaborated with a group of Norwegian municipalities for more than two years, achieving encouraging results through a combination of evaluations and consultancy.
Initially, eGovMon used benchmarking tools based on the Unified Web Evaluation Methodology 1.2 (UWEM), which describes conformance evaluations and large scale benchmarking for WCAG 1.0. However, since WCAG 1.0 was superseded by WCAG 2.0, the eGovMon approach (including the implemented tests) had to be updated to accommodate the new guidelines.
This paper describes the requirements and challenges identified during the update of the metrics and reporting functions for WCAG 2.0 benchmarking. The core part of the reporting is the score function, which summarises the accessibility status of a web page or site into a single number.
First we look at the requirement perspective: What are the desirable properties of an accessibility score function? Then we take a WCAG 2.0 specific view with special consideration to the new properties of WCAG 2.0 (as compared to WCAG 1.0). Beyond that, the final section of the paper presents some ideas for developing a unified WCAG 2.0 score function, which would allow the comparison of WCAG 2.0 evaluations carried out by different tools.
Ideally, not only should results from different tools be comparable; it is also desirable to gain more insight into the comparability of expert evaluations and automated tests. However, this topic is beyond the scope of this paper.
During the work on UWEM 1.2 a process for indicator requirement analysis was established. First, all potential properties were collected and grouped according to the parts of the evaluation process. There are requirements for crawling and sampling, requirements addressing mathematical and statistical properties of the score, and requirements that describe how the score reflects certain features of the web content. Afterwards, a theoretical analysis investigated the dependencies and selected a set of non-conflicting properties for the score function. Finally, several suggested score functions were compared with regard to how well they meet the properties, and the best candidate function was chosen. This process and its outcomes are described in the UWEM Indicator Refinement Report. Because of the good experience with this process, we also applied it to the case of WCAG 2.0. The structural differences between WCAG 1.0 and 2.0 make it necessary to revise some of the requirements. Section 3 looks into this aspect.
Vigo and Brajnik analysed the desirable properties of web metrics for various application scenarios. They also present some quality attributes for benchmarking. The most relevant items for the score function are low sensitivity towards small changes in the web page and adequacy of the scale and range of the score values.
The score function should be tailored to the structure of the test set (in this case WCAG 2.0). Therefore we start out the design of the score function with an analysis of WCAG 2.0.
The WCAG 1.0 tests (as defined in UWEM 1.2) are independent: failure of a test also means failure of a WCAG 1.0 Checkpoint. The structure of WCAG 2.0 is different. The Techniques, with their detailed test procedures, provide a natural starting point for the implementation of an evaluation tool. But in the presentation of results the dependencies between the Techniques must be taken into account. On the one hand, there are Common Failures which directly cause the web content to fail a Success Criterion (SC). On the other hand, conclusions from Sufficient Techniques can only be drawn if their logical combinations are considered. We suggest the following approach to derive an implementation of tests for a Success Criterion:
Moreover, some Techniques are used by several Success Criteria. For these reasons an interpretation of the results below the level of Success Criteria is not meaningful. Success Criteria are selected as the first level of aggregation.
This is a major difference from UWEM 1.2, in which the number of tests per WCAG 1.0 Checkpoint varies widely. There, each test contributes equally to the score, so Checkpoints with many tests are over-represented in the result. Using Success Criteria as an intermediate aggregation level has several further advantages. The priority Level of the Success Criterion can be included in the score. The influence of Success Criteria with many Techniques is balanced. In an automated tool it becomes easy to highlight which Success Criteria need human judgement or were not tested.
The approach has two disadvantages. First, the number of instances of a specific feature (such as form controls) does not influence the score. Second, if a tool does not implement all Techniques related to an SC, no conclusion can be drawn for that SC.
We suggest the following score function, which takes the above considerations into account. The SC-level result for page p is defined as one minus the ratio of instances where tests for Success Criterion c failed on page p, that is, 1 - f_c(p) / n_c(p). (Here f_c(p) denotes the number of instances where tests for c failed, and n_c(p) denotes the number of all instances where tests for c were applied.)
The tests are designed in such a way that each test results in either pass or fail. In some cases an additional warning message is returned to provide further information for the detailed accessibility report.
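The SC-level result described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the eGovMon implementation: the function name `sc_result` is hypothetical, and returning `None` when no tests were applied is an assumption made here to represent a Success Criterion that was not tested.

```python
from typing import Optional


def sc_result(failed: int, applied: int) -> Optional[float]:
    """SC-level result for one Success Criterion c on one page p.

    failed  corresponds to f_c(p): instances where tests for c failed.
    applied corresponds to n_c(p): all instances where tests for c were applied.
    """
    if failed < 0 or applied < 0 or failed > applied:
        raise ValueError("expected 0 <= failed <= applied")
    if applied == 0:
        # Assumption: with no applied tests the criterion counts as
        # "not tested", so no numeric result is reported.
        return None
    # One minus the ratio of failed instances.
    return 1.0 - failed / applied


# Example: 1 failed instance out of 4 applied tests.
example = sc_result(1, 4)
```

A result of 1.0 means that no applied test failed for the criterion, and 0.0 means that every applied test failed.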
The page score S is calculated as the average of the SC-level page-results.
The page score can be refined by replacing the simple average with a weighted average which includes parameters to capture the priority of the Success Criteria.
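The following Python sketch illustrates both the simple and the weighted average. It is a sketch under stated assumptions: the function name `page_score`, the example results and the weight values (2 for Level A, 1 for Level AA) are hypothetical and chosen only for illustration; the paper does not prescribe concrete weights.

```python
from typing import Dict, Optional


def page_score(sc_results: Dict[str, float],
               weights: Optional[Dict[str, float]] = None) -> float:
    """Page score S: average of the SC-level results of one page.

    sc_results maps a Success Criterion id to its SC-level result in [0, 1].
    weights optionally maps the same ids to priority weights (hypothetical).
    """
    if not sc_results:
        raise ValueError("no Success Criterion was tested on this page")
    if weights is None:
        # Simple average: every Success Criterion contributes equally.
        return sum(sc_results.values()) / len(sc_results)
    total_weight = sum(weights[c] for c in sc_results)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    # Weighted average: criteria with higher weights influence S more.
    weighted_sum = sum(weights[c] * r for c, r in sc_results.items())
    return weighted_sum / total_weight


# Hypothetical SC-level results for one page.
results = {"1.1.1": 0.5, "2.4.2": 1.0, "1.4.3": 0.0}
# Hypothetical weights: Level A criteria count twice as much as Level AA.
level_weights = {"1.1.1": 2.0, "2.4.2": 2.0, "1.4.3": 1.0}

simple = page_score(results)
weighted = page_score(results, level_weights)
```

With these illustrative numbers the simple average is 0.5 and the weighted average is 0.6, because the failing Level AA criterion carries less weight than the two Level A criteria.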
The eGovMon project is developing an online checker for WCAG 2.0 that uses the new page score function. Initial experiments show that the results of the function meet the requirements and are understandable for the potential users of the checker. We have also started the work on a web site score function. This involves addressing the following open questions:
Finally, further testing of the checker tool is planned to ensure that the score function meets the main "soft requirement": the score value must make sense when presented to the users.
Large scale benchmarking of web accessibility often relies on tools due to resource limitations. Although a number of tools claim to check according to WCAG 2.0, their results are still not comparable.
This problem is mainly caused by the varying granularity of tests (some tools implement several tests per Success Criterion while others have only one) and by differences in counting the instances (some tools count every checked HTML element while others count each instance only once). The tools also differ in how outcomes are grouped into categories such as "error", "potential error" and "warning". Some tools only report the absolute numbers for each outcome category, while others use some kind of score function. Presently, there is no generally accepted practice for reporting WCAG 2.0 evaluation results.
Another reason lies in the WCAG 2.0 documents themselves. Sometimes the aggregation of results from the Techniques is not documented well. Alonso et al. describe the consequences of this challenge:
"This could lead to a situation where different evaluators use different aggregation strategies and thus produce different evaluation results."
A first simple step to increase the comparability of the results from different tools would be the introduction of aggregation on the level of Success Criteria, as suggested in this paper.
To define a truly unified WCAG 2.0 score and thus achieve actual inter-tool reliability, as demanded by Vigo and Brajnik, a dedicated collaboration between tool developers and researchers would be necessary to address the following tasks:
To start with, the validity of such an approach is ensured by the strong link with WCAG 2.0. Additional proof of the validity of automated testing could be established in an experiment similar to the one described by Casado Martínez et al.
The eGovMon project is co-funded by the Research Council of Norway under the VERDIKT program. Project no.: VERDIKT 183392/S10.