This paper is a contribution to the Website Accessibility Metrics Symposium. It was not developed by the W3C Web Accessibility Initiative (WAI) and does not necessarily represent the consensus view of W3C staff, participants, or members.
Attaining Metric Validity and Reliability with the Web Accessibility Quantitative Metric
1. Problem Addressed
The Web Accessibility Quantitative Metric (hereafter, WAQM) is designed to measure accessibility conformance more precisely than the qualitative scale proposed by WCAG 1.0 (0, A, AA, AAA). WAQM is an automatic metric: no human intervention is required to compute it. Since it relies solely on reports produced by accessibility testing tools, it is designed to take maximum advantage of evaluation report data.
In the early 2000s, Web quality frameworks were proposed to establish the criteria that websites should satisfy in order to guarantee their quality (see Mich et al. (2003) and Olsina and Rossi (2002)). Web accessibility was considered in these models, although no clear means or methods were provided to quantify it. Some of the pioneering work was put forward by González et al. (2003), following the aforementioned quality models and metrics. However, one of the first accessibility metrics was the failure-rate by Sullivan and Matson (2000). To some extent, this intuitive metric has been borrowed by most metrics proposed since.
The goal of WAQM is not only to yield numeric scores but also to take into consideration the following aspects:
- In order to measure conformance in percentage terms, the scores yielded by the metric have to be normalized. To do so, the ratio between potential failure-points and actual violations, that is, the failure-rate, is computed for all checkpoints that can be tested automatically. Potential failure-points tend to be very numerous compared to actual errors, leading to low failure-rates and high accessibility scores. As a side effect, Web pages obtain similar accessibility scores, making it difficult to distinguish between them. Each failure-rate is therefore spread out to achieve greater discriminative power between scores. Then, all failure-rates are aggregated according to the WCAG 2.0 principle they belong to: Perceivable, Operable, Understandable and Robust. In this way, WAQM produces one score for each principle in addition to an overall score.
- The severity of checkpoint violations is considered according to WCAG 1.0 priorities, so each failure-rate is weighted by this severity. The weight values were calibrated by building test files with a predetermined rate of accessibility violations: different combinations of weights were tested, and those producing scores closest to the known accessibility ratio were chosen. In our case: 0.80 for priority 1, 0.16 for priority 2 and 0.04 for priority 3 checkpoints.
- In order to obtain a bigger picture and capture as much information as possible from the evaluation report, the failure-rate of semi-automatic issues, that is, warnings that require human verification, is estimated. We tested a set of real Web pages and grouped failure-rates according to each test's WCAG 2.0 principle (P, O, U, R) and WCAG 1.0 priority, obtaining 12 groups of failure-rates for both automatic and semi-automatic accessibility tests. We found a positive correlation between the failure-rates of paired groups. As a result, once the failure-rate of a specific group of automatic tests is known, it is extrapolated to its paired group of semi-automatic tests.
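The pipeline described in these steps can be sketched in a few lines of Python. The exponential spreading form, the input format, and the aggregation by simple averaging are illustrative assumptions; only the priority weights (0.80, 0.16, 0.04) come from the text.

```python
from collections import defaultdict

# Priority weights from the text; everything else in this sketch
# (input format, spreading form, averaging) is an assumption.
PRIORITY_WEIGHTS = {1: 0.80, 2: 0.16, 3: 0.04}

def failure_rate(violations, potential):
    # Ratio of actual violations to potential failure-points.
    return violations / potential if potential else 0.0

def spread(rate, a=20.0):
    # Hypothetical spreading function: stretches the typically low
    # failure-rates so that pages become easier to tell apart.
    return 1 - (1 - rate) ** a

def waqm(results):
    # results: iterable of (principle, priority, violations, potential),
    # where principle is one of "P", "O", "U", "R".
    per_principle = defaultdict(list)
    for principle, priority, violations, potential in results:
        r = spread(failure_rate(violations, potential))
        per_principle[principle].append(PRIORITY_WEIGHTS[priority] * r)
    # One 0-100 score per principle, plus an overall average.
    scores = {p: 100 * (1 - sum(rs) / len(rs))
              for p, rs in per_principle.items()}
    scores["overall"] = sum(scores.values()) / len(scores)
    return scores
```

A page with no violations on its automatic tests scores 100 under this sketch, while violations are discounted in proportion to their priority weight.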
3. Determining the Validity and Reliability of WAQM
A panel of experts evaluated the accessibility of 14 pages and gave each a numeric score. We found a positive correlation between their scores and WAQM scores, r(14) = 0.55, p < 0.05. Regarding reliability, we observed the behaviour of the metric in terms of the reproducibility and consistency of scores when different testing tools were used: LIFT and EvalAccess. 1,363 pages from 15 sites were evaluated and then measured with WAQM. A very strong rank correlation between sites, r(15) = 0.74, and between all pages, r(1363) = 0.72, was obtained, although no correlation was found between absolute scores (Vigo et al., 2007).
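The distinction between rank agreement and absolute-score agreement can be made concrete with a small self-contained computation; the scores below are invented for illustration and are not the study's data.

```python
# Pearson measures agreement of absolute values; Spearman measures
# agreement of rankings (Pearson computed over ranks).
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def spearman(xs, ys):
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    return pearson(ranks(xs), ranks(ys))

# Invented scores for the same five pages from two hypothetical tools:
# the ordering is identical (Spearman is 1.0) but the absolute values
# disagree, so Pearson is noticeably lower.
tool_a = [92, 88, 75, 60, 40]
tool_b = [55, 52, 48, 20, 5]
```

This naive ranking ignores ties, which is fine for the illustrative data here; it shows how two tools can rank pages identically while disagreeing on absolute scores.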
Therefore, we tuned WAQM to obtain inter-tool reliability not only for ranks but also for absolute scores. In particular, the function, and specifically its parameters, that spreads out the failure-rate was manipulated. In Vigo et al. (2009) we proposed a method to obtain the best spreading function for each tool in order to attain inter-tool reliability. The method consists of testing all permutations of the parameters of the spreading function to obtain the highest correlation between scores produced by LIFT and EvalAccess while keeping a strong correlation between rankings.
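The tuning method can be sketched as an exhaustive search over spreading-function parameters. The exponential form, the single-parameter setup and the candidate grid below are assumptions for illustration, not the actual WAQM function.

```python
def pearson(xs, ys):
    # Pearson correlation, used here to compare absolute scores.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def spread(rate, a):
    # Hypothetical spreading function with one tunable parameter `a`.
    return 1 - (1 - rate) ** a

def tune(tool_rates, ref_scores, grid=(1, 2, 5, 10, 20, 50)):
    # Try every candidate parameter for one tool and keep the one whose
    # scores correlate best, in absolute terms, with the reference tool.
    best_a, best_r = None, -1.0
    for a in grid:
        scores = [100 * (1 - spread(r, a)) for r in tool_rates]
        r = pearson(scores, ref_scores)
        if r > best_r:
            best_a, best_r = a, r
    return best_a, best_r
```

Because the spreading function is monotone, rank correlation is preserved for any parameter choice in this sketch; the search only improves absolute agreement.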
Finally, we tested how WAQM behaved when Web pages were evaluated against another set of guidelines, WCAG 2.0 (Aizpurua et al., 2011). Note that hitherto, WCAG 1.0 had been used and results had been mapped to WCAG 2.0 guidelines. Again, 1,449 pages were evaluated and WAQM was computed on them. Results show positive correlations, Pearson's r(1449) = 0.38, p < 0.001, and Spearman's r(1449) = 0.50, p < 0.001. The weak correlation in the former case means that the obtained accessibility scores are quite different. Given the difference in their means (M1 = 77.22, M2 = 25.01), we can conclude that WCAG 2.0 is more demanding than WCAG 1.0. As the nature of the guidelines differs, no conclusion can be drawn with respect to metric reliability.
4. Major Difficulties
Difficulties consisted of setting up the infrastructure to run the validity and reliability tests. Data collection was not straightforward either. We believe that this calls for establishing methods to standardize such procedures and for a common set of test pages so that the performance of different metrics can be compared.
The empirical data shown above support the validity of WAQM. Full reliability can be attained if a tuning process is followed, at least for the EvalAccess and LIFT tools; otherwise, WAQM is only reliable for ordinal or ranking values. A number of questions arose while WAQM was designed and validated. Most of them pointed to the need for a framework to compare existing metrics. Only in this way can the validity of a given metric be measured and its performance compared with that of other metrics (Vigo and Brajnik, 2011).
6. Open Research Avenues
WAQM was deployed in search engines, which allowed us to test the accessibility of the results provided by state-of-the-art search engines (Vigo et al., 2009b). It remains to be investigated how users perceive search engine results that are arranged according to their accessibility.
References
- A. Aizpurua, M. Arrue, M. Vigo and J. Abascal (2011). Validating the effectiveness of EvalAccess when deploying WCAG 2.0 tests. Universal Access in the Information Society 10(4): 425-441
- J. González, M. Macías, R. Rodríguez and F. Sánchez (2003). Accessibility Metrics of Web Pages for Blind End-Users. Web Engineering, LNCS 2722: 374-383
- L. Mich, M. Franch and L. Gaio (2003). Evaluating and Designing Web Site Quality. IEEE Multimedia 10(1): 34-43
- L. Olsina and G. Rossi (2002). Measuring Web Application Quality with WebQEM. IEEE Multimedia 9(4): 20-29
- T. Sullivan and R. Matson (2000). Barriers to use: usability and content accessibility on the Web's most popular sites. ACM Conference on Universal Usability, CUU'00: 139-144
- M. Vigo, M. Arrue, G. Brajnik, R. Lomuscio and J. Abascal (2007). Quantitative Metrics for Measuring Web Accessibility. International Cross-Disciplinary Conference on Web Accessibility, W4A'07: 99-107
- M. Vigo, G. Brajnik, M. Arrue and J. Abascal (2009). Tool Independence for the Web Accessibility Quantitative Metric. Disability and Rehabilitation: Assistive Technology 4(4): 248-263
- M. Vigo, M. Arrue and J. Abascal (2009b). Enriching Information Retrieval Results with Web Accessibility Measurement. Journal of Web Engineering 8(1): 3-24
- M. Vigo and G. Brajnik (2011). Automatic web accessibility metrics: where we are and where we can go. Interacting with Computers 23(2): 137-155