This extended abstract is a contribution to the Online Symposium on Website Accessibility Metrics. The contents of this paper have not been developed by the W3C Web Accessibility Initiative (WAI) and do not necessarily represent the consensus view of its membership.
The Web Accessibility Quantitative Metric (hereafter, WAQM) addresses the need to measure accessibility conformance more precisely than the qualitative levels proposed by WCAG 1.0 (0, A, AA, AAA). WAQM is an automatic metric, meaning that no human intervention is required to compute it. Since it relies solely on reports produced by accessibility testing tools, it is designed to take maximum advantage of evaluation report data.
In the early 2000s, Web quality frameworks were proposed to establish the criteria that websites should satisfy in order to guarantee their quality (see Mich et al. (2003) and Olsina and Rossi (2002)). Web accessibility was considered in these models, although no clear means or methods were provided to quantify it. Some of the pioneering work was put forward by González et al. (2003), following the aforementioned quality models and metrics. However, one of the first accessibility metrics was the failure-rate by Sullivan and Matson (2000). To some extent, this intuitive metric has been borrowed by most metrics proposed later.
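As an illustration, the failure-rate idea can be sketched in a few lines of Python. The function name and the page figures below are hypothetical; the formula follows Sullivan and Matson's intuition of dividing actual points of failure by potential points of failure:

```python
def failure_rate(actual_failures: int, potential_failures: int) -> float:
    """Failure rate in the spirit of Sullivan and Matson (2000):
    actual points of failure divided by potential points of failure."""
    if potential_failures == 0:
        return 0.0  # no checkpoint applies, so nothing can fail
    return actual_failures / potential_failures

# Hypothetical page: 120 elements where some checkpoint applies, 18 violations.
print(failure_rate(18, 120))  # 0.15
```

The result lies in [0, 1], which is what makes the metric a convenient building block for later, more elaborate metrics.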
The goal of WAQM is not only to yield numeric scores but also to take into consideration the following aspects:
A panel of experts evaluated the accessibility of 14 pages and gave each a numeric score. We found a positive correlation between their scores and WAQM, r(14) = 0.55, p < 0.05. Regarding reliability, we observed the behaviour of the metric in terms of the reproducibility and consistency of scores when different testing tools were used: LIFT and EvalAccess. 1363 pages from 15 sites were evaluated and then measured with WAQM. A very strong rank correlation between sites, r(15) = 0.74, and between all pages, r(1363) = 0.72, was obtained, although no correlation was found between absolute scores (Vigo et al., 2007).
Therefore we tuned WAQM to obtain tool reliability not only for ranks but also for absolute scores. In particular, we manipulated the function, and specifically its parameters, that spreads out the failure-rate. In Vigo et al. (2009) we proposed a method to obtain the best spreading function for each tool in order to attain inter-tool reliability. The method consists of testing all permutations of the parameters of the spreading function to find the highest correlation between scores produced by LIFT and EvalAccess, while still keeping a strong correlation for rankings.
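The parameter search can be sketched as an exhaustive grid search. Everything here is illustrative: the exponential `spread` function is only a stand-in for WAQM's actual spreading function, and the two tools' failure rates are made-up data; what mirrors the method is the loop that tries every parameter combination and keeps the one maximizing the Pearson correlation between the tools' absolute scores:

```python
import itertools
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient (no external dependencies)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def spread(fr, a, b):
    """Hypothetical spreading function: maps a failure rate in [0, 1]
    to a 0-100 score. A stand-in, NOT WAQM's actual function."""
    return 100 * math.exp(-a * fr ** b)

# Made-up failure rates reported by two tools for the same five pages.
tool1 = [0.05, 0.10, 0.20, 0.40, 0.60]
tool2 = [0.10, 0.18, 0.35, 0.55, 0.80]

# Candidate (a, b) parameter pairs; each tool gets its own pair.
grid = list(itertools.product([1, 2, 5], [0.5, 1.0, 2.0]))

# Keep the per-tool parameter pairs whose absolute scores correlate best.
best = max(
    itertools.product(grid, grid),
    key=lambda p: pearson([spread(f, *p[0]) for f in tool1],
                          [spread(f, *p[1]) for f in tool2]),
)
print("best parameters (tool1, tool2):", best)
```

In the actual method a candidate would also have to preserve the strong rank correlation; the sketch omits that constraint for brevity.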
Finally, we tested how WAQM behaved when Web pages were evaluated against another set of guidelines, WCAG 2.0 (Aizpurua et al., 2011); hitherto, WCAG 1.0 had been used and results had been mapped to WCAG 2.0 guidelines. Again, 1449 pages were evaluated and WAQM was computed on them. Results show positive correlations, Pearson's r(1449) = 0.38, p < 0.001, and Spearman's r(1449) = 0.50, p < 0.001. The weak correlation in the former case means that the absolute accessibility scores obtained differ considerably. Given the difference between mean scores (M1 = 77.22 for WCAG 1.0, M2 = 25.01 for WCAG 2.0), we can conclude that WCAG 2.0 is more demanding than WCAG 1.0. Since the nature of the guidelines is different, no conclusion can be drawn with respect to metric reliability.
The main difficulties consisted in setting up the infrastructure to run the validity and reliability tests. Data collection was not straightforward either. We believe that this calls for establishing methods to standardize such procedures and for a common set of test pages, so that the performance of different metrics can be compared.
The empirical data presented above suggest the validity of WAQM. Full reliability can be attained if a tuning process is followed, at least for the EvalAccess and LIFT tools; otherwise, WAQM is only reliable for ordinal or ranking values. A number of questions arose while WAQM was being designed and validated. Most of them pointed to the need for a framework to compare existing metrics. Only in this way can the validity of a given metric be measured and its performance compared with that of other metrics (Vigo and Brajnik, 2011).
WAQM was deployed in search engines, and this allowed us to test the accessibility of the results provided by state-of-the-art search engines (Vigo et al., 2009b). It remains to be investigated how users perceive search engine results that are arranged according to their accessibility.