This paper is a contribution to the Website Accessibility Metrics Symposium. It was not developed by the W3C Web Accessibility Initiative (WAI) and does not necessarily represent the consensus view of W3C staff, participants, or members.
Integration of Web Accessibility Metrics into a Semi-Automatic evaluation process
1. Problem Addressed
Many authors have analyzed and developed accurate metrics that express the results of Web accessibility evaluations quantitatively. These metrics, especially the ones associated with semi-automatic processes, require additional effort because they compute barriers rather than compliance with accessibility guidelines. In addition, several software and web-based applications, widely used in the software industry and in education, check accessibility compliance against the WCAG 2.0 guidelines and could easily have a related metric. However, those automatic tools introduce false positives that add noise to later measurement processes.
This study addresses the integration of several Web accessibility metrics into a semi-automatic measuring process performed by a prototype application that checks the accessibility of a website against WCAG checkpoints. The idea behind this research is to implement some of the existing accessibility metrics and compute them automatically in the software, in order to analyze the contribution of this approach in a real scenario with existing Web pages. As automatic evaluations are not as reliable as the ones that incorporate human judgment, the proposed tool is also designed to let users filter the results before the metrics are generated.
The goals of this study are:
- Integrate metrics into evaluation processes, and calculate them automatically.
- Create a more efficient accessibility evaluation method based on semi-automatic assessment.
- Reduce later manual testing, which is expensive, by inferring the accessibility barriers from automated evaluations.
- (Additional goal) Simplify how human judgment is applied, by means of a simple process and a fluent user interface.
This study is based on previous research on accessibility measurement, especially the automatic and semi-automatic experiments performed in  and . The quantitative analysis of the relevant metrics is based on , while the WCAG analysis is taken from .
A software application prototype named 'OceanAcc', which implements a semi-automatic accessibility evaluation process, was developed to address the problem of metric calculation. The OceanAcc tool runs an automatic test of the WCAG 2.0 guidelines using ATutor Web services. Test results are stored, and the tool automatically matches each of them with a corresponding barrier by using G. Brajnik's Barrier Walkthrough relationship. Then, the evaluators filter the results by removing false checkpoint and barrier violations, and by adding the violations that the automatic test missed (false negatives). With this reduced set of results, the tool generates and stores the metrics of a specific site, and it also tracks their history.
The following metrics were selected for the study:
- WAB Score
- Failure rate 
- UWEM Score
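As a minimal sketch, two of these metrics can be written in a few lines. It assumes the commonly cited forms: Failure Rate as the ratio of actual to potential failure points, and WAB Score as the per-page average of violation ratios multiplied by a checkpoint weight. The names `Violation`, `failure_rate`, and `wab_score`, the input shapes, and the sample weights are illustrative assumptions, not the OceanAcc implementation. UWEM Score is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class Violation:
    checkpoint: str   # checkpoint identifier, e.g. "1.1.1"
    actual: int       # violations found on the page
    potential: int    # failure points where the checkpoint could be violated
    weight: float     # assumed weight derived from the checkpoint priority


def failure_rate(actual: int, potential: int) -> float:
    """Ratio of actual failures to potential failure points."""
    if potential <= 0:
        return 0.0  # nothing on the page can fail
    return actual / potential


def wab_score(pages: list[list[Violation]]) -> float:
    """Average over pages of the weighted violation ratios (assumed form)."""
    if not pages:
        return 0.0
    total = 0.0
    for violations in pages:
        for v in violations:
            if v.potential > 0:
                total += (v.actual / v.potential) * v.weight
    return total / len(pages)


if __name__ == "__main__":
    site = [
        [Violation("1.1.1", actual=2, potential=4, weight=3.0)],
        [Violation("1.4.3", actual=1, potential=2, weight=1.0)],
    ]
    print(failure_rate(3, 12))
    print(wab_score(site))
```

Under these assumptions, a lower value indicates fewer weighted violations for both metrics.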
Steps to evaluate accessibility, using the tool:
- Evaluators select a Web page and provide additional information: for instance, the number of images and videos. These values are used to estimate the number of failure points.
- Evaluators select the threshold level (A, AA, AAA).
- A checkpoint conformance test is run against the WCAG 2.0 guidelines through the AChecker RESTful Web service, and the tool retrieves the results.
- Test results are tracked and matched with accessibility barriers.
- The tool displays on screen a list of the checkpoints violated and the barriers found.
- Evaluators filter the items in the result list. They have to filter both lists, as each of them affects the computing of different metrics.
- Evaluators can add checkpoint violations that automatic processes do not check (for instance, readability).
- Metrics are computed and stored, using the filtered lists of violations and barriers.
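The filtering steps above could be sketched as follows. This is only an illustration of the idea: the `Finding` record, its fields, and the `filter_findings` function are hypothetical names, and the sample findings are invented for the example rather than taken from an actual AChecker response.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    checkpoint: str            # WCAG 2.0 checkpoint identifier
    element: str               # where the problem was reported (hypothetical field)
    source: str = "automatic"  # "automatic" or "manual"


def filter_findings(automatic, false_positives, manual_additions):
    """Drop the findings the evaluator rejected, then append manual ones."""
    kept = [f for f in automatic if f not in false_positives]
    kept.extend(manual_additions)
    return kept


if __name__ == "__main__":
    reported = [
        Finding("1.1.1", "img#logo"),
        Finding("1.4.3", "p.intro"),
    ]
    # The evaluator judges the contrast report to be a false positive
    # and adds a readability problem that the automatic test cannot detect.
    rejected = {Finding("1.4.3", "p.intro")}
    added = [Finding("3.1.5", "main", source="manual")]
    for finding in filter_findings(reported, rejected, added):
        print(finding)
```

Metrics would then be computed from the returned list only, so that rejected findings do not add noise.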
One of the challenges of integrating different metrics into a single process is that, even if many of them are based on the assessment of WCAG 2.0 checkpoints, they require additional information that is external to the evaluation results. For instance, the Failure Rate metric requires the number of failures that a specific website can have. For that purpose, the tool prototype produces an estimate of the existing failure points by prompting the user for the number of specific elements on the website (images, videos, etc.). For semi-automatic metrics based on barriers, the tool provides a matrix that matches each checkpoint failure to one or more barriers. Additionally, many metrics adjust each failure with a weight to control its impact on the results. This is also considered when performing the barrier-checkpoint mapping.
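A weighted barrier-checkpoint matrix of this kind could be represented as a nested dictionary. The sketch below is an assumption about the data structure only: the barrier names, the weights, and the `barrier_scores` function are invented for illustration and do not reproduce the mapping used in the study.

```python
# Hypothetical matrix: checkpoint -> {barrier name: weight}.
BARRIER_MATRIX = {
    "1.1.1": {"images without text alternatives": 1.0},
    "1.4.3": {"insufficient visual contrast": 0.5},
    "2.4.4": {
        "ambiguous links": 0.5,
        "images without text alternatives": 0.25,
    },
}


def barrier_scores(violated_checkpoints, matrix=BARRIER_MATRIX):
    """Accumulate a weighted score per barrier from the violated checkpoints."""
    scores = {}
    for checkpoint in violated_checkpoints:
        # Checkpoints without a mapping contribute to no barrier.
        for barrier, weight in matrix.get(checkpoint, {}).items():
            scores[barrier] = scores.get(barrier, 0.0) + weight
    return scores


if __name__ == "__main__":
    print(barrier_scores(["1.1.1", "2.4.4"]))
```

Keeping the weights inside the matrix means a single lookup yields both the matched barriers and their impact on a barrier-based metric.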
4. Major Difficulties
The major difficulties encountered during the integration were related to adapting results that came from an automatic input. The most relevant issues are detailed in the following list:
- Noise introduced by users with different knowledge of the process: When users are not familiar with some of the reported issues, they leave some false positives in the list of issues. This effect is more pronounced in long tests.
- Threshold criteria and tool accuracy: Checkpoints at the highest level (AAA) are harder for most tools to test, and many of them are omitted from automatic testing. The metrics are therefore limited to the checkpoints that automatic tests can assess and, consequently, require an adaptation. However, this restriction can be reduced by letting users add these checkpoint verifications manually.
- How the extra parameters required in the metric formula are calculated: External data can introduce a significant bias. For instance, Failure Rate requires the number of potential failures on a page, which is calculated manually. To compute the metric automatically, that number was estimated by enumerating the elements that could fail.
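The enumeration mentioned in the last item can be approximated from the page markup itself. The following sketch is an assumption, not the prototype's method (the prototype prompts the user for these numbers): it counts a hypothetical set of tags with Python's standard `html.parser` and uses the total as the number of potential failure points.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumed set of element types that can produce a failure.
TRACKED_TAGS = {"img", "video", "a", "input"}


class ElementCounter(HTMLParser):
    """Counts the start tags of the tracked element types."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag in TRACKED_TAGS:
            self.counts[tag] += 1


def count_elements(html: str) -> Counter:
    parser = ElementCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def failure_rate_from_markup(actual_failures: int, html: str) -> float:
    """Failure Rate with the potential failures estimated from the markup."""
    potential = sum(count_elements(html).values())
    return actual_failures / potential if potential else 0.0


if __name__ == "__main__":
    page = '<p><img src="a.png"><img src="b.png" alt="b"/><video></video></p>'
    print(count_elements(page))
    print(failure_rate_from_markup(1, page))
```

Such an estimate is only as good as the chosen tag set, which is one way the bias described above can enter the metric.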
After running this experiment, the Failure Rate, WAB Score, UWEM Score, and False Positive rate metrics were computed from an automatic result set in a timely manner. A small set of websites was selected to run the test, consisting of the ACM, Yahoo Ar, and Google home pages. The main empirical conclusions of this study are the following:
- The results from an automatic evaluation of accessibility have to be filtered before computing any metric, as many of the issues reported are not valid. Tools should consider this issue before processing the data.
- Metrics are useful for comparing pages, and computing them automatically adds valuable information to the existing tools.
- Manual barrier-based processes can be mapped to automatic checkpoint processes only to obtain an estimate of the results, because the matching process is not yet accurate enough.
Integrating metrics into accessibility evaluation processes provides valuable information on the accessibility status of each page, and also enables a quantitative comparison among sites. However, in order to integrate metrics accurately, some open issues have to be solved. First, the extra parameters that are required to calculate metrics should be generated automatically to prevent any bias. As for the variability of human criteria, processes should aid the less experienced evaluators. For instance, tools could track the decision history and suggest the most compelling alternatives based on the probability of occurrence, or some other pattern.
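The decision-history idea could look like the sketch below. It assumes the simplest possible pattern, the observed share of past reports of a checkpoint that evaluators rejected as false positives. The class `DecisionHistory`, its methods, and the 0.5 threshold are hypothetical.

```python
from collections import defaultdict


class DecisionHistory:
    """Tracks how often evaluators rejected reports of each checkpoint."""

    def __init__(self):
        self._seen = defaultdict(int)
        self._rejected = defaultdict(int)

    def record(self, checkpoint: str, was_false_positive: bool) -> None:
        self._seen[checkpoint] += 1
        if was_false_positive:
            self._rejected[checkpoint] += 1

    def false_positive_rate(self, checkpoint: str):
        seen = self._seen.get(checkpoint, 0)
        if seen == 0:
            return None  # no earlier decisions to learn from
        return self._rejected.get(checkpoint, 0) / seen

    def suggest(self, checkpoint: str, threshold: float = 0.5) -> str:
        rate = self.false_positive_rate(checkpoint)
        if rate is None:
            return "no history"
        if rate >= threshold:
            return "likely false positive"
        return "likely valid"


if __name__ == "__main__":
    history = DecisionHistory()
    for rejected in (True, True, False):
        history.record("1.4.3", rejected)
    print(history.suggest("1.4.3"))
    print(history.suggest("2.4.4"))
```

The suggestion is only a hint for the evaluator, who keeps the final decision on each reported issue.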
Finally, it would be interesting to explore the barrier-checkpoint mapping approach because evaluating checkpoints consumes fewer resources, while barrier metrics are more meaningful and can be easily matched with additional data. For instance, they can give a better idea of which groups of people with disabilities are really being affected by the low accessibility of a website.
To summarize, this experiment showed that metrics are useful from a quality assurance perspective. Integrating metrics into software tools is promising, and should result in a more efficient computing process.
6. Open Research Avenues
The resulting insights open research avenues that will contribute to an enhanced metric integration with less human intervention. The approach worked as expected for the set of WCAG checkpoints that automatic tools can test easily.
However, the most complex checkpoints, which usually belong to the AAA level, still require human intervention and an adaptation of the process so that they can contribute to the results without introducing considerable noise.
To deal with these WCAG checkpoints and facilitate the automatic calculation of metrics, the following approach is proposed (to be analyzed in a future study). Metrics should be categorized by process level to allow a better integration with tools:
- Basic Level: These metrics are calculated automatically. They only use the checkpoints that can be assessed programmatically and have no ambiguities.
- Semantic Level: These metrics compute the entire set of checkpoints and require human intervention (for instance, to judge the readability of a document).
- Pragmatic Level: These metrics measure to what extent the user experience is similar for all users in all contexts.
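One possible reading of this categorization, given as a hedged sketch: each checkpoint carries a flag that states whether it can be assessed programmatically, and the level of a metric decides which checkpoints it may use. The names `MetricLevel`, `Checkpoint`, and `checkpoints_for`, the sample flags, and the treatment of the pragmatic level are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class MetricLevel(Enum):
    BASIC = "basic"          # computed automatically
    SEMANTIC = "semantic"    # all checkpoints, with human intervention
    PRAGMATIC = "pragmatic"  # based on the user experience


@dataclass(frozen=True)
class Checkpoint:
    identifier: str
    automatable: bool  # assumed flag: assessable programmatically, no ambiguity


def checkpoints_for(level: MetricLevel, checkpoints):
    """Select the checkpoints that a metric of the given level may use."""
    if level is MetricLevel.BASIC:
        return [c for c in checkpoints if c.automatable]
    if level is MetricLevel.SEMANTIC:
        return list(checkpoints)
    # Assumption: pragmatic metrics need data beyond checkpoint results.
    raise ValueError("pragmatic metrics are not derived from checkpoints alone")


if __name__ == "__main__":
    catalogue = [Checkpoint("1.1.1", True), Checkpoint("3.1.5", False)]
    print(checkpoints_for(MetricLevel.BASIC, catalogue))
    print(checkpoints_for(MetricLevel.SEMANTIC, catalogue))
```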
This research was part of my degree thesis, mentored by Eng. Ba. Osvaldo Clua (Universidad de Buenos Aires, Facultad de Ingenieria, 2010-04-15).
- André P. Freire, Renata P. M. Fortes, Marcelo A. S. Turine, and Debora M. B. Paiva. (2008). An evaluation of web accessibility metrics based on their attributes. In Proceedings of the 26th annual ACM international conference on Design of communication (SIGDOC '08). ACM, New York, NY, USA, 73-80. DOI: 10.1145/1456536.1456551
- Giorgio Brajnik. (2008). Measuring Web Accessibility by Estimating Severity of Barriers. In Proceedings of the 2008 international workshops on Web Information Systems Engineering (WISE '08), Sven Hartmann, Xiaofang Zhou, and Markus Kirchberg (Eds.). Springer-Verlag, Berlin, Heidelberg, 112-121. DOI: 10.1007/978-3-540-85200-1_13
- Giorgio Brajnik. (2008). A comparative test of web accessibility evaluation methods. In Proceedings of the 10th international ACM SIGACCESS conference on Computers and accessibility (Assets '08). ACM, New York, NY, USA, 113-120. DOI: 10.1145/1414471.1414494
- Giorgio Brajnik and Raffaella Lomuscio. (2007). SAMBA: a semi-automatic method for measuring barriers of accessibility. In Proceedings of the 9th international ACM SIGACCESS conference on Computers and accessibility (Assets '07). ACM, New York, NY, USA, 43-50. DOI: 10.1145/1296843.1296853
- Brian Kelly, David Sloan, Stephen Brown, Jane Seale, Helen Petrie, Patrick Lauke, and Simon Ball. (2007). Accessibility 2.0: people, policies and processes. In Proceedings of the 2007 international cross-disciplinary conference on Web accessibility (W4A) (W4A '07). ACM, New York, NY, USA, 138-147. DOI: 10.1145/1243441.1243471
- Greg Gay and Cindy Qi Li. (2010). AChecker: open, interactive, customizable, web accessibility checking. In Proceedings of the 2010 International Cross Disciplinary Conference on Web Accessibility (W4A) (W4A '10). ACM, New York, NY, USA. Article 23, 2 pages. DOI: 10.1145/1805986.1806019
- Bambang Parmanto and Xiaoming Zeng. (2005). Metric for web accessibility evaluation. J. Am. Soc. Inf. Sci. Technol. 56, 13 (November 2005), 1394-1404. DOI: 10.1002/asi.20233