Important note: This Wiki page is edited by participants of the RDWG. It does not necessarily represent consensus and it may have incorrect information or information that is not supported by other Working Group participants, WAI, or W3C. It may also have some very useful information.
Benchmarking Web Accessibility Metrics
This is an internal planning page. Please see the main Website Accessibility Metrics Symposium page
Page author(s): Markel Vigo
Automated Metrics, Evaluation, Benchmarking.
Definition and Background
In the web engineering domain, a metric can be the number of links, the size in KB of an HTML file, the number of users that click on a certain link, or the perceived ease of use of a web page. In the realm of web accessibility, amongst others, a metric can measure the following qualities:
- The number of pictures without an alt attribute;
- The number of Level A success criteria violations;
- The number of possible failure points where accessibility issues can potentially happen;
- The severity of an accessibility barrier;
- Time taken to conduct a task.
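As an illustration of the first, low-level kind of metric in the list above, the following is a minimal sketch that counts images lacking an alt attribute, using only the Python standard library. The class and function names are hypothetical, and the sketch assumes that an empty `alt=""` counts as present, since it is a legitimate way to mark decorative images.

```python
from html.parser import HTMLParser


class MissingAltCounter(HTMLParser):
    """Counts <img> elements and how many of them lack an alt attribute."""

    def __init__(self):
        super().__init__()
        self.images = 0
        self.missing_alt = 0

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names are lower-cased by HTMLParser.
        if tag == "img":
            self.images += 1
            # attrs is a list of (name, value) pairs; alt="" still counts as present.
            if "alt" not in dict(attrs):
                self.missing_alt += 1


def count_missing_alt(html):
    """Return (number of images, number of images without an alt attribute)."""
    counter = MissingAltCounter()
    counter.feed(html)
    counter.close()
    return counter.images, counter.missing_alt


sample = '<p><img src="a.png"><img src="b.png" alt="Logo"><img src="c.png" alt=""></p>'
print(count_missing_alt(sample))  # (3, 1)
```

The pair returned here already provides both a violation count and a count of possible failure points for this single check.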
In order to measure more abstract qualities, higher-level metrics are built upon other metrics, often with additional parameters. For instance, readability metrics take into account the number of syllables, words and sentences to measure the complexity of a text. Similarly, metrics that aim to measure web accessibility have been built on specific qualities. For instance, the failure-rate metric computes the ratio of the number of accessibility violations to the number of failure points. The computation of accessibility metrics can produce different types of data:
- Ordinal values, like WCAG 2.0 conformance levels ("AAA", "AA", "A") or "accessible" vs "non-accessible" scores. The latter can be computed by a metric stating that a web page is accessible only if it has no accessibility barriers; otherwise it is inaccessible.
- Ratio values such as 0, 175, -15, etc.
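The failure-rate metric and the binary "accessible" vs "non-accessible" score described above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and treating a page with no failure points as having a rate of 0 is a choice made for the sketch, not something the text prescribes.

```python
def failure_rate(violations, failure_points):
    """Ratio of actual violations to potential failure points, in [0, 1]."""
    if violations < 0 or failure_points < 0:
        raise ValueError("counts must be non-negative")
    if violations > failure_points:
        raise ValueError("violations cannot exceed failure points")
    if failure_points == 0:
        # Assumption: a page with no failure points has nothing to violate.
        return 0.0
    return violations / failure_points


def is_accessible(violations):
    """Binary score: accessible only if there are no violations at all."""
    return violations == 0


# A page with 10 images (failure points), 3 of which lack a text alternative.
print(failure_rate(3, 10))  # 0.3
print(is_accessible(3))     # False
```

The binary score cannot tell a page with 3 violations from one with 300, which is the lack of sensitivity that ratio scores are meant to address.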
If we define web accessibility as the quality by which any user can operate web content without barriers of any kind, accessibility metrics aiming at measuring this phenomenon can take two approaches:
- Accessibility in terms of conformance to accessibility guidelines;
- Perceived accessibility, as conformance to guidelines does not necessarily entail accessibility in use.
Most existing metrics are of the former type and are mainly built upon criteria produced by automatic testing tools, such as the number of violations and their WCAG priority. Moreover, to overcome the lack of sensitivity and precision of existing ordinal metrics, conformance metrics yield ratio scores. The main reason for the widespread use of these metrics is their low cost in terms of time and human resources. As no human intervention (expert audits or user tests) is required in the process, these methods are called automatic accessibility metrics. This does not necessarily mean that only fully automated success criteria (earl:automatic) are considered. Some metrics estimate the violation rate of semi-automatic success criteria (earl:semiAuto) and purely manual ones (earl:manual), as in Vigo et al. (2007), and others adopt an optimistic vs. conservative approach to their violation rate (Lopes et al., 2010). The error introduced by these estimations and the reliance of automatic metrics on testing tools are their major weaknesses. Automatic metrics therefore inherit the effectiveness shortcomings of the tools, such as false positives, false negatives and specificity issues (Brajnik, 2004). A benchmarking survey of automatic conformance metrics (Vigo and Brajnik, 2011) concluded that existing metrics diverge considerably and that most of them do a poor job of distinguishing accessible pages from non-accessible ones. On the other hand, few metrics combine data from testing tools with data produced by human review; one example is SAMBA (Brajnik and Lomuscio, 2007).
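One way to picture the optimistic vs. conservative treatment of semi-automatic results is as a pair of bounds on the violation rate. The sketch below is an illustration only and does not reproduce the formulas of any cited paper. It assumes that a tool reports, per checkpoint, confirmed failures and warnings that need human review, that the optimistic bound treats no warning as a violation, and that the conservative bound treats every warning as one. All names are hypothetical, and purely manual criteria (earl:manual) are not modelled.

```python
from dataclasses import dataclass


@dataclass
class CheckpointResult:
    failure_points: int  # elements where the checkpoint applies
    failures: int        # violations a tool can confirm on its own (earl:automatic)
    warnings: int        # potential violations that need human review (earl:semiAuto)


def rate_bounds(result):
    """Return (optimistic, conservative) violation rates for one checkpoint."""
    if result.failures + result.warnings > result.failure_points:
        raise ValueError("failures plus warnings cannot exceed failure points")
    if result.failure_points == 0:
        return 0.0, 0.0
    optimistic = result.failures / result.failure_points
    conservative = (result.failures + result.warnings) / result.failure_points
    return optimistic, conservative


def estimated_rate(result, warning_failure_share):
    """Point estimate assuming a fixed share of warnings are real violations."""
    if not 0.0 <= warning_failure_share <= 1.0:
        raise ValueError("warning_failure_share must be between 0 and 1")
    if result.failure_points == 0:
        return 0.0
    expected = result.failures + warning_failure_share * result.warnings
    return expected / result.failure_points


result = CheckpointResult(failure_points=20, failures=2, warnings=6)
print(rate_bounds(result))          # (0.1, 0.4)
print(estimated_rate(result, 0.5))  # 0.25
```

The gap between the two bounds shows how much of the score depends on the estimate rather than on confirmed data.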
The benefits of using metrics
There are some scenarios that can benefit from using accessibility metrics:
- Quality assurance within Web Engineering as a way for developers to precisely know the accessibility level of their artifacts throughout the development cycle.
- Benchmarking as a way to explore, on a large scale, the accessibility level of web pages within a domain or geographical location.
- Information Retrieval systems can implement metrics as one of the criteria to rank web pages. Therefore users would be able to retrieve not only pages that suit their information needs but also those that are accessible.
- Adaptive hypermedia techniques can benefit from metrics to enhance the interface, either to provide guidance or as a criterion for performing adaptations.
The appropriateness of each metric for a given scenario is assessed in Vigo and Brajnik (2011), based on an analysis of how automatic conformance metrics behave on real web pages with respect to validity, reliability, adequacy and complexity, and on the particular requirements of the scenarios above.
Open research questions and ideas
From the background review it can be concluded that there are many unexplored areas for those who want to conduct research on web accessibility metrics. For instance:
- How can we build metrics to measure accessibility in use?
- Since estimating earl:semiAuto results introduces an error rate, how can we reduce this error rate? What sort of techniques can we explore? How do we build an infrastructure (à la IBM Social Accessibility) to store web page audits by experts in advance, so that metrics can benefit from these data?
- How do metrics that measure accessibility in terms of conformance relate to accessibility in use?
- Are there any low-level metrics, such as page size or the number of images without an alt attribute, that are predictors of the accessibility of a web page?
Focusing more precisely on the quality of accessibility metrics, there are still many challenges to pursue. The way a metric satisfies the qualities of validity, reliability, sensitivity, adequacy and complexity remains open and can be addressed through the following questions, drawn from the literature listed in the references:
Validity is the extent to which the measurements obtained by a metric reflect the accessibility of the website to which it is applied. Studies of validity with respect to conformance could focus on the following research questions:
- Does validity of the metric change when we change guidelines?
- Does validity change when we use a subset of the guidelines?
- Does validity depend on the genre of the website?
- Is validity dependent on the type of data being provided by the testing tool?
- Does validity change when we switch the tool used to collect data? And what if we use data produced by merging results of two or more tools, rather than basing the metric on the data of a single tool?
- Are there quick ways to estimate validity of a metric?
The above questions could be addressed in the following ways:
- By having a panel of judges systematically evaluate all the pages using the same guidelines used by the tool(s).
- By artificially seeding web pages with known accessibility problems (i.e. violations of guidelines) and systematically investigating how these known problems affect the metric scores.
- By exploring the impact on validity of manual tests when (1) they are excluded or (2) their effect is estimated.
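The seeding approach in the list above can be sketched as follows: start from a page with no known problems, inject a known number of violations, and check how the metric score responds. This is a deliberately small illustration. The function names are hypothetical, the only defect seeded is a removed alt attribute, and the regular expressions are a shortcut that assumes simple, well-formed markup rather than a robust HTML treatment.

```python
import re

IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
ALT_ATTR = re.compile(r"""\salt\s*=\s*("[^"]*"|'[^']*'|[^\s>]+)""", re.IGNORECASE)


def missing_alt_rate(html):
    """Share of <img> tags without an alt attribute (0.0 if there are no images)."""
    images = IMG_TAG.findall(html)
    if not images:
        return 0.0
    missing = sum(1 for tag in images if not ALT_ATTR.search(tag))
    return missing / len(images)


def seed_missing_alt(html, how_many):
    """Remove the alt attribute from the first `how_many` images that have one."""
    remaining = how_many

    def strip_alt(match):
        nonlocal remaining
        tag = match.group(0)
        if remaining > 0 and ALT_ATTR.search(tag):
            remaining -= 1
            return ALT_ATTR.sub("", tag, count=1)
        return tag

    return IMG_TAG.sub(strip_alt, html)


clean_page = (
    '<img src="a.png" alt="A"><img src="b.png" alt="B">'
    '<img src="c.png" alt="C"><img src="d.png" alt="D">'
)

# The metric should rise steadily as more known problems are seeded.
for seeded in range(5):
    print(seeded, missing_alt_rate(seed_missing_alt(clean_page, seeded)))
```

Because the number of injected problems is known, any deviation between that number and the metric's response can be attributed to the metric or to the tool feeding it.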
Studies of validity with respect to accessibility in use should overcome the evaluator effect (Hornbæk and Frøkjær, 2008) and the lack of agreement among users in their severity ratings (Petrie and Kheir, 2007), and could address the following questions:
- Which factors affect this type of validity?
- Is it possible to estimate validity of the metric from other information that can be easily gathered?
- Is validity with respect to accessibility in use related to validity with respect to conformance?
Reliability is related to the reproducibility and consistency of scores, i.e. the extent to which they are the same when evaluations of the same resources are carried out in different contexts (different tools, different people, different goals, different times). Some efforts to understand metric reliability could go in the following directions:
- Examine how results produced by different tools vary when applied to the same site.
- Study the differences in the metric scores when metrics are fed with data produced by the same tool on the same web sites but when applying different guidelines.
- Analyse the effects of page sampling, a process that is necessary when dealing with large web sites or highly dynamic ones.
- See how reliability changes when merging the data produced by two or more evaluation tools applied to the same site.
- Analyse how reliability of a metric correlates with its validity.
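As one possible way to quantify the first direction above, the sketch below compares the scores that a metric yields for the same pages when fed by two different tools, using Spearman's rank correlation. The choice of this statistic is an assumption made for illustration, and the scores are made-up values, not measurements.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical metric scores for the same four pages, fed by two tools.
scores_tool_a = [0.10, 0.35, 0.20, 0.80]
scores_tool_b = [0.15, 0.30, 0.40, 0.90]
print(round(spearman(scores_tool_a, scores_tool_b), 3))  # 0.8
```

A rank-based statistic is used here because two tools may place pages in the same order even when their absolute scores differ.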
The sensitivity of a metric is the extent to which changes in the output of the metric are quantitatively related to changes in the accessibility of the site being analysed. Experiments could be set up to perform sensitivity analysis: given a set of accessibility problems in a test website, they could be systematically turned on or off, and their effects on metric values could be analysed to find out which kinds of problems have the largest effect and under which circumstances.
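A one-at-a-time version of such an experiment can be sketched as follows. Everything here is assumed for illustration: the metric (a plain mean of per-problem failure rates), the problem names and the counts are hypothetical, and a real study would also need to vary several problems together.

```python
def mean_failure_rate(violations, failure_points):
    """Hypothetical metric: mean of the per-problem failure rates."""
    rates = [
        violations.get(problem, 0) / points
        for problem, points in failure_points.items()
        if points > 0
    ]
    return sum(rates) / len(rates) if rates else 0.0


def one_at_a_time_effects(violations, failure_points, metric=mean_failure_rate):
    """Turn each problem off in isolation and record how much the metric drops."""
    baseline = metric(violations, failure_points)
    effects = {}
    for problem in violations:
        switched_off = dict(violations)
        switched_off[problem] = 0  # "turn off" this problem only
        effects[problem] = baseline - metric(switched_off, failure_points)
    return baseline, effects


# Made-up test site: possible failure points and seeded violations per problem.
failure_points = {"missing_alt": 10, "unlabelled_input": 4, "low_contrast": 20}
violations = {"missing_alt": 5, "unlabelled_input": 4, "low_contrast": 2}

baseline, effects = one_at_a_time_effects(violations, failure_points)
for problem, drop in sorted(effects.items(), key=lambda item: item[1], reverse=True):
    print(f"{problem}: metric drops by {drop:.3f} when fixed")
```

Passing the metric as a parameter lets the same experiment be repeated with different metrics on the same seeded site.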
Adequacy is a general quality, encompassing several properties of accessibility metrics: the type of data used to represent scores, the precision in terms of the resolution of the scale, normalization, and the span covered by the actual values of the metric (distribution). Provided that a metric is valid and reliable, research directions about metric adequacy should analyse the suitability and usefulness of its values for users, as well as metric visualization and presentation issues.
The personalization of metrics is a challenge, as not all success criteria affect all users in the same way. While some have tried to group guidelines according to their impact on specific user groups, user needs can be so specific that the effect of a given barrier is more closely related to the user's individual abilities and cannot be inferred from membership of a disability group. Individual needs may deviate considerably from group guidelines (e.g., a motor-impaired individual having more residual physical abilities than the group guidelines foresee). Additionally, in a more fine-grained approach, metrics could consider the users' interaction context, encompassing the Assistive Technology (AT) they are using, the specific browser, plug-ins and operating system platform.
A proposal for a standardized set of rules to accomplish challenges
One way to give the research community a common playground for shedding light on these challenges would be to organise competitions of the same kind as the one described in Benchmarking_Web_Accessibility_Evaluation_Tools, with different criteria and resources, of course. The issues are very similar to those faced by evaluation tools:
- How do we create test collections?
- How do we select our test-participants? (The metrics highly depend on the tester)
- Do we make use of existing web pages?
- How do we inject accessibility defects in these pages?
- Which criteria do we use to rank metrics?
- How do we isolate the metric from the underlying testing tool?
- Which factors should influence metrics (e.g., defects per page for a given criterion, defect repetition due to a single defect on a server-side Web page template, WCAG severity level, etc.)?
To start with, we could collect pages that we know are accessible and pages that we know are not (because we injected faults into them), ask participants to apply their metrics to these pages, and have them report how far apart the accessible pages are from the non-accessible ones.
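One simple way to express "how far apart" the two groups of pages are is the share of (accessible, non-accessible) page pairs that a metric orders correctly. The sketch below uses that measure as an assumption made for illustration, not as an agreed ranking criterion. It further assumes that higher scores mean less accessible pages, and the scores are made up.

```python
def separation(accessible_scores, inaccessible_scores):
    """Share of page pairs in which the non-accessible page scores worse.

    Assumes higher scores mean less accessible; ties count as half.
    1.0 means perfect separation, 0.5 is no better than chance.
    """
    if not accessible_scores or not inaccessible_scores:
        raise ValueError("both groups need at least one score")
    correct = 0.0
    for good in accessible_scores:
        for bad in inaccessible_scores:
            if bad > good:
                correct += 1.0
            elif bad == good:
                correct += 0.5
    return correct / (len(accessible_scores) * len(inaccessible_scores))


# Made-up scores from one metric on known-accessible and fault-injected pages.
accessible = [0.0, 0.05, 0.10]
fault_injected = [0.30, 0.08, 0.55]
print(round(separation(accessible, fault_injected), 3))  # 0.889
```

Because the measure only uses the ordering of scores, it can compare metrics that report on different scales.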
References
- Readability test. http://en.wikipedia.org/wiki/Readability_test
- M. Vigo, M. Arrue, G. Brajnik, R. Lomuscio, J. Abascal (2007) Quantitative metrics for measuring web accessibility. W4A 2007, 99-107, DOI: 10.1145/1243441.1243465
- R. Lopes, D. Gomes, L. Carriço. (2010) Web not for all: a large scale study of web accessibility. W4A 2010, article 10, DOI: 10.1145/1805986.1806001
- G. Brajnik (2004) Comparing accessibility evaluation tools: a method for tool effectiveness. Universal Access in the Information Society 3(3-4), 252-263, DOI: 10.1007/s10209-004-0105-y
- M. Vigo, G. Brajnik (2011) Automatic web accessibility metrics: where we are and where we can go. Interacting With Computers 23(2), 137-155, DOI: 10.1016/j.intcom.2011.01.001
- G. Brajnik, R. Lomuscio (2007) SAMBA: a semi-automatic method for measuring barriers of accessibility. ASSETS 2007, 43-50, DOI: 10.1145/1296843.1296853
- G. Brajnik (2011) The troubled path of accessibility engineering: an overview of traps to avoid and hurdles to overcome. Newsletter ACM SIGACCESS Accessibility and Computing, Issue 100, June 2011
- K. Hornbæk, E. Frøkjær (2008) A study of the evaluator effect in usability testing. Human-Computer Interaction 23 (3), 251–277, DOI: 10.1080/07370020802278205
- H. Petrie, O. Kheir (2007) Relationship between accessibility and usability of web sites. CHI’07, 397–406, DOI: 10.1145/1240624.1240688
Post-symposium analysis:
- M. Cooper (2011) Web accessibility metrics – "What are they for then?" -- This blog post addresses metrics from the perspective of a practitioner in a university, and includes analysis of the papers presented in the Symposium. It might be useful to read when planning the Symposium report. [SLH]