

Abstract

Web accessibility metrics are an invaluable tool for researchers, developers, governmental agencies and end users. Accessibility metrics help to better grasp the accessibility level of websites and are therefore helpful for making decisions based on the scores they produce. Recently, a plethora of metrics have been released; however, the validity and reliability of most of these metrics is unknown, and those making use of them are taking the risk of using inappropriate metrics. To overcome this situation, this note provides a framework that considers validity, reliability, sensitivity, adequacy and complexity as the main qualities that a metric should have.

A symposium was organised to observe how current practice addresses these qualities. We found that work addressing the validity of metrics is scarce, although some efforts can be perceived as far as inter-tool reliability is concerned. This is something the research community should be aware of, as we might be making futile efforts by using metrics whose validity and reliability are unknown. The research realm is perhaps not mature enough, or we do not have the right methods and tools. We therefore try to shed some light on the possible paths that could be taken so that we can reach maturity.

Status of this document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This 16 May 2012 Editor's Draft of the Research Note on Web Accessibility Metrics is intended to be published and maintained as a W3C Working Group Note after review and refinement. The note provides an initial consolidated view of the outcomes of the Website Accessibility Metrics Online Symposium held on 5 December 2011.

The Research and Development Working Group (RDWG) invites discussion and feedback on this draft document by researchers and practitioners interested in metrics for web accessibility, in particular by participants of the online symposium. Specifically, RDWG is looking for feedback on:

Please send comments on this Research Note on Web Accessibility Metrics document by @@@ to @@@ (publicly visible mailing list archive).

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document has been produced by the Research and Development Working Group (RDWG), as part of the Web Accessibility Initiative (WAI) International Program Office.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. The group does not expect this document to become a W3C Recommendation. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; this page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.


Table of Contents

  1. Introduction
  2. A Framework for Quality of Accessibility Metrics
  3. Current Research
  4. A Research Roadmap for Web Accessibility Metrics
  5. A Corpus for Metrics Benchmarking
  6. Conclusions
  7. References
  8. Symposium Proceedings

1. Introduction

1.1 Definition and background

In the web engineering domain, a metric is a procedure for measuring a property of a web page or website. A metric can be the number of links, the size in KB of an HTML file, the number of users that click on a certain link, or the perceived ease of use of a web page. In the realm of web accessibility, amongst others, a metric can measure the following qualities:

In order to measure more abstract qualities, more sophisticated metrics are built upon more basic ones. For instance, readability metrics [readability] take into account the number of syllables, words and sentences contained in a document in order to measure the complexity of a text. Similarly, metrics aiming at measuring web accessibility have been built on specific qualities, which can be inherent in a website (such as images with no alt attribute) or observed from human behaviour (e.g., user satisfaction ratings, or performance indexes such as the number of errors). For instance, the failure-rate metric computes the ratio of the number of accessibility violations for a particular set of criteria to the number of failure points for the same criteria.
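As an illustration of how such a basic conformance-based score can be computed, the following minimal Python sketch derives a failure-rate style value from per-criterion counts. The data structure and field names are hypothetical and not taken from any particular tool or paper.

    # Illustrative sketch (not a normative definition): a failure-rate style score,
    # computed as violations divided by potential failure points across criteria.
    # The input structure and names (criterion, violations, potential) are hypothetical.

    def failure_rate(checks):
        """checks: list of dicts, one per evaluated criterion."""
        violations = sum(c["violations"] for c in checks)
        potential = sum(c["potential"] for c in checks)
        if potential == 0:
            return 0.0  # nothing applicable: treat as no failures
        return violations / potential

    page_checks = [
        {"criterion": "1.1.1", "violations": 3, "potential": 12},  # e.g. 3 of 12 images lack alt text
        {"criterion": "2.4.4", "violations": 1, "potential": 30},  # e.g. 1 of 30 links has no accessible name
    ]
    print(f"failure rate: {failure_rate(page_checks):.2f}")  # 4/42, roughly 0.10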

As a result of the computation of accessibility metrics, different types of data can be produced:

Web accessibility can be viewed and defined in different ways [Brajnik08]. One way is to consider whether a web page or website conforms to a set of principles such as WCAG 2.0 or Section 508. Even though WCAG 2.0 conformance levels are well specified and, as seen above, are ordinal values, other metrics could be defined on the basis of success criteria and their sufficient, advisory and failure techniques. We call these metrics, which are based on whether the success criteria of given guidelines are met, conformance-based metrics.

Other metrics can be defined if one assumes that accessibility is a quality that differs from conformance. For example, Section 508 defines accessibility as the extent to which "a technology [...] can be used as effectively by people with disabilities as by those without it". Provided that effectiveness can be measured, such metrics could yield results that differ from conformance-based ones. In analogy to the notion of "quality in use" for software, we call these accessibility-in-use metrics to emphasise that they try to measure performance indexes that can be shown by real users when using the web site in specific situations. In addition, they do not require the notion of conformance with respect to a set of principles. Traditional usability metrics such as effectiveness, efficiency and satisfaction could be considered accessibility-in-use metrics. Also, any measure of the perceived accessibility of a web page by users is a metric belonging to this second group. Notice that this notion of accessibility covers not only accessibility of the content of web pages, but also accessibility of user agents, features of assistive technologies, and could even address different levels of expertise that users have with these resources.

Most of the existing metrics - see a review in [Vigo11a] - are of the former type because they are mainly built upon criteria implemented by automatic testing tools, such as the number of violations or their WCAG priority. Moreover, in order to overcome the lack of sensitivity and precision of ordinal metrics, conformance metrics often yield ratio scores. The main reason for the widespread use of these types of metrics lies in their low cost in terms of time and human resources, since they are based on automatic tools. Although no human intervention (experts' audits or user tests) is required in the process, this does not necessarily entail that only fully automated success criteria are to be considered. Some metrics estimate the violation rate of semi-automatic success criteria and purely manual ones, as in [Vigo07]; others adopt an optimistic vs. conservative approach to the violation rate [Lopes].
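The optimistic vs. conservative idea can be illustrated with a small sketch: warnings raised by semi-automatic checks are counted either as passes or as failures, yielding a lower and an upper bound on the failure rate. This is only loosely inspired by the approach attributed to [Lopes] above; the counts and the exact formulation are hypothetical.

    # Illustrative sketch of an optimistic vs. conservative treatment of warnings
    # raised by semi-automatic checks. The counts below are invented.

    def bounded_failure_rate(automatic_fails, warnings, potential):
        """Return (optimistic, conservative) failure rates for one page.
        optimistic   -> warnings assumed to be passes
        conservative -> warnings assumed to be failures"""
        optimistic = automatic_fails / potential
        conservative = (automatic_fails + warnings) / potential
        return optimistic, conservative

    low, high = bounded_failure_rate(automatic_fails=4, warnings=6, potential=40)
    print(f"failure rate between {low:.2f} and {high:.2f}")  # between 0.10 and 0.25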

The error rate of these estimations, together with the reliance on testing tools, is the major weakness of automatic conformance metrics. In fact, these metrics inherit tool shortcomings such as false positives and false negatives, which affect their outcome [Brajnik04].

A benchmarking survey on automatic conformance metrics concluded that existing metrics are quite divergent and that most of them do not adequately distinguish accessible pages from non-accessible ones [Vigo11a]. On the other hand, there are metrics that combine testing-tool results with those produced by human review, with the goal of estimating such errors; one example is SAMBA [Brajnik07]. Other metrics do not rely on tools at all; an example is the evaluation done with the AIR method [AIR].

1.2 The Benefits of Using Metrics

There are several scenarios that could benefit from web accessibility metrics:

2. A Framework for Quality of Accessibility Metrics

Several quality factors can be defined for web accessibility metrics; these factors can be used to assess how applicable a metric is in a certain scenario and, potentially, to characterize the risks that adopting a given metric entails. As discussed in [Vigo11a], validity, reliability, sensitivity, adequacy and complexity appear to be the most important factors.

2.1 Validity

This attribute is related to the extent to which the measurements obtained by a metric reflect the accessibility of the website to which it is applied, and this could depend on the notion of accessibility: conformance vs. accessibility in use. The former refers to how well a web document meets specific criteria (i.e., principles and guidelines), whereas the latter indicates how the interaction is perceived. These two perspectives are not necessarily the same, as the following illustrates: a picture without alternative text violates a guideline, making a web page non-conformant; however, the lack of alternative text may not be perceived as an obstacle if the goal of the user is to navigate or even purchase an item on an e-commerce site.

As discussed above, most existing conformance metrics are plagued by their reliance on automatic testing tools and do not provide means to estimate the error rate of tools. Furthermore, the way the metric itself is defined could lead to other sources of errors, reducing its validity. For example, the failure rate should not be used as a measure of accessibility-in-use; using it as a measure of conformance is also controversial: it is sometimes claimed that it measures how well developers coped with accessibility features rather than providing an estimation of conformance [Brajnik11]. Validity with respect to accessibility-in-use should cope with the evaluator effect [Hornbæk] and with the lack of agreement of users in their severity ratings [Petrie].

Validity is by far the most important quality attribute for accessibility metrics. Without it we would not know what a metric really measures. The risk of not being able to characterize the validity of metrics is that potential users would choose those that appear simple to apply and that provide seemingly plausible results. In a sense, people may therefore choose a metric because it is simple rather than because it is a good metric, with the unforeseen consequence that incorrect claims and decisions could be made regarding web pages and sites. These are important issues, as they strike at the heart of our notions of conformance: we would be assessing a user interface without truly knowing whether our method of assessment is itself valid.

2.2 Reliability

This attribute is related to the reproducibility and consistency of scores, i.e. the extent to which they are the same when evaluations of the same web pages are carried out in different contexts (different tools, different people, different goals, different time). Reliability of a metric depends on several layers that are interconnected. These range from the underlying tools (what happens if we switch tools?), to underlying guidelines (what happens if we switch guidelines?), to the evaluation process itself (if random choices are made, for example when scanning a large site).

Unreliable metrics are problematic: they are inconsistent, they limit people's ability to predict their behavior, and they limit the ability to comprehend them at a deeper level. However, reliability cannot always be expected. For instance, if we switch guideline sets we should not expect similar results, as a different problem coverage is assumed.

It is worth noting that one of the aims of this research note is to help identify errors, or spot gaps, in current metrics. The idea is that we can thereby either confidently reject faulty metrics or improve them, in order to halt a process of "devaluation". This devaluation happens in the mind of the end user, in terms of the perceived value of the "ideal" of conformance. It can be a byproduct of poor metrics themselves, or come from misunderstanding the output of metrics that are not clear or easy for end users to understand. In other words, if a metric is not stable, it is very difficult to effectively use it as a tool of either analysis or comprehension.

2.3 Sensitivity

Metric sensitivity is a measure of how changes in a given website are reflected in changes to the metric's output. Ideally we would like metrics not to be too sensitive, so that they are robust and do not over-react to small changes in web content. This is especially important when the metric is applied to highly dynamic websites, as we show later in this note.

2.4 Adequacy

This is a general quality, encompassing several properties of accessibility metrics, for instance: the type of data used to represent scores, the precision in terms of the resolution of a scale, normalization, and the span covered by actual values of the metric (its distribution). These attributes determine whether the metric can be suitably deployed in a given scenario. For example, to be able to compare the accessibility levels of different websites (as would happen in the large-scale scenario discussed above), metrics should provide normalized values, as otherwise comparisons are not viable. If the distribution of values of the metric is concentrated in a small interval (such as between 0.40 and 0.60, instead of [0, 1]) then even large changes in accessibility could lead to small changes in the metric, and roundoff errors could influence the final outcomes.
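To make the normalization and distribution concerns concrete, the following sketch rescales metric values that cluster in a narrow band onto the full [0, 1] range using simple min-max rescaling. The scores are invented, and rescaling is only one of the adequacy considerations listed above.

    # Illustrative sketch: min-max rescaling of metric scores that cluster in a
    # narrow band (e.g. 0.40-0.60) onto the full [0, 1] range, so differences
    # between pages are easier to distinguish. The scores are hypothetical.

    def rescale(scores):
        lo, hi = min(scores), max(scores)
        if hi == lo:
            return [0.0 for _ in scores]  # degenerate case: no spread at all
        return [(s - lo) / (hi - lo) for s in scores]

    raw = [0.42, 0.47, 0.51, 0.58]
    print([round(s, 2) for s in rescale(raw)])  # [0.0, 0.31, 0.56, 1.0]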

2.5 Complexity

Depending on the type and quantity of data used to compute a metric and the algorithm on which it is based, the process can be more or less demanding with respect to certain resources, such as time, processors, bandwidth and memory. The complexity of a metric therefore reflects the computational and human resources it requires, which may prevent stakeholders from embracing accessibility metrics. Some scenarios require metrics to be relatively simple (for example when metrics are used for adaptations of the user interface and therefore have to be computed on the fly). However, some metrics may require high bandwidth to crawl large websites, large storage capacity or increased computing power. For those metrics that rely on human judgment, another complexity aspect is the workflow process that has to be established to resolve conflicts and synthesize a single value. As a result, these metrics may not suit particular application scenarios, budgets or resources.

3. Current Research

The papers presented at the symposium cover a broad span of issues, addressing the quality factors outlined above to different extents. However, they provide new insights and raise new questions that help shape future research avenues (see section 4).

3.1 Addressing Validity and Reliability

Validity in terms of conformance was tackled by Vigo et al. [Vigo11b] by comparing automatic accessibility scores with those given by a panel of experts, obtaining a strong positive correlation. Inter-tool reliability of metrics was also addressed, by comparing the behaviour of the WAQM metric when assessing 1,500 pages with two different tools (EvalAccess and LIFT). A very strong correlation was found when pages were ranked according to their scores, although to obtain the same effect with ratio scores the metric requires some ad-hoc adjustment. Finally, the authors investigated inter-guideline reliability between WCAG 1.0 and WCAG 2.0, again finding a very strong correlation between ordinal values, although this effect fades when looking at ratio data.
Fernandes and Benavidez [JFernandes] addressed metric reliability (UWEM and web@X) by comparing two tools (eChecker and eXaminator) with different interpretations of success criteria and coverage, assessing the accessibility of about 300 pages. An initial experiment shows a moderate positive correlation between those tools. A minimal sketch of such an inter-tool comparison is shown below.
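The following sketch shows, with invented scores, how such an inter-tool reliability check can be expressed as a rank correlation between the scores two tools assign to the same pages; the cited studies used far larger page samples and more detailed analyses.

    # Illustrative sketch of an inter-tool reliability check: rank-correlate the
    # scores two tools assign to the same pages (Spearman's rho). The scores are
    # made up and stand in for the output of two hypothetical evaluation tools.
    from scipy.stats import spearmanr

    tool_a = [0.91, 0.75, 0.62, 0.40, 0.33]  # scores from tool A, one per page
    tool_b = [0.88, 0.70, 0.65, 0.45, 0.30]  # scores from tool B, same pages

    rho, p_value = spearmanr(tool_a, tool_b)
    print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")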

Reliability of metrics very often depends on the reliability of the underlying testing tools, and it is well known that different tools produce different results on the same pages. A point raised during the symposium was that this problem could lead to situations where low credibility is attributed to tools and metrics; metrics would make it even more difficult to compare different outcomes and diagnose bad behavior. In addition, stakeholders could be tempted to adopt the metrics that provide the best results on their pages, or those that can be more easily interpreted and explained, regardless of whether they are related to accessibility. However, as mentioned previously, we should be cautious about when to expect reliable behaviour across tools, guidelines or domains.

3.2 Tool Support for Metrics

The availability of metrics in the form of publicly available algorithms, APIs or tools is critical if the use of metrics is to gain momentum and their adoption is to be fostered. Providing such mechanisms will help facilitate a broader adoption of metrics by stakeholders, especially by those who, even if interested in using them, do not have the resources to operate and articulate them. There are some incipient proposals in this direction that implement a set of metrics: Naftali and Clúa [Naftali] presented a platform where failure-rate and UWEM are deployed. However, this does entail that human intervention is required, as the system needs input from experts to discard false positives. There are some other tools that help to keep track of the accessibility level of websites over time [Battistelli11a]. These sorts of tools tend to target the accessibility monitoring of websites within given geographical areas, normally municipalities or regional governments. The tool support provided by Fernandes et al. [NFernandes11a], QualWeb, adds to traditional accessibility testing tools a feature to detect templates; the novelty of this approach is that the metric employed uses the accessibility of the template as a baseline, and accessibility is measured from that starting point (see the sketch after this paragraph). If the accessibility problems of the template were repaired, these fixes would automatically spread to all the pages built upon the template. Therefore, the distance from a particular web page to the template (or baseline) can be used to estimate the effort required to fix this instance, which is very appropriate for Quality Assurance scenarios.
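The template-as-baseline idea can be sketched as follows: a page is scored by the problems it adds on top of those already present in its template. This is not QualWeb's actual interface; the problem identifiers and the function below are hypothetical.

    # Illustrative sketch of the template-as-baseline idea described for
    # [NFernandes11a]: count only the problems a page introduces beyond its
    # template, so template-level fixes are not recounted for every page.

    def distance_from_template(template_problems, page_problems):
        """Return the number of problems found on the page but not already
        present in the template (i.e. problems specific to this instance)."""
        return len(set(page_problems) - set(template_problems))

    template = {"img-no-alt:#logo", "contrast:#nav"}
    page = {"img-no-alt:#logo", "contrast:#nav",
            "label-missing:#search", "img-no-alt:#product"}

    print(distance_from_template(template, page))  # 2 problems specific to this page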

3.3 Addressing Large-Scale Measurement

Large-scale evaluation and measurement is required for websites that contain a large number of pages, or when a number of websites have to be evaluated. Managing these large volumes of data cannot be done without the help of automated tools. An example of large websites is provided by Fernandes et al. [NFernandes11a]. They present a method for template detection that aims at lessening the computing effort of evaluating large numbers of pages. This can be useful, for instance, on websites that rely massively on templates, such as online stores. In this case, the vast majority of pages follow a given template: normally, the only content that changes is the item to be sold and the related information, whereas the layout and internal structure remain the same. One example that addresses measuring the accessibility of a large number of distinct websites is described by Battistelli et al. [Battistelli11a] using the BIF metric; similarly, AMA is a platform for keeping track of a large number of websites, which is used to measure how conformant to guidelines the sites of specific geographical areas are. Finally, Nietzio et al. [Nietzio] present a metric to measure WCAG 2.0 conformance in the context of a platform for keeping track of the accessibility of Norwegian municipalities.

3.4 Targeting Particular Accessibility Issues

Battistelli et al. [Battistelli11b] present a metric to quantify the compliance of documents with respect to their DTDs. Instead of measuring this compliance as if it were a binary variable (conformant/non-conformant), compliance is measured as the distance of the current document from the ideal one. Although its relationship with accessibility is not very apparent, code compliance is one of the technical accessibility requirements according to the Italian regulation, and it also impacts those success criteria that call for the correct use of standards [see WCAG SC 4.1.1 Parsing]. This approach could also be followed to measure accessibility: for instance, a web page could be improved until it is accessible according to guidelines, or until it provides an acceptable experience to end users, and the accessibility level of the non-accessible page could then be computed in terms of the effort required to build the ideal web page, measured in lines of code, mark-up tags introduced or removed, or time. Another particular accessibility problem is tackled by Rello and Baeza-Yates [Rello], who address the measurement of text legibility. This affects the understandability of a document, a fundamental accessibility principle [see the Understandable principle]. The interesting contribution of this work is its reliance on a quantitative model of spelling errors automatically computed from a large set of pages handled by a search engine. Compliance with the DTD and legibility of a web document can be considered not only accessibility success criteria but also quality issues.
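The "distance to an ideal document" idea can be illustrated with a small sketch that counts the line-level edits separating the current markup from a repaired version. This is only an approximation of the notion, not the metric defined in [Battistelli11b]; the markup snippets are invented.

    # Illustrative sketch of measuring compliance as a distance to an "ideal"
    # document rather than a binary pass/fail, approximated here by the number
    # of line-level edits between the current markup and a repaired version.
    import difflib

    current = [
        "<html>",
        "<img src='logo.png'>",        # missing alt attribute
        "<p>Welcome<p>",               # unclosed paragraph
        "</html>",
    ]
    ideal = [
        "<html>",
        "<img src='logo.png' alt='Company logo'>",
        "<p>Welcome</p>",
        "</html>",
    ]

    # Count added and removed lines reported by the diff (ignore context lines).
    edits = sum(1 for line in difflib.ndiff(current, ideal) if line[:1] in "+-")
    print(f"edit distance to the ideal document: {edits} line changes")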

3.5 Novel Measurement Approaches

When it comes to innovative ways of measuring, the distance from a given document to a reference model can inspire similar approaches to measuring web accessibility. As suggested by [Battistelli11b], compliance can be measured as the distance between a given document and an ideal (or acceptable) one. This distance can be measured, for instance, in terms of missing hypertext tags or the effort required to accomplish the changes. Another example is the metric that measures the distance from an instance document to a baseline template [NFernandes11a]. Another novel way of measuring accessibility is to use a grading scale and an arbitration process, as proposed by Fischer and Wyatt [Fischer]: the use of a five-point Likert scale aims at going beyond a binary accessible/non-accessible scoring scale. It would be interesting to see, in the future, how the final outcome of an evaluation depends on the original scores given by individual evaluators, and what level of agreement exists between evaluators before arbitration takes place.

Vigo [Vigo11c] proposes a method for managing sets of checkpoints that, depending on the contextual requirements, either have to be met simultaneously or for which the fulfilment of just one suffices. Nietzio et al. [Nietzio] suggest a stepwise method to measure conformance to WCAG 2.0, in which aspects of success criteria applicability and tool support are considered. The method adapts to the specific testing procedures of WCAG 2.0 success criteria (SC) by providing a set of decision rules: first, the applicability of the SC is analysed; second, if applicable, the SC is tested; third, if a common failure is not found, the implementation of the sufficient techniques is checked; and finally, tool support is checked for the techniques identified in the previous step (a simplified sketch of these rules follows). The metric computed as a result of this process is a failure rate that also takes into account the logic underlying necessary, sufficient and counter-example techniques for each SC.
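The decision rules can be sketched as a simple sequence of checks; the data structure and outcome labels below are hypothetical simplifications of the procedure described in [Nietzio], whose score function then aggregates such outcomes into a failure rate.

    # Illustrative, simplified sketch of the stepwise decision rules:
    # applicability -> common failures -> sufficient techniques -> tool support.
    # Field names and outcome labels are hypothetical.

    def evaluate_sc(sc):
        if not sc["applicable"]:
            return "not applicable"            # step 1: skip criteria that do not apply
        if sc["common_failure_found"]:
            return "fail"                      # step 2: a documented common failure applies
        if not sc["sufficient_technique_implemented"]:
            return "cannot tell"               # step 3: no sufficient technique -> manual review
        if sc["technique_tool_supported"]:
            return "pass (tool-verified)"      # step 4: the implemented technique is tool-checked
        return "pass (needs manual confirmation)"

    sample = {"applicable": True, "common_failure_found": False,
              "sufficient_technique_implemented": True, "technique_tool_supported": True}
    print(evaluate_sc(sample))  # pass (tool-verified)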

3.6 Beyond Conformance

Vigo [Vigo11c] proposes a method that not only considers guidelines when measuring accessibility conformance, but also the specific features of the accessing device (e.g., screen size, keyboard support) as well as the assistive technology operated by the user. Including these contextual characteristics of the interaction could lead to more faithful measurements of the experience. Finally, Sloan and Kelly [Sloan] claim that understanding accessibility as conformance to guidelines is risky in those countries (e.g., the UK) where accessibility assessment is not limited to guidelines but also focuses on the delivered service and user experience. They therefore encourage moving forward and embracing accessibility in terms of user experience, at a time when user experience is becoming so salient and prevalent, and thinking of conformance of the production process rather than conformance of a product (which constantly changes). This perspective is novel in that it looks beyond the current conformance paradigm and aims to tap more into the user experience, something that is not necessarily captured by current methods of technical validation or document conformance.

3.7 Concluding Remarks

The authors of the above papers were asked about several aspects of web accessibility metrics. The first aspect concerns the target users of metrics; the goal of this question was to ascertain whether metrics researchers have in mind application scenarios, or the profile of the end user who will make decisions based on the scores provided by metrics. Our survey shows that the majority of respondents do not have a specific end user of metrics in mind, or their answers are too generic. However, three papers focus on web accessibility benchmarking (see [Nietzio, Battistelli11a, JFernandes]) and some others can potentially be applied in this domain. This means that this is the application scenario with the broadest acceptance and where the application of metrics is taking off. In the remaining scenarios (quality assurance, information retrieval and the adaptive web) there are again potential applications, although the intent to apply metrics in these scenarios is not evident.

Secondly, we wanted to know whether accessibility metrics researchers are aware of the costs and risks of decisions made on the basis of wrong metric values. Most respondents consider that the validity and reliability of metrics should be guaranteed, although many regard this as future work. There is some tendency towards employing experts in such validations, although most agree that users will have the last word as far as validation is concerned. This is closely related to our last question, about the research community's point of view on measuring accessibility beyond conformance metrics. All the answers we received claimed that measuring accessibility in terms of user experience should be explored more thoroughly.

4. A Research Roadmap for Web Accessibility Metrics

This research note aims at highlighting current efforts in investigating accessibility metrics as well as uncovering existing challenges. Research on web accessibility metrics is taking off as the benefits of using them become apparent; however, their adoption is far from widespread. In addition to their relative novelty, this may be because (1) there is a plethora of metrics out there, and frameworks for metrics comparison that show their strengths and weaknesses are relatively recent [Vigo11a]; (2) quality frameworks require further investigation, as there are unexplored areas for each of the defined qualities - these areas are identified in section 4.1; and (3) existing metrics have low validity, which calls for a standardized testbed to show how they perform with regard to metric quality. Setting up a corpus of web pages for benchmarking purposes could be the first step towards this goal. It would work in the same way the Information Retrieval community tests the performance of its algorithms [see the Text Retrieval Conference, TREC] - see section 5. A side-effect of the lack of validity and reliability of metrics is their lack of credibility. This could partially be tackled by the mentioned benchmarking corpus; however, the credibility problem goes beyond it - see section 5.1. Finally, some other issues such as user-tailored metrics and dealing with dynamic content require special attention from those who aim at conducting research on web accessibility metrics.

4.1 Ensuring Metric Quality

Focusing more precisely on accessibility metric quality, there are still many challenges to pursue. The way a metric satisfies the validity, reliability, sensitivity, adequacy and complexity qualities remains open and can be addressed by the following questions. Even if all qualities are important, we emphasize that the validity and reliability of metrics should be given priority: no matter how sensitive or adequate a metric is, it is of little use if we cannot ensure its reliability and, especially, its validity.

4.2 Validity

Studies of "validity with respect to conformance" could focus on the following research questions:

The above questions could be addressed in the following way:

Studies of "validity with respect to accessibility in use" should overcome the evaluator effect [Hornbæk] and lack of agreement of users in their severity ratings [Petrie] and could address the following questions:

4.3 Reliability

Some efforts to understand metric reliability could go in the following direction:

4.4 Other Qualities

4.4.1 Sensitivity

Experiments could be set up to perform sensitivity analysis: given a set of accessibility problems in a test website, they could be systematically turned on or off, and their effects on metric values could be analysed to find out which kinds of problems had the largest effect and under which circumstances. Provided that valid and reliable metrics were used, this could tell us which accessibility barriers would have a more or less strong impact on conformance or use.
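Such a sensitivity experiment could look like the following sketch, where injected problems are toggled one at a time and the response of a placeholder metric is recorded. The problem types, penalties and scoring function are purely illustrative stand-ins for whatever metric is under study.

    # Illustrative sketch of a sensitivity analysis: toggle injected accessibility
    # problems and record how a (hypothetical) metric responds to each one.

    problems = ["missing-alt", "low-contrast", "no-form-labels"]

    def score_page(active_problems):
        # Placeholder metric: starts at 1.0 and loses a fixed penalty per problem.
        penalties = {"missing-alt": 0.30, "low-contrast": 0.10, "no-form-labels": 0.25}
        return max(0.0, 1.0 - sum(penalties[p] for p in active_problems))

    baseline = score_page([])              # test site with all injected problems turned off
    for p in problems:
        delta = baseline - score_page([p]) # turn on one problem at a time
        print(f"{p}: metric drops by {delta:.2f}")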

4.4.2 Adequacy

Provided that a metric is valid and reliable, research directions about metric adequacy should analyse the suitability and usefulness of its values for users in different scenarios, as well as metric visualization and presentation issues.

4.4.3 Complexity

The most important issue about metric complexity lies in its relationship with the other qualities. In this regard we can pose the following questions:

5. A Corpus for Metrics Benchmarking

One option for establishing a common playground in which the research community could shed some light on these challenges would be to organise the same kind of competitions as the TREC experiments. Recently, some efforts have been directed towards this goal by the W3C or in the context of the BenToWeb project. There are several issues that need to be tackled.

To start with, we could collect pages we know are accessible, together with pages we know are not (because we injected faults into them or collected them from repositories such as www.fixtheweb.net), ask participants to apply their metrics to such pages, and see how far apart the accessible pages are from the non-accessible ones (a minimal sketch of this comparison follows below). Another option would be to use pages from initiatives such as the one promoted by WAI, the "BAD: Before and After Demonstration", where, for educational purposes, the process of transforming a non-accessible page into an accessible one is shown.
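A minimal sketch of the comparison between the two groups of pages, with invented scores, might look as follows; a real benchmark would involve many more pages and proper statistical testing.

    # Illustrative sketch of a benchmark comparison: apply a metric to pages known
    # to be accessible and to fault-injected counterparts, then look at how far
    # apart the two groups' scores are. The scores below are invented.
    from statistics import mean

    accessible_scores = [0.92, 0.88, 0.95, 0.90]     # metric applied to accessible pages
    inaccessible_scores = [0.55, 0.61, 0.48, 0.66]   # same metric on fault-injected versions

    gap = mean(accessible_scores) - mean(inaccessible_scores)
    print(f"mean separation between groups: {gap:.2f}")  # larger gaps suggest better discrimination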

5.1 Credibility issues

Accessibility scores are a great device for grasping the accessibility level of web pages. However, metrics can turn out to be a double-edged sword: while they enhance comprehension, they can also hide relevant information and details on the accessibility of a page. This side effect can lead end users to choose the most lenient scores among the metrics that are available. As a result, there is a risk of undermining the credibility of, and trust in, accessibility metrics.

The fact that different evaluation tools yield different results directly affects metric validity and, in particular, metric reliability. The poor reproducibility of evaluation reports and accessibility scores has a side-effect on the perception of individuals, in that the web accessibility assessment process can be regarded as having low credibility.

5.2 User-tailored metrics

There is a challenge in the personalization of metrics, as not all success criteria impact all users in the same way. While some have tried to group guidelines according to their impact on particular user groups, user needs can be so specific that the effect of a given barrier is more closely related to an individual's abilities and cannot be inferred from membership of a user disability group. Individual needs may deviate considerably from group guidelines (e.g., a motor-impaired individual having more residual physical abilities than the group guidelines foresee). There are some research actions that could be taken to improve user-tailored metrics:

5.3 Dealing with dynamic content

Measuring something that changes over time can give different results depending on the magnitude of such changes. Web pages these days are no exception, as dynamic content causes updates in web documents. Web pages are alive and no longer inert, and these changes are not always a reaction to user interaction but to other factors such as time or location. Especially in Rich Internet Applications, these updates are frequently provoked by scripting techniques that mutate web content. Therefore, the mark-up alone gives few hints for predicting the behaviour of a web document. Normally, the most appropriate way to assess the current instance of a dynamic web document is to retrieve and test its DOM; its subsequent mutations should then be monitored and tested. As expected, different instances of a document caused by updates show inconsistent accessibility evaluation results [NFernandes11b]. As a result, if a metric is sensitive enough, it should be able to reflect these updates.

This area calls for research on the frequency of testing: should pages be tested every time they are updated, or should they be retrieved at sampling intervals? Additionally, there are other questions: what would be the accessibility score of a given URL if page updates entail changes in accessibility? Should an average over all instances be accumulated? One possible aggregation is sketched below.
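One possible aggregation strategy, assuming per-instance scores sampled over time, is sketched here; the scores are hypothetical, and real monitoring would capture and evaluate actual DOM snapshots.

    # Illustrative sketch of one possible answer to the questions above: sample a
    # dynamic page at intervals, score each observed DOM instance, and report
    # both an average and the observed range for the URL. Scores are invented.
    from statistics import mean

    instance_scores = {  # sampled snapshots of the same URL over time
        "t+0min": 0.85,
        "t+10min": 0.78,   # e.g. a script injected an unlabeled widget
        "t+20min": 0.85,
        "t+30min": 0.70,   # e.g. a live-region update broke keyboard focus
    }

    scores = list(instance_scores.values())
    print(f"average over instances: {mean(scores):.2f}, "
          f"range: {min(scores):.2f}-{max(scores):.2f}")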

Conformance to WAI-ARIA and the accessibility elements subsumed by HTML5 could also be explored by future accessibility metrics.

6. Conclusions

This research note introduces web accessibility metrics: they have been defined and specified, the benefits of using them have been highlighted and some possible application scenarios have been described. Spurred by the growing number of different metrics being released, we present a framework that encompasses the qualities that a good metric should have. As a result, metrics can be benchmarked according to their validity, reliability, sensitivity, adequacy and complexity. We believe this framework can help individuals make decisions on the adoption of existing metrics according to the qualities required of them. In this way, there will be no need to reinvent the wheel and design new metrics blindly if available metrics already fit one's needs.

A symposium was held in order to examine how metrics address the above-mentioned qualities and to keep track of current efforts targeting quality issues of accessibility metrics. The symposium provided a partial, but concrete, snapshot of most of the research activity around this topic. We found that tool reliability is a recurrent topic in this regard, whereas there is still a long way to go in the realm of methods and examples for metric validity, which are rare. The editors of this research note believe that more effort should be directed towards investigating the validity and reliability of metrics. Employing metrics whose validity and reliability are at stake is a very risky practice that should be avoided. We therefore claim that accessibility metrics should be used and designed responsibly.

One way to hide the inherent complexity of metrics is to provide tools that facilitate their application in an automatic or semi-automatic way. This need for automation comes from the necessity of assessing large volumes of data and websites; that is why large-scale analysis of accessibility calls for metrics that can easily be deployed and implemented. Some other efforts target specific quality aspects of the Web, such as lexical quality or compliance with DTDs. Finally, an emerging trend aims at measuring accessibility not only in purely compliance terms. Since contextual factors play an important role in the user experience, accessibility measurement should be able to consider these factors by collecting and including them in the measurement process, or by observing the behaviour and performance of real users in real settings, à la usability testing. This perspective can be understood as a complementary approach to current accessibility measurement practice.

Based on the needs and gaps that hinder current accessibility measurement we propose a number of research avenues that can help to boost the acceptance and quality of accessibility metrics. Mostly, quality issues of metric validity and reliability need urgent action but there are also some other actions that can help to make metrics more credible and widespread. A common corpus for metrics benchmarking would be a good step in this direction as it could potentially tackle quality and credibility issues at the same time. Dynamic content and user-tailoring aspects can open new research paths that can have strong impact on the quality of assessment practices, methodologies and tools.

7. References

[AIR] Accessibility Internet Rally (AIR). Available at http://www.knowbility.org/v/air/

[Battistelli11a] M. Battistelli, S. Mirri, L.A. Muratori, P. Salomoni (2011) Measuring accessibility barriers on large scale sets of pages. W3C-RDWG Symposium on Website Accessibility Metrics, paper 2. http://www.w3.org/WAI/RD/2011/metrics/paper2/

[Battistelli11b] M. Battistelli, S. Mirri, L.A. Muratori, P. Salomoni (2011) A metrics to make different DTDs documents evaluations comparable. W3C-RDWG Symposium on Website Accessibility Metrics, paper 4. http://www.w3.org/WAI/RD/2011/metrics/paper4/

[Brajnik04] G. Brajnik (2004) Comparing accessibility evaluation tools: a method for tool effectiveness. Universal Access in the Information Society 3(3-4), 252-263, DOI: 10.1007/s10209-004-0105-y

[Brajnik07] G. Brajnik, R. Lomuscio (2007) SAMBA: a semi-automatic method for measuring barriers of accessibility. ASSETS 2007, 43-50, DOI: 10.1145/1296843.1296853

[Brajnik08] G. Brajnik (2008) Beyond Conformance: The Role of Accessibility Evaluation Methods. WISE Workshops 2008, 63-80, DOI: 10.1007/978-3-540-85200-1_9

[Brajnik11] G. Brajnik (2011) The troubled path of accessibility engineering: an overview of traps to avoid and hurdles to overcome. ACM SIGACCESS Accessibility and Computing Newsletter, Issue 100, June 2011.

[Fischer] D. Fischer, T. Wyatt (2011) The case for a WCAG-based evaluation scheme with a graded rating scale. W3C-RDWG Symposium on Website Accessibility Metrics, paper 7. http://www.w3.org/WAI/RD/2011/metrics/paper7/

[Hornbæk] K. Hornbæk, E. Frøkjær (2008) A study of the evaluator effect in usability testing. Human-Computer Interaction 23 (3), 251-277, DOI: 10.1080/07370020802278205

[JFernandes] J. Fernandes, C. Benavidez (2011) A zero in eChecker equals a 10 in eXaminator: a comparison between two metrics by their scores. W3C-RDWG Symposium on Website Accessibility Metrics, paper 8. http://www.w3.org/WAI/RD/2011/metrics/paper8/

[Lopes] R. Lopes, D. Gomes, L. Carriço. (2010) Web not for all: a large scale study of web accessibility. W4A 2010, article 10, DOI: 10.1145/1805986.1806001

[Naftali] M. Naftali, O. Clúa (2011) Integration of Web Accessibility Metrics into a Semi-Automatic evaluation process. W3C-RDWG Symposium on Website Accessibility Metrics, paper 1. http://www.w3.org/WAI/RD/2011/metrics/paper1/

[NFernandes11a] N. Fernandes, R. Lopes, L. Carriço (2011) A Template-aware Web Accessibility metric. W3C-RDWG Symposium on Website Accessibility Metrics, paper 3. http://www.w3.org/WAI/RD/2011/metrics/paper3/

[Nfernandes11b] N. Fernandes, R. Lopes, L. Carriço (2011) On web accessibility evaluation environments. Proceedings of the International Cross-Disciplinary Conference on Web Accessibility, W4A 2011, article 4. DOI: 10.1145/1969289.1969295

[Nietzio] A. Nietzio, M. Eibegger, M. Goodwin, M. Snaprud (2011) Towards a score function for WCAG 2.0 benchmarking. W3C-RDWG Symposium on Website Accessibility Metrics, paper 11. http://www.w3.org/WAI/RD/2011/metrics/paper11/

[Petrie] H. Petrie, O. Kheir (2007) Relationship between accessibility and usability of web sites. CHI 2007, 397-406, DOI: 10.1145/1240624.1240688

[readability] Readability test. http://en.wikipedia.org/wiki/Readability_test

[Rello] L. Rello, R. Baeza-Yates (2011) Lexical Quality as a Measure for Textual Web Accessibility. W3C-RDWG Symposium on Website Accessibility Metrics, paper 5. http://www.w3.org/WAI/RD/2011/metrics/paper5/

[Vigo07] M. Vigo, M. Arrue, G. Brajnik, R. Lomuscio, J. Abascal (2007) Quantitative metrics for measuring web accessibility. W4A 2007, 99-107, DOI: 10.1145/1243441.1243465

[Vigo11a] M. Vigo and G. Brajnik (2011) Automatic web accessibility metrics: where we are and where we can go. Interacting With Computers 23(2), 137-155, DOI: 10.1016/j.intcom.2011.01.001

[Vigo11b] M. Vigo, J. Abascal, A. Aizpurua, M. Arrue (2011) Attaining Metric Validity and Reliability with the Web Accessibility Quantitative Metric. W3C-RDWG Symposium on Website Accessibility Metrics, paper 6. http://www.w3.org/WAI/RD/2011/metrics/paper6/

[Vigo11c] M. Vigo (2011) Context-Tailored Web Accessibility Metrics. W3C-RDWG Symposium on Website Accessibility Metrics, paper 9. http://www.w3.org/WAI/RD/2011/metrics/paper9/

[Sloan] D. Sloan, B. Kelly (2011) Web Accessibility Metrics For A Post Digital World. W3C-RDWG Symposium on Website Accessibility Metrics, paper 10. http://www.w3.org/WAI/RD/2011/metrics/paper10/

8. Symposium Proceedings

Appendix A: How to cite this document

M. Vigo, G. Brajnik, J. O Connor (2012) W3C Research Note on Web Accessibility Metrics. BibTex file.

@article{w3c_rdwg_webmetrics,
     author = {Markel Vigo and Giorgio Brajnik and Joshue O Connor},
     title = {Research Note on Web Accessibility Metrics},
     journal = {W3C Research and Development Working Group Notes},
     year = {2012},
     month = {????},
     url = {????},
}

Appendix B: Acknowledgements

Participants of the W3C/WAI Research and Development Working Group (RDWG) involved in the development of this document include: Christos Kouroupetroglou, Giorgio Brajnik, Joshue O Connor, Klaus Miesenberger, Markel Vigo, Peter Thiessen, Shadi Abou-Zahra, Shawn Henry, Simon Harper, Vivienne Conway, and Yeliz Yesilada.

RDWG would also like to thank the chairs and scientific committee members as well as the paper authors of the RDWG online symposium on Website Accessibility Metrics.

This document was developed with support from the WAI-ACT Project.