

Abstract

Web accessibility metrics are an invaluable tool for researchers, developers, governmental agencies and end users. Accessibility metrics indicate the accessibility level of individual websites and support large-scale surveys of the accessibility of many websites. Recently, a plethora of metrics has been released to complement the "A", "AA", and "AAA" conformance levels used by the WAI guidelines. However, the validity and reliability of most of these metrics are unknown, and those making use of them risk relying on inappropriate metrics. In order to address these concerns, this note provides a framework that considers validity, reliability, sensitivity, adequacy and complexity as the main qualities that a metric should have.

A symposium was organized to observe how current practices address such qualities. We found that metrics addressing validity issues are scarce, although some efforts are visible as far as inter-tool reliability is concerned. The research community should be aware of this, as we may be investing effort in metrics whose validity and reliability are unknown. The research field is perhaps not mature enough, or we do not yet have the right methods and tools. We therefore try to shed some light on the possible paths that could be taken so that we can reach a point of maturity.

Status of this document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This 14 November 2012 Editor Draft of Research Report on Web Accessibility Metrics is intended to be published and maintained as a W3C Working Group Note, and provides a consolidated view of the outcomes of the Website Accessibility Metrics Online Symposium held on 5 December 2011.

The Research and Development Working Group (RDWG) believes it has addressed the comments received on the Public Working Draft of 30 August 2012 and invites further discussion and feedback on this draft document by research and practitioners interested in metrics for web accessibility, in particular by participants of the online symposium.

Please send comments on this Research Report on Web Accessibility Metrics document by [@@@ as soon as possible] to public-wai-rd-comments@w3.org (publicly visible mailing list archive).

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document has been produced by the Research and Development Working Group (RDWG), as part of the Web Accessibility Initiative (WAI) International Program Office.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.


Table of Contents

  1. Introduction
  2. A Framework for Quality of Accessibility Metrics
  3. Current Research
  4. A Research Roadmap for Web Accessibility Metrics
  5. Challenges
  6. Conclusions
  7. References
  8. Symposium Proceedings
  9. Acknowledgements

1. Introduction

The W3C/WAI Web Content Accessibility Guidelines (WCAG) and other WAI guidelines provide discrete conformance levels "A", "AA", and "AAA" to measure the level of accessibility. In many cases more granular scores would help provide a more precise indication of the level of accessibility. However, identifying valid, reliable, sensitive, adequate, and computable metrics that produce such scores is a non-trivial task with several challenges. This research report explores the qualities that such metrics need to demonstrate, based on input from an Online Symposium on Website Accessibility Metrics held on 5 December 2011.

1.1 Definition and background

In the web engineering domain, a metric is a procedure for measuring a property of a web page or website. A metric can be the number of links, the size in KB of an HTML file, the number of users that click on a certain link, or the perceived ease of use of a web page. In the realm of web accessibility, amongst others, a metric can measure the following qualities:

In order to measure more abstract qualities, more sophisticated metrics are built upon more basic ones. For instance, readability metrics [readability] take into account the number of syllables, words and sentences contained in a document in order to measure the complexity of a text. Similarly, metrics aiming at measuring web accessibility have been built on specific qualities, which can be inherent in a website (such as images with no alt attribute) or observed from human behavior (e.g., user satisfaction ratings or performance indexes such as the number of errors). For instance, the failure-rate metric computes the ratio of the number of accessibility violations for a particular set of criteria to the number of potential failure points for the same criteria.
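
For illustration, the failure rate just described could be computed as follows. This is a minimal sketch in Python with hypothetical names and inputs; it is not the normative definition of any published metric.

# Minimal sketch of a failure-rate computation (hypothetical names and
# inputs; not the normative definition of any published metric).
def failure_rate(violations, potential_failure_points):
    """Ratio of observed violations to potential failure points for a set of criteria."""
    if potential_failure_points == 0:
        return 0.0  # the criteria are not applicable to this page
    return violations / potential_failure_points

# Example: 3 images lack a text alternative out of 12 images checked
print(failure_rate(3, 12))  # 0.25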

As a result of the computation of accessibility metrics, different types of data can be produced:

Web accessibility can be viewed and defined in different ways [Brajnik08]. One way is to consider whether a web page/website is conformant to a set of requirements such as those defined by WCAG 2.0 or by Section 508. Even if WCAG 2.0 conformance levels are well specified and, as seen above, they are ordinal values, some other metrics could be defined on the basis of success criteria and their sufficient, advisory and failure techniques. We call these metrics, which are based on whether success criteria of given guidelines are met, conformance-based metrics.

Other metrics that go beyond conformance-based metrics can be defined. For example, the US federal procurement policy known as Section 508 defines accessibility as the extent to which "a technology [...] can be used as effectively by people with disabilities as by those without it". Provided that effectiveness can be measured, such metrics could yield results that differ from conformance-based ones. Analogous to the notion of "quality in use" for software, we call these accessibility-in-use metrics to emphasize that they try to measure performance indexes that can be shown by real users when using the website in specific situations. In addition, they do not require the notion of conformance with respect to a set of principles. Traditional usability metrics such as effectiveness, efficiency and satisfaction could be considered accessibility-in-use metrics. Also, any measure of the perceived accessibility of a web page by users is a metric belonging to this second group. Notice that this notion of accessibility covers not only accessibility of the content of web pages, but also accessibility of user agents, features of assistive technologies, and could even address different levels of expertise that users have with these resources.

Most of the existing metrics - see a review in [Vigo11a] - are of the former type because they are mainly built upon criteria implemented by automated testing tools, such as the number of violations or their WCAG priority. Moreover, in order to overcome the lack of sensitivity and precision of ordinal metrics, conformance metrics often yield ratio scores. The main reason for the widespread use of these types of metrics is their low cost in terms of time and human resources, since they are based on automated tools. Although no human intervention (experts' audits or user tests) is required in the process, this does not necessarily mean that only fully automated success criteria are to be considered. Some metrics estimate the violation rate of semi-automated and purely manual success criteria, as in [Vigo07]; others adopt an optimistic vs. conservative approach on their violation rate [Lopes].

The error rate of these estimations, which stems from their reliance on automated testing tools, is the major weakness of automated conformance metrics. In fact, these metrics inherit tool shortcomings, such as false positives and false negatives, that affect their outcome [Brajnik04].

A benchmarking survey on automated conformance metrics concluded that existing metrics are quite divergent and most of them do not do a good job in distinguishing accessible pages from non-accessible pages [Vigo11a]. On the other hand, there are metrics that combine testing tool metrics and those produced by human review, with the goal of estimating such errors; one example is SAMBA [Brajnik07]. Other metrics do not rely on automated testing at all; an example is the evaluation done with the AIR method [AIR].

1.2 The Benefits of Using Metrics

There are several scenarios that could benefit from automatic web accessibility metrics:

Automatic metrics have several advantages over those metrics that require human intervention: scenarios such as adaptive hypermedia and quality assurance require almost real-time scores, while information retrieval and benchmarking scenarios require processing large volumes of data. These benefits are counterbalanced by the validity and reliability problems related to automatic testing, which are discussed in the sections below. If the goal of measurement is to obtain a diagnosis of the accessibility of a large number of sites (i.e., the benchmarking scenario), large-scale evaluation should rely on sampling methods for efficiency purposes. In this regard, previous research analysed the behaviour of different metrics with respect to the sampling technique employed [Brajnik07b].

2. A Framework for Quality of Accessibility Metrics

Several quality factors can be defined for web accessibility metrics, factors that can be used to assess how applicable a metric is in a certain scenario and, potentially, to characterize the risks inherent in the use of a given metric. As discussed in [Vigo11a], validity, reliability, sensitivity, adequacy and complexity appear to be the most important qualities, which are based on psychometrics research [O'Donnell]. We also highlight that the importance of these qualities depends on the application scenario where they are employed; for instance, intra-tool reliability is a desirable property in information retrieval systems, as just one tool is expected to be employed to rank search results. On the other hand, intra-tool reliability is required in adaptive hypermedia applications, as the inconsistencies generated by employing different tools can have undesirable consequences.

2.1 Validity

This attribute is related to the extent to which the measurements obtained by a metric reflect the accessibility of the website to which it is applied, and this could depend on the notion of accessibility: conformance vs. accessibility-in-use. The former refers to how well a web document meets specific criteria (i.e., principles and guidelines), whereas the latter indicates how the interaction is perceived. These two perspectives are not necessarily the same, which can be illustrated as follows: a picture without alternative text violates a guideline, making a web page non-conformant; yet if the same picture is irrelevant for conducting a given task, the lack of alternative text may not be perceived as an obstacle. Such situations are missed by guidelines, since covering the broadest possible range of situations and users is a challenging objective.

As discussed above, most existing conformance metrics are plagued by their reliance on automated testing tools and do not provide means to estimate the error rate of tools. Furthermore, the way the metric itself is defined could introduce other sources of error, reducing its validity. For example, the failure rate should not be used as a measure of accessibility-in-use; using it as a measure of conformance is also controversial: it is sometimes claimed that it measures how well developers coped with accessibility features rather than providing an estimation of conformance [Brajnik11]. Validity with respect to accessibility-in-use should cope with the evaluator effect [Hornbæk] and the lack of agreement of users in their severity ratings [Petrie].

Validity is by far the most important quality attribute for accessibility metrics. Without it we would not know what a metric really measures. The risk of not being able to characterize the validity of metrics is that potential users would choose those metrics that appear easy to employ and that provide seemingly plausible results. In a sense, people may therefore choose a metric because it is simple rather than because it is a good metric, with the unforeseen consequence that incorrect claims and decisions could be made regarding web pages and websites. These are important issues as they strike at the heart of our notions of conformance: we are assessing the accessibility of a user interface without knowing whether our method of assessment is actually valid itself.

2.2 Reliability

This attribute is related to the reproducibility and consistency of scores, i.e. the extent to which they are the same when evaluations of the same web pages are carried out in different contexts (different tools, different people, different goals, different time). Reliability of a metric depends on several layers that are interconnected. These range from the underlying tools (what happens if we switch tools?), to underlying guidelines (what happens if we switch guidelines?), to the evaluation process itself (if random choices are made, for example when scanning a large website).

The inherent inconsistency of unreliable metrics limits people's ability to predict metric behavior; it also limits the ability to comprehend the metrics at a deeper level. However, reliability will not always be necessary. For instance, if we switch guideline sets we should not expect similar results, as a different problem coverage is assumed.

It is worth noting that one of the aims of this research report is to help identify errors, or spot gaps in current metrics. The idea is that we can thereby confidently reject faulty metrics, or improve them in order to halt a process of "devaluation". This devaluation happens in the mind of the end user, in terms of the perceived value of the "ideal" of conformance. This process can be a by-product of poor metrics themselves or come from misunderstanding the output from metrics that are not clear or easy for end users to understand. In other words, if a metric is not stable, it is very difficult to effectively use it as a tool of either analysis or comprehension.

2.3 Sensitivity

Metric sensitivity is a measure of how changes to a given website are reflected in changes in the metric's output. Ideally we would like metrics not to be too sensitive, so that they are robust and do not over-react to small changes in web content. This is especially important when the metric is applied to highly dynamic websites, as we show later in this note.

2.4 Adequacy

This is a general quality, encompassing several properties of accessibility metrics, for instance: the type of data used to represent scores, the precision in terms of the resolution of a scale, normalization, and the span covered by actual values of the metric (its distribution). These attributes determine whether the metric can be suitably deployed in a given scenario. For example, to be able to compare accessibility levels of different websites (as would happen in the large-scale scenario discussed above) metrics should provide normalized values, as otherwise comparisons are not viable. If the distribution of values of the metric is concentrated on a small interval (such as between 0.40 and 0.60, instead of [0, 1]), large changes in accessibility could lead to small changes in the metric, and round-off errors could influence the final outcomes.
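
As an illustration of the normalization issue, the following Python sketch rescales raw scores clustered in a narrow band to the [0, 1] interval; the min-max scheme and the sample values are assumptions made for illustration, and real metrics may define their own normalization.

# Sketch: min-max normalization of raw metric scores to [0, 1] so that
# different websites can be compared (illustrative values and scheme only).
def normalize(score, min_score, max_score):
    if max_score == min_score:
        return 0.0  # degenerate scale: no resolution at all
    return (score - min_score) / (max_score - min_score)

raw_scores = [0.43, 0.47, 0.55, 0.58]   # values concentrated in a narrow band
lo, hi = min(raw_scores), max(raw_scores)
print([round(normalize(s, lo, hi), 2) for s in raw_scores])  # [0.0, 0.27, 0.8, 1.0]
# Rescaling spreads the values over the full interval, but large real changes
# in accessibility may still map to small raw differences.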

2.5 Complexity

Depending on the type and quantity of data and the algorithm used to compute a metric, the process can be more or less demanding with respect to certain resources, such as time, processors, bandwidth and memory. The complexity of a metric therefore reflects the computational and human resources it requires, which can prevent stakeholders from embracing accessibility metrics. Some scenarios rely on the fact that metrics have to be relatively simple (such as when metrics are used for adaptations of the user interface and must be computed on the fly). However, some metrics may require high bandwidth to crawl large websites, large storage capacity or increased computing power. For those metrics that rely on human judgment, another complexity aspect is related to the workflow process that has to be established to resolve conflicts and synthesize a single value. As a result, these metrics may not suit particular application scenarios, budgets or resources.

3. Current Research

The papers that were presented at the symposium cover a broad span of issues, addressing the quality factors we outlined above to different extents. However, they provide new insights and raise new questions that help shape future research avenues (see section 4).

3.1 Addressing Validity and Reliability

Validity in terms of conformance was tackled by Vigo et al. [Vigo11b] by comparing automated accessibility scores with those given by a panel of experts, obtaining a strong positive correlation. Inter-tool reliability of metrics was also addressed by comparing the behavior of the WAQM metric on 1500 pages assessed with two different tools (EvalAccess and LIFT). A very strong correlation was found when pages were ranked according to their scores, but to obtain the same effect with ratio scores the metric requires some ad-hoc adjustment. Finally, the authors investigated inter-guideline reliability between WCAG 1.0 and WCAG 2.0, again finding a very strong correlation between ordinal values, although this effect fades when looking at ratio data.

Fernandes and Benavidez [JFernandes] addressed metric reliability (UWEM and web@X) by comparing two tools (eChecker and eXaminator), each with a different interpretation of success criteria and coverage, assessing the accessibility of about 300 pages. An initial experiment shows a moderate positive correlation between those tools.

Reliability of metrics very often depends on the reliability of the underlying testing tools, and it is well known that different tools produce different results on the same pages. During the webinar it was noted that this problem could lead to situations where low credibility is attributed to tools and metrics; metrics would make it even more difficult to compare different outcomes and diagnose bad behavior. In addition, stakeholders could be tempted to adopt the metrics that provide the best results on their pages, or those that can be more easily interpreted and explained, regardless of whether they are related to accessibility. However, as we mention previously, we should be cautious about when we should expect reliable behavior across tools, guidelines or domains.

3.2 Tool Support for Metrics

The availability of metrics in terms of publicly available algorithms, APIs or tools is critical for their broad adoption. Providing such mechanisms will help facilitate a broader adoption of metrics by stakeholders - especially by those that, even if interested in using them, do not have the resources to operate and articulate them. There are some incipient proposals in this direction that implement a set of metrics: Naftali and Clúa [Naftali] presented a platform where failure-rate and UWEM are deployed. However, this does entail that human intervention is required, as the system needs the input of experts to discard false positives. There are some other tools that help to keep track of the accessibility level of websites over time [Battistelli11a]. These sorts of tools tend to target the accessibility monitoring of websites within determined geographical locations, normally municipalities or regional governments. The tool support provided by Fernandes et al. [NFernandes11a], QualWeb, incorporates a feature within traditional accessibility testing tools to detect templates; the novelty of this approach is that the metric employed uses the accessibility of the template as a baseline. As a result, accessibility is measured from that starting point. If the accessibility problems of the template were repaired, these fixes would automatically spread to all the pages built upon the template. Therefore, the distance from a particular web page to the template (or baseline) can be used to estimate the effort required to fix this instance, which is very valuable for quality assurance.
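
The following is a loose Python sketch of that idea, not the QualWeb implementation: the violation counts and page names are hypothetical, and a page's score is simply expressed relative to the violations already present in its template.

# Loose sketch of the template-as-baseline idea (hypothetical data; not the
# QualWeb implementation).
def distance_from_template(page_violations, template_violations):
    """Violations introduced by a page beyond those of its template.

    Fixing the template would remove its violations from every page built on
    it, so the remaining distance estimates the page-specific repair effort.
    """
    return max(0, page_violations - template_violations)

template_violations = 4                               # found in the shared template
pages = {"product-1.html": 9, "product-2.html": 5}    # total violations per page
for url, total in pages.items():
    print(url, distance_from_template(total, template_violations))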

3.3 Addressing Large-Scale Measurement

Large-scale evaluation and measurement is required for websites that contain a large number of pages or when a number of websites have to be evaluated. Managing such large volumes of data cannot be done without the help of automated tools. An example of handling large websites is provided by Fernandes et al. [NFernandes11a]. They present a method for template detection that aims at lessening the computing effort of evaluating large numbers of pages. This is useful for websites that are substantially built on templates, such as online stores. In the online store example, normally the only content that changes is the item to be sold and the related information; the layout and internal structure stay the same. One example that contemplates measuring the accessibility of a large number of distinct websites is described by Battistelli et al. [Battistelli11a] using the BIF metric; similarly, AMA is a platform that enables keeping track of a large number of websites and is used to measure how conformant the websites of specific geographical locations are. Finally, Nietzio et al. [Nietzio] present a metric to measure WCAG 2.0 conformance in the context of a platform that keeps track of the accessibility of Norwegian municipalities.

3.4 Targeting Particular Accessibility Issues

Battistelli et al. [Battistelli11a] present a metric to quantify the compliance of documents with respect to their DTDs. Instead of measuring this compliance as if it were a binary variable (conformant/non-conformant), compliance is measured as the distance from the current document to the ideal one. Although its relationship with accessibility is not very apparent, code compliance is one of the technical accessibility requirements according to the Italian regulation, and it also impacts those success criteria that require the correct use of standards [see WCAG 2.0 Success Criterion 4.1.1 Parsing]. This approach could also be followed to measure accessibility. For instance, a web page could be improved until it was accessible according to guidelines or until it provided an acceptable experience to end users. The accessibility level of the non-accessible page could then be computed as the effort required to build the ideal web page, measured in lines of code, mark-up tags introduced or removed, or time. Another approach that tackles a particular accessibility problem is presented by Rello and Baeza-Yates [Rello], who address the measurement of text legibility. This affects the understandability of a document, a fundamental accessibility principle [see the Understandable principle]. The interesting contribution of this work is its reliance on a quantitative model of spelling errors automatically computed from a large set of pages handled by a search engine. Compliance with the DTD and legibility of a web document can be considered not only accessibility success criteria but also quality issues.

3.5 Novel Measurement Approaches

When it comes to innovative ways of measuring, the distance from a given document to a reference model can inspire similar approaches to measure web accessibility. As suggested by [Battistelli11b], compliance can be measured by considering the distance between a given document and an ideal (or acceptable) one. In this case the distance can be measured, for instance, in terms of missing hypertext tags or the effort required to accomplish changes. Another example is illustrated by measuring the distance from an instance document to a baseline template using a metric [NFernandes11a]. Another novel way of measuring accessibility is to use a grading scale and an arbitration process, as proposed by Fischer and Wyatt [Fischer]: the use of a five-point Likert scale aims at going beyond a binary accessible/non-accessible scoring scale. It would be interesting to see, in the future, how the final outcome of an evaluation depends on the original scores given by individual evaluators and what level of agreement exists between evaluators before arbitration takes place.

Vigo [Vigo11c] proposes a method by which, depending on the context, the number of checkpoints to be met changes. Nietzio et al. [Nietzio] suggest a stepwise method to measure conformance to WCAG 2.0, where aspects such as success criteria applicability and tool support are considered. The method adapts to the specific testing procedures of WCAG 2.0 success criteria (SC) by providing a set of decision rules: first, the applicability of the SC is analyzed; second, if applicable, the SC is tested; third, if a common failure is not found, the implementation of the sufficient techniques is checked; and finally, tool support is checked for the techniques identified in the previous step. The metric computed as a result of this process is a failure rate that also takes into account the logic underlying the sufficient techniques and common failures for each SC.
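
The following Python sketch illustrates how such decision rules could be chained and turned into a failure rate. It is not the authors' implementation: the SuccessCriterion stand-in and its boolean checks are assumptions that abstract over the real WCAG 2.0 test procedures.

# Hedged sketch of a stepwise decision procedure in the spirit of the method
# described above (rule order from the text; SuccessCriterion is a stand-in).
from dataclasses import dataclass

@dataclass
class SuccessCriterion:
    applicable: bool            # step 1: does the SC apply to this page?
    common_failure: bool        # step 2: was a known common failure detected?
    sufficient_technique: bool  # step 3: is a sufficient technique implemented?
    tool_supported: bool        # step 4: can the tool check those techniques?

def evaluate(sc):
    """Return 'not-applicable', 'fail', 'pass' or 'unknown' for one SC."""
    if not sc.applicable:
        return "not-applicable"
    if sc.common_failure:
        return "fail"
    if sc.sufficient_technique:
        return "pass"
    return "fail" if sc.tool_supported else "unknown"

def failure_rate(criteria):
    """Failure rate over the success criteria that could actually be decided."""
    decided = [r for r in map(evaluate, criteria) if r in ("pass", "fail")]
    return sum(r == "fail" for r in decided) / len(decided) if decided else 0.0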

3.6 Beyond Conformance

Vigo [Vigo11c] proposes a method that not only considers guidelines when measuring accessibility conformance, but also considers the specific features of the device (e.g., screen size, keyboard support) as well as the assistive technology operated by the users. Including these contextual characteristics of the interaction could lead to more faithful measurements of the experience. Finally, Sloan and Kelly [Sloan] claim that understanding accessibility as conformance to guidelines is risky in those countries (e.g., the UK) where accessibility assessment is not limited to guidelines but also focuses on the delivered service and user experience. Therefore, they encourage moving forward and embracing accessibility in terms of user experience and thinking of conformance of the production process, rather than conformance of a product that constantly changes. This perspective is novel in that it looks beyond the current conformance paradigm and aims to tap more into the user experience, and this is something that is not necessarily defined by current methods of technical validation or document conformance.

3.7 Concluding Remarks

The authors of the above papers were asked about some aspects of web accessibility metrics. The first aspect concerns the target users of metrics; the goal of this question is to ascertain whether metrics researchers have in mind application scenarios or the profile of the end user who will make decisions based on the scores provided by metrics. Our survey shows that the majority of respondents do not have a specific end user of metrics in mind, or their answers are too generic. However, three papers are focused on web accessibility benchmarking (see [Nietzio, Battistelli11a, JFernandes]) and some others could be applied in this domain. This means that benchmarking is the application scenario with the broadest acceptance and where the application of metrics is taking off. In the remaining scenarios (quality assurance, information retrieval and the adaptive web) there are also potential applications, although the intent to apply metrics in these scenarios is less evident.

Second, we wanted to know whether accessibility metrics researchers are aware of the costs and risks incurred by having incorrect values for metrics. Most respondents consider that the validity and reliability of metrics should be guaranteed, although many contemplate this as future work. There is some tendency towards employing experts in such validations, although most agree that users will have the last word as far as validation is concerned. This is closely related to our last question about the research community's point of view on measuring accessibility beyond conformance metrics. All the answers we received claimed that measuring accessibility in terms of user experience should be explored more thoroughly.

4. A Research Roadmap for Web Accessibility Metrics

This research report aims at highlighting current efforts in investigating accessibility metrics as well as uncovering existing challenges. Research on web accessibility metrics is taking off as the benefits of using them become apparent; however, their adoption is far from widespread. In addition to their relative novelty, this may be because (1) there is a plethora of metrics out there, and frameworks for metrics comparison that show their strengths and weaknesses are relatively recent [Vigo11a]; (2) quality frameworks require further investigation, as there are unexplored areas for each of the defined qualities - these areas are uncovered in section 4.1; and (3) the validity of existing metrics is low, which calls for a standardized testbed to show how they perform with regard to metric quality. Setting up a corpus of web pages for benchmarking purposes could be the first step towards this goal. It would work in the same way that the Information Retrieval community tests the performance of their algorithms [see the Text Retrieval Conference, TREC] - see section 5.1. A side effect of the lack of validity and reliability of metrics is their lack of credibility. This could partially be tackled by the mentioned benchmarking corpus; however, the credibility problem goes beyond that - see section 5.2. Finally, some other issues, such as user-tailored metrics and dealing with dynamic content, require special attention from those who aim at conducting research on web accessibility metrics.

4.1 Ensuring Metric Quality

Focusing more precisely on accessibility metric quality, there are still many challenges to pursue. The way a metric satisfies the validity, reliability, sensitivity, adequacy and complexity qualities remains an open issue and can be addressed by the questions posed in the following subsections. Even if all qualities are important, we emphasize that validity and reliability of metrics should be given priority: it does not matter how sensitive or adequate a metric is if we cannot ensure its reliability and especially its validity. Which qualities a metric should meet was analysed elsewhere [Vigo11a], where it was established that, depending on the application scenario, some qualities are required, desirable or optional.

4.2 Validity

Studies of "validity with respect to conformance" could focus on the following research questions:

The above questions could be addressed in the following way:

Studies of "validity with respect to accessibility in use" should overcome the evaluator effect [Hornbæk] and lack of agreement of users in their severity ratings [Petrie] and could address the following questions:

4.3 Reliability

There are a number of combinations (tool × metric) that open new research lines when it comes to metric reliability: investigating how a specific metric changes when using different tools (inter-tool reliability was covered in [Vigo11b] and [JFernandes]) or how different metrics behave with the same tool (inter-metric reliability) are some options, to name a few. Specifically, some efforts to understand metric reliability could go in the following direction:

4.4 Other Qualities

4.4.1 Sensitivity

Experiments could be set up to perform sensitivity analysis: given a set of accessibility problems in a test website, they could be systematically turned on or off, and their effects on metric values could be analyzed to find out which kinds of problems have the largest effect and under which circumstances. Provided that valid and reliable metrics were used, this could tell us which accessibility barriers have a stronger or weaker impact on conformance or use.
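
A minimal sketch of such an experiment is shown below in Python. It represents a test page simply as the set of barriers switched on; both this representation and the toy metric are assumptions made for illustration.

# Sketch of a sensitivity experiment: switch injected barriers on one at a
# time and record how much each one moves the metric score.
def sensitivity_analysis(clean_page_barriers, injectable_barriers, metric):
    """Effect of each injectable barrier on the metric score."""
    baseline = metric(clean_page_barriers)                    # all barriers off
    effects = {b: metric(clean_page_barriers | {b}) - baseline
               for b in injectable_barriers}                  # one barrier on at a time
    # Barriers with the largest absolute effect are the most influential.
    return dict(sorted(effects.items(), key=lambda kv: abs(kv[1]), reverse=True))

# Toy metric that just counts barriers, rescaled to [0, 1]
toy_metric = lambda barriers: min(1.0, len(barriers) / 10)
print(sensitivity_analysis(set(), {"missing-alt", "low-contrast"}, toy_metric))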

4.4.2 Adequacy

Provided that a metric is valid and reliable, research directions about metric adequacy should analyze the suitability and usefulness of its values for users in different scenarios, as well as metric visualization and presentation issues.

4.4.3 Complexity

The most important issue about metric complexity lies in its relationship with the rest of the qualities. In this regard we can pose the following questions:

5. Challenges

5.1 A Corpus for Benchmarking Metrics

One option for providing a common playground, so that the research community can shed some light on these challenges, would be to organize the same kind of competitions as the TREC experiments. Recently, some efforts have been directed towards this goal by the W3C and in the context of the BenToWeb project. There are several issues that need to be tackled.

To start with, pages known to be accessible could be collected, together with pages known not to be accessible (because faults were injected into them or because they were collected from repositories such as www.fixtheweb.net); participants would then be asked to apply their metrics to these pages and report how far apart the accessible pages are from the non-accessible ones. Another option would be to use pages from initiatives such as the one promoted by WAI, "BAD: Before and After Demonstration", where, for educational purposes, the process of transforming a non-accessible page into an accessible one is shown.
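
A sketch of how such submissions could be scored is given below, in Python. The simple mean gap between the two score distributions is an assumption made for illustration; a real benchmark would likely use a proper statistical test or effect size.

# Sketch: how far apart does a metric place accessible and non-accessible pages?
def separation(metric, accessible_pages, inaccessible_pages):
    """Difference between the mean scores of the two page sets."""
    mean = lambda xs: sum(xs) / len(xs)
    return (mean([metric(p) for p in accessible_pages])
            - mean([metric(p) for p in inaccessible_pages]))

# Toy example with precomputed scores (identity metric); a larger gap means
# the metric discriminates better between the two sets.
print(separation(lambda s: s, [0.9, 0.85, 0.95], [0.3, 0.4, 0.2]))  # 0.6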

5.2 Credibility issues

Accessibility scores are a great device for grasping the accessibility level of web pages. However, metrics can turn out to be a double-edged sword: while they enhance comprehension, they can also hide relevant information and details on the accessibility of a page. This side effect can lead end users to choose the most lenient scores among the metrics that are available. As a result, there is a risk of undermining the credibility of and trust in accessibility metrics.

The fact that different evaluation tools yield different results directly affects metric validity and, in particular, metric reliability. The poor reproducibility of evaluation reports and accessibility scores has a side-effect on the perception of individuals in that the web accessibility assessment process can be regarded as not very credible.

5.3 User-tailored metrics

User-tailored metrics relate to both accessibility in use and accessibility in terms of conformance, in that in both cases the context may be considered. Under the former approach, the interpretation of context would be broad: the task, website type, disability, assistive technology employed, etc., while under the latter the context is set by the success criteria. There is a challenge for the personalization of metrics, as not all accessibility barriers and success criteria impact all users in the same way. While some have tried to group guidelines according to their impact on determined user groups, user needs can be so specific that the effect of a given barrier is more closely related to the individual's abilities and cannot be inferred from the fact that a particular user is identified as having a particular disability. Individual needs may deviate considerably from group guidelines (e.g., a motor-impaired individual having more residual physical abilities than the group guidelines foresee). There are some research actions that could be taken to improve user-tailored metrics:

5.4 Dealing with dynamic content

Measuring something that changes over time can give different results depending on the magnitude of such changes. Modern web pages are dynamic, changing their content over time. These changes are not always a reaction to user interaction but can also be due to other factors such as time or location. Especially in Rich Internet Applications, these updates are frequently provoked by scripting techniques that mutate web content. Therefore, the mark-up gives few hints for predicting the behavior of a web document. Normally, the most appropriate way to assess the current instance of a dynamic web document is to retrieve and test its DOM; its subsequent mutations should then be monitored and tested. As expected, different instances of a document caused by updates show inconsistent accessibility evaluation results [NFernandes11a]. As a result, if a metric is sensitive enough, it should be able to reflect these updates.

This area calls for research on the frequency of testing: should pages be tested every time they update, or should they be retrieved at sampling intervals? Additionally, there are some other questions: what would be the accessibility score of a given URL if page updates entail changes in accessibility? Should an average over all instances be accumulated?
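
One possible, purely illustrative answer to these questions is sketched below in Python: sample the DOM of a dynamic page at fixed intervals, score each instance, and accumulate an average. The functions fetch_dom_snapshot and metric are hypothetical components supplied by the caller.

# Sketch: average a metric over several sampled instances of a dynamic page
# (fetch_dom_snapshot and metric are assumed, hypothetical functions).
import time

def score_dynamic_page(url, metric, fetch_dom_snapshot, samples=5, interval=60):
    scores = []
    for _ in range(samples):
        dom = fetch_dom_snapshot(url)   # current instance of the document
        scores.append(metric(dom))
        time.sleep(interval)            # wait before sampling the next instance
    return sum(scores) / len(scores)    # one possible aggregation: the mean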

The conformance to WAI-ARIA and the accessibility elements subsumed by HTML5 could also be explored by future accessibility metrics.

5.5 Combining conformance-based accessibility and accessibility in use

Conformance-based accessibility and accessibility in use are different ways of measuring accessibility. The former checks how a web page conforms to a set of guidelines; to do so, automated tools, expert reviews and user tests are conducted. It is suggested that the latter may employ usability metrics to measure the efficiency and effectiveness of a web page (subjective metrics such as satisfaction are often used too). Each approach has its advantages and disadvantages: guidelines seem to be the only way to systematically operationalise accessibility evaluations; however, it was found that WCAG 2.0 conformance-based evaluations only catch 50% of the problems encountered by users [Power]. Due to the high variability in human behaviour on the web, accessibility-in-use metrics demand a high number of users to guarantee the external validity of metrics. That is why, in order to mitigate the weaknesses of each approach, we suggest that both can be combined in a complementary way. Orchestrating conformance-based testing with accessibility-in-use metrics may help to obtain a broader coverage of accessibility problems.

5.6 Combining automated tests and expert review

One way of ameliorating the inherent lack of validity and reliability of automated metrics is to involve experts in the calculation process (see SAMBA [Brajnik07]). The problem with this approach is that it becomes semi-automatic. It therefore introduces delays in those application scenarios defined in section 1.2 that require real-time computation of accessibility scores, and it makes it unfeasible to test a vast number of pages in large-scale scenarios. Innovative solutions that incorporate human judgement into automatic metrics while keeping the essence of automated testing (fast response and large scale) will certainly make a difference in the future.

6. Conclusions

This research report introduces web accessibility metrics: they have been defined and specified, the benefits of using them have been highlighted and some possible application scenarios have been described. Spurred by the growing number of different metrics that are being released, we present a framework that encompasses the qualities that a good metric should have. As a result, metrics can be benchmarked according to their validity, reliability, sensitivity, adequacy and complexity. We believe this framework can help individuals to make decisions on the adoption of existing metrics according to the qualities required from metrics. In this way, there will not be the need to reinvent the wheel and design new metrics if available metrics already fit one's needs.

A symposium was held in order to check how metrics address the above-mentioned qualities and to keep track of current efforts targeting quality issues of accessibility metrics. The webinar provided a partial, but concrete, snapshot of most of the research activity around this topic. We found that tool reliability is a recurrent topic in this regard, and there is still a long way to go in the realm of methods and examples for metric validity, which are rare. The editors of this research report believe that more effort should be directed towards investigating the validity and reliability of metrics. Employing metrics whose validity and reliability are questionable is a very risky practice that should be avoided. We therefore claim that accessibility metrics should be used and designed responsibly.

One way to hide the inherent complexity of metrics is to provide tools that facilitate their application in an automated or semi-automated way. This need for automation comes from the necessity of assessing large volumes of data and websites; that is why large-scale analysis of accessibility calls for metrics that can easily be deployed and implemented. Some other efforts target specific quality aspects of the Web, such as lexical quality or compliance with DTDs. Finally, an emerging trend aims at measuring accessibility not only in pure compliance terms. Since contextual factors play an important role in determining the quality of the user experience, accessibility measurement should be able to consider these factors by collecting and including them in the measurement process, or by observing the behavior and performance of real users in real settings, a la usability testing. This perspective can be understood as a complementary approach to current accessibility measurement practice.

Based on the needs and gaps that hinder current accessibility measurement, we propose a number of research avenues that can help to boost the acceptance and quality of accessibility metrics. Above all, quality issues of metric validity and reliability need urgent action, but there are also other actions that can help to make metrics more credible and widespread. A common corpus for metrics benchmarking would be a good step in this direction, as it could potentially tackle quality and credibility issues at the same time. Dynamic content and user-tailoring aspects can open new research paths that can have a strong impact on the quality of assessment practices, methodologies and tools.

7. References

8. Symposium Proceedings

Research Report on Web Accessibility Metrics

This document should be cited as follows:

M. Vigo, G. Brajnik, J. O Connor, eds. Research Report on Web Accessibility Metrics. 
     W3C WAI Research and Development Working Group (RDWG) Notes. (2012)
     Available at: http://www.w3.org/TR/accessibility-metrics-report

The latest version of this document is available at:

http://www.w3.org/TR/accessibility-metrics-report/

A permanent link to this version of the document is:

http://www.w3.org/TR/2012/WD-accessibility-metrics-report-20120830/

A BibTex file is provided containing:

@incollection {accessibility-metrics-report_FPWD,
  author = {W3C WAI Research and Development Working Group (RDWG)},
  title = {Research Report on Web Accessibility Metrics},
  booktitle = {W3C WAI Symposium on Website Accessibility Metrics},
  publisher = {W3C Web Accessibility Initiative (WAI)},
  year = {2012}, month = {August},
  editor = {Markel Vigo and Giorgio Brajnik and Joshue O Connor},
  series = {W3C WAI Research and Development Working Group (RDWG) Notes},
  type = {Research Report},
  edition = {First Public Working Draft},
  url = {http://www.w3.org/TR/accessibility-metrics-report},
}

Contributed Extended Abstract Papers

The links provided in this section, including those in the BibTex files, are permanent; see also the W3C URI Persistence Policy.

@proceedings{accessibility-metrics-proceedings,
     title = {W3C WAI Symposium on Website Accessibility Metrics},
     year = {2011},
     editor = {W3C WAI Research and Development Working Group (RDWG)},
     series = {W3C WAI Research and Development Working Group (RDWG) Symposia},
     publisher = {W3C Web Accessibility Initiative (WAI)},
     url = {http://www.w3.org/WAI/RD/2011/metrics/},
}

9. Acknowledgements

Participants of the W3C WAI Research and Development Working Group (RDWG) involved in the development of this document include: Christos Kouroupetroglou, Giorgio Brajnik, Joshue O Connor, Klaus Miesenberger, Markel Vigo, Peter Thiessen, Shadi Abou-Zahra, Shawn Henry, Simon Harper, Vivienne Conway, and Yeliz Yesilada.

RDWG would also like to thank the chairs and scientific committee members as well as the paper authors of the RDWG online symposium on Website Accessibility Metrics.

This document was developed with support from the WAI-ACT Project.