[contents]


Abstract

Web accessibility metrics are quality ratings that help indicate the level of website accessibility. Metrics are an important tool for researchers, developers, governmental agencies, and end users because they have the potential to communicate accessibility in a simple form, such as a number. A variety of website accessibility metrics have been proposed to complement the A, AA, and AAA conformance levels used by the WAI guidelines. However, the validity and reliability of most of these metrics are unknown, so using them introduces risks, such as relying on confusing or inaccurate ratings. This research report defines and explores the main qualities that website accessibility metrics need to consider, including validity, reliability, sensitivity, adequacy, and complexity.

This research report was created following an online symposium held on 5 December 2011 to study how current website accessibility metrics address different quality criteria. The study found that metrics for which validity quality requirements are established are scarce. Some efforts relating to inter-tool reliability were observed, though these do not seem to fully address the quality requirements either. It is important to note that one main outcome of this symposium is that the validity and reliability of web accessibility metrics are, to date, generally unknown. It seems that research in this field is not yet sufficiently mature to provide the necessary methods and tools to compute valid and reliable web accessibility metrics. Another outcome of this research report is a discussion of possible ways forward in accessibility research and development to help reach the necessary maturity point for valid and reliable web accessibility metrics.

Status of this document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This 26 May 2014 Editor's Draft of the Research Report on Web Accessibility Metrics is intended to be published and maintained as a W3C Working Group Note after review and refinement. The note provides a consolidated view of the outcomes of the Website Accessibility Metrics Online Symposium held on 5 December 2011.

This draft incorporates all changes to address the comments received on the previously published Working Draft of 30 August 2012. This draft is intended for final review by the Research and Development Working Group (RDWG) before publication as a W3C Working Group Note.

Please send comments on this Research Report on Web Accessibility Metrics document to public-wai-rd-comments@w3.org (publicly visible mailing list archive).

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document has been produced by the Research and Development Working Group (RDWG), as part of the Web Accessibility Initiative (WAI) International Program Office.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.


Table of Contents

  1. Introduction
  2. Quality Attributes of Web Accessibility Metrics
  3. Current Research
  4. A Research Roadmap for Web Accessibility Metrics
  5. Challenges
  6. Conclusions
  7. References
  8. Symposium Proceedings
  9. Acknowledgements

1. Introduction

The W3C/WAI Web Content Accessibility Guidelines (WCAG) and other WAI guidelines provide discrete conformance levels "A", "AA", and "AAA" to measure the level of accessibility. In many cases more granular scores would help provide a more precise indication of the level of accessibility. However, identifying valid, reliable, sensitive, adequate, and computable metrics that produce such scores is a non-trivial task with several challenges. This research report explores the qualities that such metrics need to demonstrate based on input from an Online Symposium on Website Accessibility Metrics held on 5 December 2011.

1.1 Definition and Background

In the web engineering domain, a metric is a procedure for measuring a property of an individual web page, a website, or a collection of any number of websites. A metric can be the number of links, the size in KB of an HTML file, the number of users that click on a certain link, or the perceived ease of use of a web page. In the realm of web accessibility, amongst others, a metric can measure the following qualities:

In order to measure more abstract qualities, more sophisticated metrics are built upon more basic ones. For instance, readability metrics [readability] take into account the number of syllables, words, and sentences contained in a document in order to measure the complexity of text. Similarly, metrics aiming at measuring web accessibility have been built on specific qualities, which can be inherent in a website (such as images with no alt attribute) or observed from human behavior (e.g., user satisfaction ratings or performance indices such as number of errors). For instance, the failure-rate metric computes the ratio of the number of accessibility violations of a particular set of criteria to the number of potential failure points for the same criteria.
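
To make this concrete, the following is a minimal sketch in Python, with hypothetical counts not taken from any particular tool, of how a failure-rate score for a single criterion could be derived from the violations and potential failure points reported by an evaluation tool:

def failure_rate(violations, failure_points):
    # Ratio of actual violations to potential failure points for one criterion.
    # Returns a value in [0, 1]: 0 means no violations were found,
    # 1 means every potential failure point violates the criterion.
    if failure_points == 0:
        return 0.0  # the criterion does not apply to this page
    return violations / failure_points

# Hypothetical tool output for a single page:
# 3 of the 12 images lack a text alternative.
print(failure_rate(violations=3, failure_points=12))  # 0.25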

As a result of the computation of accessibility metrics, different types of data can be produced:

Web accessibility can be viewed and defined in different ways [Brajnik08]. One way is to consider whether a web page or website is conformant to a set of requirements such as those defined by WCAG 2.0 or by Section 508. While WCAG 2.0 conformance levels are well specified and, as noted above, are ordinal values, other metrics could be defined on the basis of the WCAG 2.0 conformance requirements and success criteria. Metrics that are based on measuring conformance to a given set of criteria, such as WCAG 2.0, can be called conformance-based metrics.

Metrics other than conformance-based metrics can also be defined. For example, the US federal procurement policy known as Section 508 defines accessibility as the extent to which "a technology [...] can be used as effectively by people with disabilities as by those without it". Provided that effectiveness can be measured, metrics based on this definition of accessibility could yield results that differ from conformance-based ones. Analogous to the notion of "quality in use" for software (ISO 9241-11), accessibility-in-use metrics can be defined. These differ from conformance-based metrics in that they measure performance indices (such as success level or mental load) of real users using a web page or website in specific situations, rather than measuring conformance. Thus they do not depend on the notion of conformance with respect to a set of requirements, such as WCAG 2.0. Also, any measure of the accessibility of a web page or website as perceived by users, as opposed to measuring the performance of users, is equally an accessibility-in-use metric. Traditional usability metrics such as effectiveness, efficiency, and satisfaction could be considered accessibility-in-use metrics. Note that this notion of accessibility covers not only the accessibility of the content of web pages, but also the accessibility of user agents and features of assistive technologies, and could even address the different levels of expertise that users have with these resources.

Most of the existing metrics - see a review in [Vigo11a] - are conformance-based because they are mainly built upon measures produced by automated testing tools, such as the number of violations and their WCAG priority. Moreover, in order to overcome the lack of sensitivity and precision of ordinal metrics, conformance-based metrics often yield ratio scores. The main reason for the widespread use of these types of metrics is their efficiency in terms of time and human resources, since they are based on automated tools.

Although most existing conformance-based metrics are fully automated and do not require human intervention (experts' audits or user tests) for their calculation, they are not all based on fully automatable accessibility checks. For example, some metrics estimate the violation rate of semi-automated and purely manual accessibility checks, such as those metrics described in [Vigo07]. Other metrics adopt optimistic or conservative approaches to estimate the violation rate [Lopes].

The error rate of these estimations, which stems from their reliance on automated testing tools, is the major weakness of fully automated conformance-based metrics. In fact, these metrics inherit tool shortcomings, such as false positives and false negatives, that affect their outcome [Brajnik04].

A benchmarking survey on fully automated conformance-based metrics concluded that existing metrics are quite divergent and most of them do not adequately distinguish accessible pages from non-accessible pages [Vigo11a]. On the other hand, there are conformance-based metrics that combine automated accessibility checks and those that require human review, with the goal of better estimating errors; one example is SAMBA [Brajnik07]. Other conformance-based metrics do not rely on automated accessibility testing at all; an example is the evaluation done with the AIR method [AIR].

1.2 Benefits of Web Accessibility Metrics

There are several situations that demonstrate the benefits of web accessibility metrics; these include:

1.2.1 Advantages of Fully Automated Metrics

Fully automated metrics have several advantages over those metrics that require human intervention: they are quick and can effectively cover large volumes of data. For example, scenarios such as adaptive hypermedia and quality assurance require almost real-time scores, while information retrieval and benchmarking scenarios require the processing of large volumes of data, both of which can only be effectively accomplished using automated metrics. These advantages are counterbalanced by problems of validity and reliability related to fully automated testing that are discussed in the sections below. When the goal of measurement is to assess the level of accessibility of a large number of websites (i.e., the benchmarking scenario), large-scale evaluation should rely on sampling methods for efficiency purposes. In this regard, previous research analyzed the behavior of different metrics with respect to the sampling technique employed [Brajnik07b].

2. Quality Attributes of Web Accessibility Metrics

Several quality factors can be defined for web accessibility metrics. These factors can be used to assess how applicable a metric is in a certain scenario and, potentially, how to characterize the risks inherent to the use of a given metric. As discussed in [Vigo11a], validity, reliability, sensitivity, adequacy, and complexity appear to be the most important qualities, which are based on psychometrics research [O'Donnell]. We also highlight that the importance of these qualities will be dependent on the application scenario where they are employed. For instance, inter-tool reliability (consistency among tool results) is a desirable property in information retrieval systems because in such a scenario one could expect to employ just one tool to rank and sort search results. On the other hand, inter-tool reliability is a required property in adaptive hypermedia applications because the inconsistencies generated by different tools have undesirable consequences.

2.1 Validity

This attribute is related to the extent to which the measurements obtained by a metric reflect the accessibility of the website to which it is applied. This could depend on the notion of accessibility: conformance-based versus accessibility-in-use. The former refers to how a web document meets specific criteria (i.e., principles and guidelines), whereas the latter indicates how the interaction is perceived. These two perspectives are not necessarily the same, which can be illustrated as follows: a picture with an incorrect text alternative violates a guideline, making a web page non-conformant; but if the same picture is irrelevant for completing a given task, the incorrect text alternative may not be perceived as an obstacle. Guidelines miss these sorts of situations because covering the broadest possible range of situations and users is a challenging objective.

As discussed above, most existing conformance metrics are plagued by their reliance on automated testing tools and do not provide means to estimate the error rate of tools. This negatively impacts their validity because incorrect results influence the metric. Also the way the metric itself is defined could lead to other sources of errors that further reduce its validity. For example, the failure rate (the frequency with which a certain type of content – such as images – fails to meet certain criteria) should not be used as a measure of accessibility-in-use. Using it as a measure of conformance is also controversial: it is sometimes claimed that it only measures how well developers avoided known errors rather than providing an estimation of conformance [Brajnik11]. Validity with respect to accessibility-in-use should cope with the evaluator effect [Hornbæk], the fact that the results obtained when performing a usability evaluation depend on who performs the test. It also needs to address the phenomenon that users tend to under-rate the severity of issues they perceive [Petrie].

Validity is by far the most important quality attribute for accessibility metrics. Without it we would not know what a metric really measures. The risk of not being able to characterize the validity of metrics is that potential users of metrics would choose those that appear easy to employ and that provide seemingly plausible results. In a sense, people may therefore choose a metric because it is simple rather than because it is a good metric, with the unforeseen consequence that incorrect claims and decisions could be made regarding web pages and websites. These are important issues as they strike at the heart of our notions of conformance. We would be assessing the accessibility of a user interface without knowing whether our method of assessment is itself valid.

2.2 Reliability

This attribute is related to the reproducibility and consistency of scores, i.e., the extent to which they are the same when evaluations of the same web pages are carried out in different contexts (different tools, different people, different goals, different times, etc.). Reliability of a metric depends on several interconnected layers. These range from the underlying tools (what happens if we switch tools?), to the underlying guidelines (what happens if we switch guidelines?), to the evaluation process itself (if random choices are made, for example when scanning a large website).

The inherent inconsistency of unreliable metrics limits the ability of people to predict metric behavior. Unreliable metrics also limit the ability to comprehend problems at a deeper level. However, reliability will not always be necessary. For instance, if we switch guideline sets we should not expect similar results, because a different set of problems is then covered.

It is worth noting that one of the aims of this research report is to help identify errors or spot gaps in current metrics. The idea is that we can thereby confidently reject faulty metrics, or improve them in order to halt a process of "devaluation". This devaluation happens in the mind of the end user, in terms of the perceived value of the "ideal" of conformance. This process can be a by-product of poor metrics themselves, or of misunderstanding the output of metrics that are not clear or easy for end users to understand. In other words, if a metric is not stable, it is very difficult to effectively use it as a tool of either analysis or comprehension.

2.3 Sensitivity

Metric sensitivity is a measure of how changes in metric output reflect actual changes to any given website. Ideally we would like metrics not to be too sensitive, so that they are robust and do not over-react to small changes in web content. This is especially important when the metric is applied to highly dynamic websites, as we show later in this note.

2.4 Adequacy

This is a general quality, encompassing several properties of accessibility metrics, for instance: the type of data used to represent scores, the precision in terms of the resolution of a scale, normalization, and the span covered by the actual values of the metric (its distribution). These attributes determine whether the metric can be suitably deployed in a given scenario. For example, to be able to compare accessibility levels of different websites (as would happen in the large-scale scenario discussed above) metrics should provide normalized values, as otherwise comparisons are not viable. If the distribution of values of the metric is concentrated on a small interval (such as between 0.40 and 0.60, instead of [0, 1]), large changes in accessibility could lead to small changes in the metric, and round-off errors could influence the final outcomes.
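
For instance, a metric whose raw scores cluster in a narrow band could be linearly rescaled before websites are compared. The following sketch in Python, with hypothetical scores, only illustrates the adequacy concern about score distributions; it is not a remedy prescribed by this report:

def rescale(scores):
    # Min-max normalization: map raw scores linearly onto [0, 1].
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]  # no spread: all pages look identical
    return [(s - lo) / (hi - lo) for s in scores]

# Hypothetical raw scores concentrated between 0.40 and 0.60:
raw = [0.41, 0.47, 0.52, 0.58]
print(rescale(raw))  # approximately [0.0, 0.35, 0.65, 1.0]

Note that such rescaling depends on the sample at hand, which is precisely why normalized metric definitions are preferable for cross-site comparisons.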

2.5 Complexity

Depending on the type and quantity of data and the algorithm that is used to compute a metric, the process can be more or less computationally demanding with respect to certain resources, such as time, processors, bandwidth, and memory. The complexity of a metric therefore reflects the computational and human resources it requires, which may prevent stakeholders from embracing accessibility metrics. Some scenarios require metrics to be relatively simple (such as when metrics are used for adaptations of the user interface and must be computed on the fly). However, some metrics may require high bandwidth to crawl large websites, large storage capacity, or increased computing power. For those metrics that rely on human judgment, another complexity aspect is related to the workflow process that has to be established to resolve conflicts and synthesize a single value. As a result, these metrics may not suit particular application scenarios, budgets, or resources.

3. Current Research

The papers that were presented at the symposium cover a broad span of issues, addressing the quality factors we outlined above to different extents. However, they provide new insights and raise new questions that help shape future research avenues (see section 4).

3.1 Addressing Validity and Reliability

Validity in terms of conformance was tackled by Vigo et al. [Vigo11b] by comparing automated accessibility scores with those given by a panel of experts, obtaining a strong positive correlation. Inter-tool reliability of metrics was also addressed by comparing the behavior of the WAQM metric when assessing 1500 pages with two different tools (EvalAccess and LIFT). A very strong correlation was found when pages were ranked according to their scores; but to obtain the same effect with ratio scores the metric requires some ad hoc adjustment. Finally, the authors investigated inter-guideline reliability between WCAG 1.0 and WCAG 2.0, again finding a very strong correlation between ordinal values, although this effect fades when looking at ratio data.
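
As an illustration of the kind of comparison involved, the following sketch computes a Spearman rank correlation between an automated metric and expert ratings for the same pages; the paired scores are hypothetical, not the data from [Vigo11b]:

from scipy.stats import spearmanr

# Hypothetical scores for the same five pages.
automated = [0.82, 0.45, 0.91, 0.30, 0.67]  # automated metric scores
experts   = [0.75, 0.50, 0.88, 0.35, 0.60]  # ratings given by an expert panel

rho, p_value = spearmanr(automated, experts)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")  # rho = 1.00 here: identical rankings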

Fernandes and Benavidez [JFernandes] addressed metric reliability (UWEM and web@X) by comparing two tools (eChecker and eXaminator) with different interpretations of success criteria and different coverage, assessing the accessibility of about 300 pages. An initial experiment shows a moderate positive correlation between those tools.

Reliability of metrics very often depends on the reliability of the underlying testing tools, and it is well known that different tools produce different results on the same pages. During the webinar it was noted that this problem could lead to situations where low credibility is attributed to tools and metrics; metrics would make it even more difficult to compare different outcomes and diagnose bad behavior. In addition, stakeholders could be tempted to adopt the metrics that provide the highest scores, or those that can be more easily interpreted and explained, regardless of their accuracy. However, as previously mentioned, caution needs to be taken when reliable behavior across tools, guidelines, and domains is expected.

3.2 Tool Support for Metrics

The availability of metrics in terms of publicly available algorithms, APIs, or tools is critical for their broad adoption. Providing such mechanisms will help facilitate a broader adoption of metrics by stakeholders - especially by those that, even if interested in using them, do not have the resources to apply and understand them. There are some initial proposals in this direction that implement a set of metrics: Naftali and Clúa [Naftali] presented a platform where failure-rate and UWEM are deployed. However, this entails human intervention, as the system needs input from experts to discard false positives. There are some other tools that help to keep track of the accessibility level of websites over time [Battistelli11a]. These types of tools tend to target the accessibility monitoring of websites within specific geographical locations, normally municipalities or regional governments. QualWeb, the tool provided by Fernandes et al. [NFernandes11a], extends traditional accessibility testing tools with a feature to detect templates. The novelty of this approach is that the metric employed uses the accessibility of the template as a baseline, so that accessibility is measured from that starting point. If the accessibility problems of the template were repaired, these fixes would automatically spread to all the pages built upon the template. Therefore, the distance from a particular web page to the template (or baseline) can be used to estimate the effort required to fix this instance, which is very valuable for quality assurance.
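
A minimal sketch of the baseline idea follows, using hypothetical violation identifiers rather than QualWeb's actual output: the accessibility of a page is measured relative to the violations already present in its template, so only page-specific problems count towards the distance.

def distance_from_template(page_violations, template_violations):
    # Violations inherited from the template are excluded, since repairing the
    # template would remove them from every page built upon it.
    return len(set(page_violations) - set(template_violations))

# Hypothetical violation identifiers reported by an evaluation tool.
template_issues = {"img-missing-alt:header-logo", "low-contrast:footer"}
page_issues = {"img-missing-alt:header-logo", "low-contrast:footer",
               "missing-label:search-box"}
print(distance_from_template(page_issues, template_issues))  # 1 page-specific problem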

3.3 Addressing Large-Scale Measurement

Large-scale evaluation and measurement is required for websites that contain a great number of pages or when a number of websites have to be evaluated. Managing these large volumes of data cannot be done without the help of automated tools. An example of large websites is provided by Fernandes et al. [NFernandes11a]. They present a method for template detection that aims at lessening the computing effort of evaluating large numbers of pages. This is useful for websites that are substantially built on templates, such as online stores. In the case of online stores, the only content that normally changes is the item on sale and its related information, while the layout and internal structure stay the same. One example of measuring accessibility over large numbers of distinct websites is presented by Battistelli et al. [Battistelli11a] using the BIF metric. Similarly, AMA is a platform for keeping track of a large number of websites, used to measure how conformant websites from specific geographical locations are. Nietzio et al. [Nietzio] present a metric to measure WCAG 2.0 conformance in the context of a platform that keeps track of the accessibility of Norwegian municipalities.

3.4 Targeting Particular Accessibility Issues

Battistelli et al. [Battistelli11a] present a metric to quantify the compliance of documents with respect to their DTDs. Instead of measuring this compliance as if it were a binary variable (conformant/non-conformant), compliance is measured as the distance from the current document to the ideal one. Although its relationship with accessibility is not very apparent, code compliance is one of the technical accessibility requirements according to the Italian regulation, and it also impacts those success criteria that require the correct use of standards [see WCAG 2.0 Success Criterion 4.1.1 Parsing]. This approach could also be followed to measure accessibility. For instance, a web page could be improved until it is accessible according to guidelines or until it provides an acceptable experience to end users. The accessibility level of the non-accessible page could then be computed as the effort required to build the ideal web page, measured in lines of code, mark-up tags introduced or removed, or time. Another particular accessibility issue is tackled by Rello and Baeza-Yates [Rello], who address the measurement of text legibility. This affects the understandability of a document, a fundamental accessibility principle [see the Understandable principle]. The interesting contribution of this work is its reliance on a quantitative model of spelling errors automatically computed from a large set of pages handled by a search engine. Compliance with the DTD and the legibility of a web document can be considered not only accessibility success criteria but also quality issues.
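
As a rough illustration of measuring effort as a distance, the following sketch (assuming hypothetical mark-up fragments) counts how many mark-up lines must be added or removed to turn the current page into a repaired, "ideal" version; this is only one crude proxy for the effort discussed above:

import difflib

def repair_effort(current_markup, ideal_markup):
    # Count mark-up lines added or removed between the two versions.
    diff = difflib.ndiff(current_markup.splitlines(), ideal_markup.splitlines())
    return sum(1 for line in diff if line.startswith(("+ ", "- ")))

current = '<img src="logo.png">\n<input type="text" name="q">'
ideal = ('<img src="logo.png" alt="ACME logo">\n'
         '<label for="q">Search</label>\n'
         '<input type="text" name="q" id="q">')
print(repair_effort(current, ideal))  # number of changed lines as an effort estimate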

3.5 Novel Measurement Approaches

When it comes to innovative ways of measuring, the distance from a given document to a reference model can inspire similar approaches to measuring web accessibility. As suggested by [Battistelli11b], compliance can be measured by considering the distance (in terms of differences) between a given document and an ideal (or acceptable) one. This distance can be measured, for instance, in terms of missing hypertext tags or the effort required to accomplish changes. Another example measures the distance from an instance document to a baseline template [NFernandes11a]. Another novel way of measuring accessibility is to use a grading scale and an arbitration process, as proposed by Fischer and Wyatt [Fischer]: the use of a five-point Likert scale aims at going beyond a binary accessible/non-accessible scoring scale. It would be interesting to see, in the future, how the final outcome of an evaluation depends on the original scores given by individual evaluators and what level of agreement exists between evaluators before arbitration takes place.

Vigo [Vigo11c] proposes a method by which, depending on the context, the number of checkpoints to be met changes. Nietzio et al. [Nietzio] suggest a stepwise method to measure conformance to WCAG 2.0, where aspects such as success criteria applicability and tool support are considered. Such a method adapts to the specific testing procedures of WCAG 2.0 success criteria (SC) by providing a set of decision rules: first, the applicability of the SC is analyzed; second, if applicable, the SC is tested; third, if a common failure is not found, the implementation of the sufficient techniques is checked; and finally, tool support is checked for the techniques identified in the previous step. The metric computed as a result of this process is a failure rate that also takes into account the logic underlying the sufficient techniques and common failures for each SC.
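
These decision rules can be read as a small procedure applied per success criterion. The sketch below is a plain Python rendering of the four steps, with hypothetical check functions standing in for manual or tool-supported tests; it is not the implementation described in [Nietzio]:

def evaluate_sc(checks, page):
    if not checks["is_applicable"](page):          # 1. applicability of the SC
        return "not applicable"
    if checks["has_common_failure"](page):         # 2. test the SC (common failures)
        return "fail"
    if not checks["sufficient_technique"](page):   # 3. a sufficient technique implemented?
        return "cannot tell"
    if checks["tool_supported"](page):             # 4. tool support for that technique
        return "pass (tool-supported)"
    return "pass (manual verification)"

# Hypothetical checks for SC 1.1.1 applied to a one-image page.
sc_1_1_1 = {
    "is_applicable": lambda p: "<img" in p,
    "has_common_failure": lambda p: "<img" in p and "alt=" not in p,
    "sufficient_technique": lambda p: "alt=" in p,
    "tool_supported": lambda p: True,
}
print(evaluate_sc(sc_1_1_1, '<img src="logo.png" alt="ACME logo">'))  # pass (tool-supported)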

3.6 Beyond Conformance

Vigo [Vigo11c] proposes a method that not only considers guidelines when measuring accessibility conformance, but also considers the specific features of the device (e.g., screen size, keyboard support) as well as the assistive technology operated by the users. Including these contextual characteristics of the interaction could lead to more faithful measurements of the experience. Finally, Sloan and Kelly [Sloan] claim that understanding accessibility as conformance to guidelines is risky in those countries (e.g., the UK) where accessibility assessment is not limited to guidelines but also focuses on the delivered service and user experience. Therefore, they encourage moving forward and embracing accessibility in terms of user experience and thinking of conformance of the production process, rather than conformance of a product that constantly changes. This perspective is novel in that it looks beyond the current conformance paradigm and aims to tap more into the user experience, and this is something that is not necessarily defined by current methods of technical validation or document conformance.

3.7 Concluding Remarks

The authors of the above papers were queried about some aspects of web accessibility metrics. The first question was about the target users of metrics. The goal of this question was to ascertain whether metrics researchers are targeting specific application scenarios or a specific profile of end users who will utilize the scores provided by metrics. The survey showed that the majority of respondents did not have a specific end user of the metrics in mind, or gave answers that were too generic. However, three papers focus on web accessibility benchmarking (see [Nietzio, Battistelli11a, JFernandes]) and some others could be applied in this domain as well. This means that benchmarking is the application scenario with the broadest acceptance and where the application of metrics is taking off.

The second question explored whether accessibility metrics researchers are aware of the costs and risks incurred by having incorrect values for metrics. There is some tendency towards employing experts in such validations, although most agree that end users will have the last word as far as validation is concerned. This is closely related to our last question, about the research community's point of view on measuring accessibility beyond conformance metrics. All answers we received claimed that measuring accessibility in terms of user experience should be explored more thoroughly.

4. A Research Roadmap for Web Accessibility Metrics

This research report aims at highlighting current efforts in investigating accessibility metrics as well as uncovering existing challenges. Research on web accessibility metrics is increasing as the benefits of using them are becoming more apparent. However, their adoption is still poor. This may occur because:

(1) although several metrics have been defined, evaluations of their strengths and weaknesses are relatively recent [Vigo11a];

(2) quality attributes require further investigation as there are unexplored areas for each of the defined qualities - these areas are uncovered in section 4.1;

(3) the validity of existing metrics is low, which calls for a standardized testbed to show how they perform with regard to metric quality.

Setting up a corpus of web pages for benchmarking purposes could be the first step towards improving this situation. It would work in the same way that the Information Retrieval community tests the performance of their algorithms [see the Text Retrieval Conference, TREC] - see section 5.1. A side effect of the lack of validity and reliability of metrics is their lack of credibility. This could partially be tackled by the mentioned benchmarking corpus. Finally, some other issues such as user-tailored metrics and dealing with dynamic content require special attention from those who aim at conducting research on web accessibility metrics.

4.1 Ensuring Metric Quality

The way a metric satisfies the validity, reliability, sensitivity, adequacy, and complexity qualities remains open and can be addressed by the questions listed in section 4.2. Despite the importance of all quality attributes, the validity and reliability of metrics should be given priority.

4.2 Addressing Validity

Studies of "validity with respect to conformance" could focus on the following research questions:

The above questions could be addressed in the following ways:

Studies of "validity with respect to accessibility in use" should overcome the evaluator effect [Hornbæk] and lack of agreement of users in their severity ratings [Petrie] and could address the following questions:

4.3 Addressing Reliability

There are a number of combinations (tool × metric) that open new research lines when it comes to metric reliability. Investigating how a specific metric changes when using different tools (inter-tool reliability was covered in [Vigo11b] and [JFernandes]) or how different metrics behave with the same tool (inter-metric reliability) are two options, among others. Specifically, some efforts to understand metric reliability could go in the following direction:

4.4 Addressing Other Quality Attributes

4.4.1 Sensitivity

Experiments could be set up to perform sensitivity analysis: given a set of accessibility problems in a test website, they could be systematically turned on or off, and their effects on metric values could be analyzed to find out which kinds of problems had the largest effect and under which circumstances. Provided that valid and reliable metrics were used, this could tell us which accessibility barriers have a stronger or weaker impact on conformance or use.
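
A minimal sketch of such a sensitivity experiment follows, with a hypothetical stand-in metric and a fixed penalty table; a real experiment would inject the problems into a test website and recompute an actual metric:

# Hypothetical per-problem penalties; a real metric would be recomputed from an
# evaluation tool's output after each problem is injected, not read from a table.
PENALTY = {"missing-alt": 0.30, "low-contrast": 0.10,
           "no-label": 0.25, "empty-heading": 0.05}

def metric(active_problems):
    # Stand-in metric: 1.0 minus the penalties of the problems currently injected.
    return max(0.0, 1.0 - sum(PENALTY[p] for p in active_problems))

baseline = metric([])  # score of the clean test page
for problem in PENALTY:
    delta = baseline - metric([problem])
    print(f"{problem}: switching it on changes the score by {delta:.2f}")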

4.4.2 Adequacy

Provided that a metric is valid and reliable, research directions about metric adequacy should analyze the suitability and usefulness of its values for users in different scenarios, as well as metric visualization and presentation issues.

4.4.3 Complexity

The most important issue about metric complexity lies in its relationship with the other qualities. In this regard we can pose the following questions:

5. Challenges

5.1 A Corpus for Benchmarking Metrics

One option for providing a common playground in which the research community could shed light on these challenges would be to organize the same kind of competitions as the TREC experiments. Some efforts have already been directed towards this goal, for example in the W3C/WAI Test Samples Repository and in the context of the BenToWeb project. There are several issues that need to be tackled.

To start with, pages known to be accessible could be collected alongside pages known not to be accessible (because faults were injected into them or because they were collected from other repositories such as www.fixtheweb.net); participants would then be asked to apply their metrics to such pages and report how far apart the accessible pages are from the non-accessible ones. Another option would be to use pages from initiatives such as the one promoted by the W3C/WAI, "BAD: Before and After Demonstration", where, for educational purposes, the process of transforming a non-accessible page into an accessible one is shown.
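
A minimal sketch of how the separation between the two groups of pages could be reported, assuming hypothetical metric scores rather than data from an actual corpus:

from statistics import mean

# Hypothetical scores produced by one metric on the two halves of the corpus.
accessible_pages = [0.92, 0.88, 0.95, 0.85]
non_accessible_pages = [0.40, 0.55, 0.35, 0.48]

gap = mean(accessible_pages) - mean(non_accessible_pages)
print(f"Mean score gap between the two groups: {gap:.2f}")
# A metric that adequately distinguishes the groups should show a clear,
# consistent gap; overlapping score ranges would signal poor discrimination.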

5.2 Credibility issues

Accessibility scores are a great way to communicate the accessibility level of web pages. However, metrics can turn out to be a double-edged sword: while they enhance comprehension, they also hide relevant information and qualitative details of website accessibility. This side effect can motivate metric users to choose the most lenient scheme among the metrics that are available. As a result, there is a risk of undermining the credibility of, and trust in, accessibility metrics.

The fact that different evaluation tools yield different results directly affects metric validity and, in particular, metric reliability. The poor reproducibility of tool-generated evaluation reports and accessibility scores has a side effect on the perception of individuals, in that the web accessibility assessment process can come to be regarded as not very credible.

5.3 User-tailored metrics

User-tailored metrics relate both to accessibility-in-use and to accessibility in terms of conformance, in that in both cases the context may be considered. In the former approach, the interpretation of context is broad: the task, website type, disability, assistive technology employed, etc.; in the latter, the context is set by the success criteria. Personalizing metrics is challenging because not all accessibility barriers and success criteria impact all users in the same way. While some have tried to group guidelines according to their impact on particular user groups, user needs can be so specific that the effect of a given barrier is more closely related to an individual's abilities and cannot be inferred from the fact that a particular user is identified as having a particular disability. Individual needs may deviate considerably from guidelines (e.g., a motor-impaired individual having more residual physical abilities than the guidelines foresee). There are some research actions that could be taken to improve user-tailored metrics:

5.4 Dealing with dynamic content

Measuring something that changes over time can give different results depending on the magnitude of such changes. Modern web pages are dynamic, changing their content over time. These changes are not always a reaction to user interaction but can also be due to other factors such as time, device, geographic location, and more. Especially in Rich Internet Applications, these updates are frequently triggered by scripting techniques that modify the web content. Therefore, the mark-up gives only a few hints for predicting the behavior of the actual web content provided to the end user. Normally, the most appropriate way to assess the current instance of dynamic web content is to retrieve and test its Document Object Model (DOM); subsequent mutations should then be monitored and tested. As expected, different instances of the same web page show inconsistent accessibility evaluation results [Fernandes11a]. As a result, if a metric is sensitive enough, it should be able to reflect such changes to the content.

This area calls for research on the frequency of testing: should pages be tested every time they update, or should content be retrieved at sampling intervals? Additionally, there are some other questions: what would be the accessibility score of a given URL if page updates entail changes in accessibility? Should the scores of all instances be averaged?
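
One possible answer to the last question is sketched below: snapshots of the page are scored at regular sampling intervals and the scores are averaged. The evaluate_snapshot function is a hypothetical placeholder for rendering the DOM and running accessibility checks on it:

def sampled_score(snapshot_scores):
    # Cumulative average of the scores obtained for each sampled instance of the page.
    return sum(snapshot_scores) / len(snapshot_scores)

def evaluate_snapshot(url, minute):
    # Hypothetical placeholder: a real implementation would load the page,
    # wait for scripted updates, and evaluate the resulting DOM.
    return 0.70 if minute < 60 else 0.55  # the page becomes less accessible after an update

scores = [evaluate_snapshot("http://example.org/news", m) for m in range(0, 120, 30)]
print(sampled_score(scores))  # single score summarizing all sampled instances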

Conformance to WAI-ARIA and the accessibility features introduced by HTML5 could also be explored by future accessibility metrics.

5.5 Combining conformance-based accessibility and accessibility in use

Conformance-based accessibility and accessibility in use are different ways of measuring accessibility. The former checks how a web page conforms to a set of guidelines; to do so, automated tools are used and expert reviews and user tests are conducted. It is suggested that the latter may employ usability metrics to measure the efficiency and effectiveness of a web page (subjective metrics such as satisfaction are often used too). Each approach has its advantages and disadvantages: guidelines seem to be the only way to systematically operationalize accessibility evaluations. However, it was argued that WCAG 2.0 conformance-based evaluations only catch about 50% of the problems encountered by users [Power]. Due to the high variability in human behavior on the Web, accessibility-in-use metrics demand a high number of users to guarantee their external validity. That is why, in order to mitigate the weaknesses of each approach, both approaches can be combined in a complementary way. Orchestrating conformance-based testing with accessibility-in-use metrics may help to obtain a broader coverage of accessibility problems.
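
As a purely illustrative sketch of one way the two approaches might be combined (a simple weighted average, not a method endorsed by this report), assuming a conformance score and a task-success rate both normalized to [0, 1]:

def combined_score(conformance, in_use, weight=0.5):
    # Weighted combination of a conformance-based score and an
    # accessibility-in-use score (e.g., task success rate from user tests).
    return weight * conformance + (1 - weight) * in_use

# Hypothetical inputs: automated conformance score and observed task success rate.
print(round(combined_score(conformance=0.78, in_use=0.60, weight=0.4), 3))  # 0.672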

5.6 Combining automated tests and expert review

One way of reducing the inherent lack of validity and reliability of automated metrics is to involve experts in the calculation process (see SAMBA [Brajnik07]). The problem with this approach is that the metric becomes semi-automatic. It therefore introduces delays in those application scenarios defined in section 1.2 that require real-time computation of accessibility scores, and it makes it unfeasible to test a vast number of pages in large-scale scenarios. Innovative solutions that incorporate human judgment into automated metrics while keeping the essence of automated testing (fast response and large coverage) will certainly make a difference in the future.

6. Conclusions

This research report introduces web accessibility metrics. They have been defined and specified, the benefits of using them have been highlighted, and some possible application scenarios have been described. Spurred by the growing number of different metrics that are being released, this report also describes the quality attributes that a good metric should have. As a result, metrics can be benchmarked according to their validity, reliability, sensitivity, adequacy, and complexity. We believe this framework can help individuals make decisions on the adoption of existing metrics according to the qualities they require from metrics. In this way, there is no need to reinvent the wheel and design new metrics if available metrics already fit one's needs.

An online symposium was held to explore how metrics address the above-mentioned qualities and to investigate current efforts to address quality aspects of web accessibility metrics. The symposium provided a partial but concrete snapshot of some of the research activity around this topic. The symposium identified tool reliability as a recurrent topic in this regard and showed that much research and development is still necessary to develop more mature methods and tools that support metric validity. A conclusion from this research report is that more effort needs to be directed towards investigating the validity and reliability of metrics. Employing metrics whose validity and reliability are questionable is a very risky practice that should be avoided.

One way to hide the inherent complexity of metrics is to provide tools that facilitate their application in an automated or semi-automated way. As mentioned in Section 1.2, this need for automation comes from the necessity of assessing large volumes of data and websites. That is why large-scale analysis of accessibility calls for metrics that can easily be deployed and implemented. Some other efforts target specific quality aspects of the Web, such as lexical quality and compliance with formalized grammars such as DTDs. Finally, an emerging trend aims at measuring accessibility beyond pure compliance with guidelines. Since contextual factors play an important role in determining the quality of the user experience, accessibility measurement should be able to consider these factors by collecting and including them in the measurement process or by observing the behavior and performance of real users in real settings, as in usability testing. This perspective can be understood as a complementary approach to current accessibility measurement practice.

Based on the needs and gaps that hinder current accessibility measurement, we propose a number of research avenues that can help to boost the acceptance and quality of accessibility metrics. Above all, quality issues of metric validity and reliability need urgent action, but there are also other actions that can help to make metrics more credible and widespread. A common corpus for metrics benchmarking would be a good step in this direction as it could potentially tackle quality and credibility issues at the same time. Dynamic content and user-tailoring aspects can open new research paths that can have a strong impact on the quality of assessment practices, methodologies, and tools.

7. References

8. Symposium Proceedings

Research Report on Web Accessibility Metrics

This document should be cited as follows:

M. Vigo, G. Brajnik, J. O Connor, eds. Research Report on Web Accessibility Metrics. 
     W3C WAI Research and Development Working Group (RDWG) Notes. (2014)
     Available at: http://www.w3.org/TR/accessibility-metrics-report

The latest version of this document is available at:

http://www.w3.org/TR/accessibility-metrics-report/

A permanent link to this version of the document is:

http://www.w3.org/TR/2012/WD-accessibility-metrics-report-20120830/

A BibTex file is provided containing:

@incollection {accessibility-metrics-report_FPWD,
  author = {W3C WAI Research and Development Working Group (RDWG)},
  title = {Research Report on Web Accessibility Metrics},
  booktitle = {W3C WAI Symposium on Website Accessibility Metrics},
  publisher = {W3C Web Accessibility Initiative (WAI)},
  year = {2012}, month = {August},
  editor = {Markel Vigo and Giorgio Brajnik and Joshue O Connor eds.},
  series = {W3C WAI Research and Development Working Group (RDWG) Notes},
  type = {Research Report},
  edition = {First Public Working Draft},
  url = {http://www.w3.org/TR/accessibility-metrics-report},
}

Contributed Extended Abstract Papers

The links provided in this section, including those in the BibTex files, are permanent; see also the W3C URI Persistence Policy.

@proceedings{accessibility-metrics-proceedings,
     title = {W3C WAI Symposium on Website Accessibility Metrics},
     year = {2011},
     editor = {W3C WAI Research and Development Working Group (RDWG)},
     series = {W3C WAI Research and Development Working Group (RDWG) Symposia},
     publisher = {W3C Web Accessibility Initiative (WAI)},
     url = {http://www.w3.org/WAI/RD/2011/metrics/},
}

9. Acknowledgements

Participants of the W3C WAI Research and Development Working Group (RDWG) involved in the development of this document include: Christos Kouroupetroglou, Giorgio Brajnik, Joshue O Connor, Klaus Miesenberger, Markel Vigo, Peter Thiessen, Shadi Abou-Zahra, Shawn Henry, Simon Harper, Vivienne Conway, and Yeliz Yesilada.

RDWG would also like to thank the chairs and scientific committee members as well as the paper authors of the RDWG online symposium on Website Accessibility Metrics.

This document was developed with support from the WAI-ACT Project.