Research Report on Web Accessibility Metrics

W3C Editors' Draft 8 June 2012

This version: http://www.w3.org/WAI/RD/2011/metrics/note/ED-metrics-20120608
Latest published version: none
Latest internal version: http://www.w3.org/WAI/RD/2011/metrics/note/ED-metrics
Previous published version: none
Previous internal version: http://www.w3.org/WAI/RD/2011/metrics/note/ED-metrics-20120530
Editors: Markel Vigo, University of Manchester; Giorgio Brajnik, University of Udine; Joshue O Connor, NCBI Centre for Inclusive Technology

A BibTeX file is provided; see also information on citing and referencing this document.

Abstract

Web accessibility metrics are an invaluable tool for researchers, developers, governmental agencies and end users. Accessibility metrics help to better grasp the accessibility level of websites and are therefore helpful for making decisions based on the scores they produce. Recently, a plethora of metrics have been released; however, the validity and reliability of most of these metrics are unknown, and those making use of them run the risk of using inappropriate metrics. To overcome this situation, this note provides a framework that considers validity, reliability, sensitivity, adequacy and complexity as the main qualities that a metric should have.

A symposium was organised to observe how current practice is addressing such qualities. We found that metrics addressing validity issues are scarce, although some efforts can be perceived as far as inter-tool reliability is concerned. This is something that the research community should be aware of, as we might be making futile efforts by using metrics whose validity and reliability are unknown. The research realm is perhaps not mature enough or we do not have the right methods and tools. We therefore try to shed some light on the possible paths that could be taken so that we can reach a maturity point.

Status of this document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This 8 June 2012 Editors' Draft of Research Report on Web Accessibility Metrics is intended to be published and maintained as a W3C Working Group Note after review and refinement. The note provides an initial consolidated view of the outcomes of the Website Accessibility Metrics Online Symposium held on 5 December 2011.

The Research and Development Working Group (RDWG) invites discussion and feedback on this draft document by researchers and practitioners interested in metrics for web accessibility, in particular by participants of the online symposium. Specifically, RDWG is looking for feedback on:

Summaries of the extended abstracts contributed to the online symposium;
Discussion about the state-of-the-art and conclusions drawn in the document;
Related resources that may be useful to the discussion within the document.

Please send comments on this Research Report on Web Accessibility Metrics document by @@@ to @@@ (publicly visible mailing list archive).

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document has been produced by the Research and Development Working Group (RDWG), as part of the Web Accessibility Initiative (WAI) International Program Office.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. The group does not expect this document to become a W3C Recommendation. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; this page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

Introduction
- 1.1 Definition and background
- 1.2 The Benefits of Using Metrics
A Framework for Quality of Accessibility Metrics
Current Research
A Research Roadmap for Web Accessibility Metrics
A Corpus for Metrics Benchmarking
Conclusions
References
Symposium Proceedings
Acknowledgements

1. Introduction

1.1 Definition and background

In the web engineering domain, a metric is a procedure for measuring a property of a web page or website. A metric can be the number of links, the size in KB of an HTML file, the number of users that click on a certain link, or the perceived ease of use of a web page. In the realm of web accessibility, a metric can measure, amongst others, the following qualities:

The number of pictures without an alt attribute.
The number of Level A and AA success criteria violations.
The number of possible failure points where accessibility issues can potentially happen (such as the number of images in a page).
The severity of an accessibility barrier.
The time taken to conduct a task.

In order to measure more abstract qualities, more sophisticated metrics are built upon more basic ones. For instance, readability metrics [readability] take into account the number of syllables, words and sentences contained in a document in order to measure the complexity of a text. Similarly, metrics aiming at measuring web accessibility have been built on specific qualities, which can be inherent in a website (such as images with no alt attribute) or observed from human behaviour (e.g., user satisfaction ratings or performance indexes such as number of errors). For instance, the failure-rate metric computes the ratio of the number of accessibility violations of a particular set of criteria to the number of failure points for the same criteria.
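As a minimal sketch of the failure-rate idea described above, the following Python function divides the number of violations by the number of failure points for one set of criteria. The function name, the argument checks and the choice to return 0.0 when there are no failure points are assumptions made for illustration; they are not taken from any of the cited metrics.

```python
def failure_rate(violations: int, failure_points: int) -> float:
    """Ratio of actual violations to potential failure points for one set of criteria."""
    if violations < 0 or failure_points < 0:
        raise ValueError("counts must be non-negative")
    if violations > failure_points:
        raise ValueError("violations cannot exceed failure points")
    if failure_points == 0:
        # Assumption: with no place where the criteria could fail, report 0.0.
        return 0.0
    return violations / failure_points


# Hypothetical page: 20 images (failure points), 5 of them without an alt attribute.
print(failure_rate(5, 20))  # 0.25
```

A page with no failure points could equally be reported as "not applicable"; which convention is appropriate depends on how the scores are aggregated afterwards.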

As a result of the computation of accessibility metrics, different types of data can be produced:

Ordinal values, like WCAG 2.0 conformance levels (AAA, AA, A), or "accessible"/"non-accessible" scores; these conformance levels can be computed by a metric defined as "a web page is only accessible if all relevant success criteria are met, otherwise it is inaccessible".
Quantitative ratio values such as 0, 175, -15 or 0.38.
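The ordinal case can be sketched in Python as follows. The sketch assumes that each relevant success criterion has already been judged as met or not met and that results arrive as a mapping from criterion identifier to a (level, passed) pair; that data layout and the function name are hypothetical. It relies only on the WCAG 2.0 rule that a conformance level requires all success criteria of that level and of the lower levels to be satisfied.

```python
from typing import Dict, Optional, Tuple

LEVELS = ("A", "AA", "AAA")  # WCAG 2.0 conformance levels, lowest first


def conformance_level(results: Dict[str, Tuple[str, bool]]) -> Optional[str]:
    """Return the highest level whose criteria (and all lower ones) are met, or None."""
    achieved = None
    for level in LEVELS:
        if all(passed for lvl, passed in results.values() if lvl == level):
            achieved = level
        else:
            break  # a failed level also rules out every higher level
    return achieved


# Hypothetical evaluation results for one page.
page = {
    "1.1.1": ("A", True),     # Non-text Content
    "1.4.3": ("AA", True),    # Contrast (Minimum)
    "1.4.6": ("AAA", False),  # Contrast (Enhanced)
}
print(conformance_level(page))  # AA
```

The output is ordinal: it says that AA is better than A, but not by how much, which is one reason why ratio scores are often preferred when finer distinctions are needed.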

Web accessibility can be viewed and defined in different ways [Brajnik08]. One way is to consider whether a web page/website is conformant to a set of principles such as WCAG 2.0 or Section 508. Even if WCAG 2.0 conformance levels are well specified and, as seen above, they are ordinal values, some other metrics could be defined on the basis of success criteria and their sufficient, advisory and failure techniques. We call these metrics, which are based on whether success criteria of given guidelines are met, conformance-based metrics.

Other metrics can be defined if one assumes that accessibility is a quality that differs from conformance. For example, Section 508 defines accessibility as the extent to which "a technology [...] can be used as effectively by people with disabilities as by those without it". Provided that effectiveness can be measured, such metrics could yield results that differ from conformance-based ones. Analogous to the notion of "quality in use" for software, we call these accessibility-in-use metrics to emphasise that they try to measure performance indexes that can be shown by real users when using the website in specific situations. In addition, they do not require the notion of conformance with respect to a set of principles. Traditional usability metrics such as effectiveness, efficiency and satisfaction could be considered accessibility-in-use metrics. Also, any measure of the perceived accessibility of a web page by users is a metric belonging to this second group. Notice that this notion of accessibility covers not only accessibility of the content of web pages, but also accessibility of user agents, features of assistive technologies, and could even address different levels of expertise that users have with these resources.

Most of the existing metrics - see a review in [Vigo11a] - are of the former type because they are mainly built upon criteria implemented by automatic testing tools such as the number of violations or their WCAG priority. Moreover, in order to overcome the lack of sensitivity and precision of ordinal metrics, conformance metrics often yield ratio scores. The main reason for the widespread use of these types of metrics is their low cost in terms of time and human resources, since they are based on automatic tools. Although no human intervention (experts' audits or user tests) is required in the process, this does not necessarily mean that only fully automated success criteria are to be considered. Some metrics estimate the violation rate of semi-automatic success criteria and purely manual ones, as in [Vigo07]; some others adopt an optimistic vs. conservative approach on their violation rate [Lopes].

The error rate of these estimations, which stems from their reliance on automated testing tools, is the major weakness of automatic conformance metrics. In fact, these metrics inherit tool shortcomings, such as false positives and false negatives, that affect their outcome [Brajnik04].

A benchmarking survey on automatic conformance metrics concluded that existing metrics are quite divergent and most of them do not do a good job at distinguishing accessible pages from non-accessible pages [Vigo11a]. On the other hand, there are metrics that combine testing tool metrics and those produced by human review, with the goal of estimating such errors; one example is SAMBA [Brajnik07]. Other metrics do not rely on tools at all; an example is the evaluation done with the AIR method [AIR].

1.2 The Benefits of Using Metrics

There are several scenarios that could benefit from web accessibility metrics:

Quality assurance within web engineering can exploit metrics as a way for developers to better understand the accessibility level of their artifacts throughout the development cycle.
Benchmarking can exploit metrics as a way to explore, at a high-scale, the accessibility level of web pages, such as within a domain (like .gov) or within geographical areas (like in different European states).
Information retrieval systems can implement metrics as one of the criteria to rank web pages. Therefore users would be able to retrieve not only pages that suit their information needs but also those that are accessible.
Adaptive hypermedia techniques (e.g. adaptive navigation support) can make use of metrics to provide guidance or as a criterion to perform interface adaptations.

2. A Framework for Quality of Accessibility Metrics

Several quality factors can be defined for web accessibility metrics, factors that can be used to assess how applicable a metric is in a certain scenario and potentially, how to characterize the risks inherent in the use of a given metric. As discussed in [Vigo11a], validity, reliability, sensitivity, adequacy and complexity appear to be the most important factors.

2.1 Validity

This attribute is related to the extent to which the measurements obtained by a metric reflect the accessibility of the website to which it is applied, and this could depend on the notion of accessibility: conformance vs accessibility-in-use. The former refers to how a web document meets specific criteria (i.e., principles and guidelines), whereas the latter indicates how the interaction is perceived. These two perspectives are not necessarily the same, which can be illustrated as follows: a picture without alternative text violates a guideline, making a web page non-conformant; however, the lack of alternative text may not be perceived as an obstacle if the goal of the user is to navigate or even purchase an item in an e-commerce website.

As discussed above, most existing conformance metrics are plagued by their reliance on automatic testing tools and do not provide means to estimate the error rate of tools. Furthermore, the way the metric itself is defined could lead to other sources of errors, reducing its validity. For example, the failure rate should not be used as a measure of accessibility-in-use; using it as a measure of conformance is also controversial: it is sometimes claimed that it measures how well developers coped with accessibility features rather than providing an estimation of conformance [Brajnik11]. Validity with respect to accessibility-in-use should cope with the evaluator effect [Hornbæk] and the lack of agreement among users in their severity ratings [Petrie].

Validity is by far the most important quality attribute for accessibility metrics. Without it we would not know what a metric really measures. The risk of not being able to characterize validity of metrics is that potential users of metrics would choose those that appear easy to employ and that provide seemingly plausible results. In a sense, people may therefore choose a metric because it is simple rather than because it is a good metric, with the unforeseen consequence that incorrect claims and decisions could be made regarding webpages and websites. These are important issues as they strike at the heart of our notions of conformance. We are assessing the validity of a user interface without knowing if our method of assessment is actually valid itself.

2.2 Reliability

This attribute is related to the reproducibility and consistency of scores, i.e. the extent to which they are the same when evaluations of the same web pages are carried out in different contexts (different tools, different people, different goals, different time). Reliability of a metric depends on several layers that are interconnected. These range from the underlying tools (what happens if we switch tools?), to underlying guidelines (what happens if we switch guidelines?), to the evaluation process itself (if random choices are made, for example when scanning a large website).

The inherent inconsistency of unreliable metrics limits the ability of people to predict metric behavior; it also limits how deeply such metrics can be understood. However, reliability will not always be necessary. For instance, if we switch guideline sets we should not expect similar results, as a different problem coverage is assumed.

It is worth noting that one of the aims of this research report is to help identify errors, or spot gaps in current metrics. The idea is that we can thereby confidently reject faulty metrics, or improve them in order to halt a process of "devaluation". This devaluation happens in the mind of the end user, in terms of the perceived value of the "ideal" of conformance. This process can be a by-product of poor metrics themselves or come from misunderstanding the output from metrics that are not clear or easy for end users to understand. In other words, if a metric is not stable, it is very difficult to effectively use it as a tool of either analysis or comprehension.

2.3 Sensitivity

Metric sensitivity is a measure of how changes to a given website are reflected in changes in metric output. Ideally we would like metrics not to be too sensitive, so that they are robust and do not over-react to small changes in web content. This is especially important when the metric is applied to highly dynamic websites, as we show later in this note.

2.4 Adequacy

This is a general quality, encompassing several properties of accessibility metrics, for instance: the type of data used to represent scores, the precision in terms of the resolution of a scale, normalization, the span covered by actual values of the metric (distribution). These attributes determine if the metric can be suitably deployed in a given scenario. For example, to be able to compare accessibility levels of different websites (as would happen in the large scale scenario discussed above) metrics should provide normalized values as otherwise comparisons are not viable. If the distribution of values of the metric is concentrated on a small interval (such as between 0.40 and 0.60, instead of [0, 1]), large changes in accessibility could lead to small changes in the metric; roundoff errors could influence the final outcomes.
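The concern about narrow distributions and roundoff can be illustrated with a small Python sketch. It assumes a hypothetical metric whose scores for three websites fall between 0.40 and 0.60, reports them to one decimal place, and then rescales them to [0, 1] using those bounds. The bounds, the scores and the function name are illustrative assumptions, and linear rescaling is only one possible way of spreading values over the full scale.

```python
def rescale(score: float, low: float = 0.40, high: float = 0.60) -> float:
    """Linearly map a score from the assumed interval [low, high] onto [0, 1]."""
    if not low <= score <= high:
        raise ValueError("score lies outside the assumed interval")
    return (score - low) / (high - low)


# Hypothetical raw scores for three websites, all inside the narrow interval.
raw = [0.42, 0.44, 0.58]

# Reported to one decimal place, the first two sites become indistinguishable.
print([round(s, 1) for s in raw])

# After rescaling, the same precision keeps the three sites apart.
print([round(rescale(s), 1) for s in raw])
```

In this example the first two sites collapse onto the same rounded value before rescaling but remain distinct afterwards, which is the kind of effect the adequacy quality is meant to capture.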

2.5 Complexity

Depending on the type and quantity of different data and the algorithm that is used to compute a metric, the process can be more or less computationally demanding with respect to certain resources, such as time, processors, bandwidth, memory. Therefore the complexity of a metric reflects the computational and human resources that prevent stakeholders from embracing accessibility metrics. Some scenarios rely on the fact that metrics have to be relatively simple (such as when metrics are used for adaptations of the user interface, and must be computed on the fly). However, some metrics may require high bandwidth to crawl large websites, large storage capacity or increased computing power. For those metrics that rely on human judgment, another complexity aspect is related to the workflow process that has to be established to resolve conflicts and synthesize a single value. As a result, these metrics may not suit particular application scenarios, budgets or resources.

3. Current Research

The papers that were presented at the symposium cover a broad span of issues, addressing the quality factors we outlined above to different extents. However, they provide new insights and ask new questions that help shape future research avenues (see section 4).

3.1 Addressing Validity and Reliability

Validity in terms of conformance was tackled by Vigo et al. [Vigo11b] by comparing automatic accessibility scores with the ones given by a panel of experts, obtaining a strong positive correlation. Inter-tool reliability of metrics was also addressed by comparing the behaviour of the WAQM metric assessing 1500 pages with two different tools (EvalAccess and LIFT). A very strong correlation was found when pages were ranked according to their scores; but to obtain the same effect with ratio scores the metric requires some ad-hoc adjustment. Finally, the authors investigated inter-guideline reliability between WCAG 1.0 and WCAG 2.0 finding again a very strong correlation between ordinal values although this effect fades out when looking at ratio data.
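The difference between agreement on rankings and agreement on ratio scores can be sketched in Python. The example below computes a rank correlation (Spearman) and a linear correlation (Pearson) for the scores that two tools assign to the same pages. The scores are invented for illustration, and the report does not state which correlation coefficients were used in the cited study, so the choice of Spearman and Pearson is an assumption.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Linear correlation between two equally long lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Rank correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical scores of the same five pages under two different tools.
tool_a = [0.10, 0.20, 0.30, 0.40, 0.90]
tool_b = [0.05, 0.07, 0.08, 0.10, 0.95]

print(round(spearman(tool_a, tool_b), 3))  # both tools order the pages identically
print(round(pearson(tool_a, tool_b), 3))   # the ratio scores agree less closely
```

Here the two tools rank the pages in the same order, so the rank correlation is maximal, while the linear correlation of the raw scores is lower. This mirrors the observation that ordinal agreement can hold even when ratio scores need adjustment before they can be compared.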

Fernandes and Benavides [JFernandes] addressed metric reliability (UWEM and web@X) by comparing two tools (eChecker and eXaminator) with a different interpretation of success criteria and coverage, assessing the accessibility of about 300 pages. An initial experiment shows there is a positive moderate correlation between those tools.

Reliability of metrics very often depends on the reliability of the underlying testing tools, and it is well known that different tools produce different results on the same pages. During the webinar it was noted that this problem could lead to situations where low credibility is attributed to tools and metrics; metrics would make it even more difficult to compare different outcomes and diagnose bad behavior. In addition, stakeholders could be tempted to adopt the metric that provides the best results on their pages, or the one that can be more easily interpreted and explained, regardless of whether it is related to accessibility. However, as we mentioned previously, we should be cautious about when we should expect reliable behaviour across tools, guidelines or domains.

3.2 Tool Support for Metrics

The availability of metrics in terms of publicly available algorithms, APIs or tools is critical for their broad adoption. Providing such mechanisms will help facilitate a broader adoption of metrics by stakeholders - especially by those that, even if interested in using them, do not have the resources to operate and articulate them. There are some incipient proposals in this direction that implement a set of metrics: Naftali and Clúa [Naftali] presented a platform where failure-rate and UWEM are deployed. However, this entails human intervention, as the system needs the input of experts to discard false positives. There are some other tools that help to keep track of the accessibility level of websites over time [Battistelli11a]. Tools of this sort tend to target the accessibility monitoring of websites within determined geographical locations, normally municipalities or regional governments. The tool support provided by Fernandes et al. [NFernandes11a], QualWeb, incorporates a feature within traditional accessibility testing tools to detect templates; the novelty of this approach is that the metric employed uses the accessibility of the template as a baseline. As a result, accessibility is measured from that starting point. If the accessibility problems of the template were repaired, these fixes would automatically spread to all the pages built upon the template. Therefore, the distance from a particular web page to the template (or baseline) can be used to estimate the effort required to fix this instance, which is very valuable for quality assurance.

3.3 Addressing Large-Scale Measurement

Large-scale evaluation and measurement is required for websites that contain a great many pages or when a number of websites have to be evaluated. Managing these large volumes of data cannot be done without the help of automated tools. An example involving large websites is provided by Fernandes et al. [NFernandes11a]. They present a method for template detection that aims at lessening the computing effort of evaluating large numbers of pages. This is useful for websites that are substantially built on templates, such as on-line stores. In the on-line stores example, normally, the only content that changes is the item to be sold and the related information; however, the layout and internal structure stay the same. One example that contemplates measuring the accessibility of a large number of distinct websites is depicted by Battistelli et al. [Battistelli11a] using the BIF metric; similarly, AMA is a platform that enables keeping track of a large number of websites, which is used to measure how conformant the websites of specific geographical locations are. Finally, Nietzio et al. [Nietzio] present a metric to measure WCAG 2.0 conformance in the context of a platform to keep track of the accessibility of Norwegian

3.4 Targeting Particular Accessibility Issues

Battistelli et al. [Battistelli11a] present a metric to quantify the compliance of documents with respect to their DTDs. Instead of measuring this compliance as if it were a binary variable (conformant/non-conformant), compliance is measured as the distance from the current document to the ideal one. Although its relationship with accessibility is not very apparent, code compliance is one of the technical accessibility requirements according to the Italian regulation and it also impacts on those success criteria that require the correct use of standards [see WCAG SC 4.1.1 Parsing]. Also, this approach could be followed to measure accessibility. For instance, a web page could be improved until it was accessible according to guidelines or until it provides an acceptable experience to end users. The accessibility level of the non-accessible page could be computed in terms of the effort required to build the ideal web page in terms of coding lines, mark-up tags introduced or removed, or time. Another particular accessibility problem is tackled by Rello and Baeza-Yates [Rello], who address the measurement of text legibility. This is something that affects the understandability of a document, a fundamental accessibility principle [see the Understandable principle]. The interesting contribution of this work is its reliance on a quantitative model of spelling errors automatically computed from a large set of pages handled by a search engine. Compliance with the DTD and legibility of a web document can be considered not only accessibility success criteria but also quality issues.

3.5 Novel Measurement Approaches

When it comes to innovative ways of measuring, the distance from a given document to a reference model can inspire similar approaches to measure web accessibility. As suggested by [Battistelli11b], compliance can be measured by considering the distance between a given document and an ideal (or acceptable) one. In this case this distance can be measured, for instance, in terms of missing hypertext tags or effort required to accomplish changes. Another example is illustrated by measuring the distance from an instance document to a baseline template using a metric [NFernandes11a]. Another novel way of measuring accessibility can be by using a grading scale and an arbitration process, as proposed by Fischer and Wyatt [Fischer]: the use of a five-point Likert scale aims at going beyond a binary accessible/non-accessible scoring scale. It would be interesting to see, in the future, how the final outcome of an evaluation depends on the original scores given by individual evaluators and what level of agreement exists between evaluators before arbitration takes place.

Vigo [Vigo11c] proposes a method by which, depending on the context, the number of checkpoints to be met changes. Nietzio et al. [Nietzio] suggest a stepwise method to measure conformance to WCAG 2.0, where aspects of success criteria applicability or tool support are considered. Such a method adapts to the specific testing procedures of WCAG 2.0 success criteria (SC) by providing a set of decision rules: first, the applicability of SC is analysed; second, if applicable, the SC is tested; third, if a common failure is not found, the implementation of the sufficient techniques is checked; and finally, tool support is checked for the techniques identified in the previous step. The metric computed as a result of this process is a failure rate that takes into account also the logic underlying necessary, sufficient and counter-example techniques for each SC.
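The four decision steps described above can be sketched in Python as follows. This is an illustration of the order of the steps only: the outcome labels, the Evidence record, the rule that a missing or untestable technique yields an undetermined result, and the way the failure rate is aggregated are all assumptions, since the report does not specify how each step maps onto a result.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class Outcome(Enum):
    NOT_APPLICABLE = "not applicable"
    FAIL = "fail"
    PASS = "pass"
    UNDETERMINED = "undetermined"  # hypothetical label for unresolved cases


@dataclass
class Evidence:
    """Hypothetical findings for one success criterion (SC) on one page."""
    applicable: bool
    common_failure_found: bool
    sufficient_technique_found: bool
    tool_supported: bool


def evaluate(e: Evidence) -> Outcome:
    # Step 1: is the SC applicable to this page?
    if not e.applicable:
        return Outcome.NOT_APPLICABLE
    # Step 2: the SC is tested; a common failure settles the result.
    if e.common_failure_found:
        return Outcome.FAIL
    # Step 3: no common failure, so look for a sufficient technique.
    if not e.sufficient_technique_found:
        return Outcome.UNDETERMINED
    # Step 4: can a tool check the technique identified in step 3?
    return Outcome.PASS if e.tool_supported else Outcome.UNDETERMINED


def failure_rate(evidence: Iterable[Evidence]) -> float:
    """Assumed aggregation: failures over all decided (pass or fail) results."""
    outcomes = [evaluate(e) for e in evidence]
    fails = outcomes.count(Outcome.FAIL)
    decided = fails + outcomes.count(Outcome.PASS)
    return fails / decided if decided else 0.0


# Four hypothetical criteria: not applicable, failed, passed, undetermined.
findings = [
    Evidence(False, False, False, False),
    Evidence(True, True, False, True),
    Evidence(True, False, True, True),
    Evidence(True, False, True, False),
]
print(failure_rate(findings))  # 0.5
```

Keeping undetermined results out of the denominator is one design choice among several; counting them as failures or as passes would correspond to a more conservative or a more optimistic reading of the same evidence.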

3.6 Beyond Conformance

Vigo [Vigo11c] proposes a method that not only considers guidelines when measuring accessibility conformance, but also considers the specific features of the device (e.g., screen size, keyboard support) as well as the assistive technology operated by the users. Including these contextual characteristics of the interaction could lead to more faithful measurements of the experience. Finally, Sloan and Kelly [Sloan] claim that understanding accessibility as conformance to guidelines is risky in those countries (e.g., the UK) where accessibility assessment is not limited to guidelines but also focuses on the delivered service and user experience. Therefore, they encourage moving forward and embracing accessibility in terms of user experience and thinking of conformance of the production process, rather than conformance of a product that constantly changes. This perspective is novel in that it looks beyond the current conformance paradigm and aims to tap more into the user experience, and this is something that is not necessarily defined by current methods of technical validation or document conformance.

3.7 Concluding Remarks

The authors of the above papers were asked about some aspects of web accessibility metrics. The first aspect is about the target users of metrics; the goal of this question is to ascertain whether metrics researchers have in mind application scenarios or the profile of the end user who will make decisions based on the scores provided by metrics. Our survey shows that the majority of respondents do not have in mind a specific end user of metrics, or their answers are too generic. However, three papers are focused on web accessibility benchmarking (see [Nietzio, Battistelli11a, JFernandes]) and some others could be applied in this domain. This means that this is the application scenario with the broadest acceptance and where the application of metrics is taking off. In the remaining scenarios (quality assurance, information retrieval and adaptive web) there are also potential applications, although the intent of applying metrics in these scenarios is not evident.

Second, we wanted to know whether accessibility metrics researchers are aware of the costs and risks incurred by having incorrect values for metrics. Most respondents consider that validity and reliability of metrics should be guaranteed, although many contemplate it as future work. There is some tendency towards employing experts in such validations, although most agree that users will have the last word as far as validation is concerned. This is closely related to our last question about the research community's point of view on measuring accessibility beyond conformance metrics. All answers we received claimed that measuring accessibility in terms of user experience should be explored more thoroughly.

4. A Research Roadmap for Web Accessibility Metrics

This research report aims at highlighting current efforts in investigating accessibility metrics as well as uncovering existing challenges. Research on web accessibility metrics is taking off as the benefits of using them are becoming apparent; however, their adoption is far from being widespread. In addition to their relative novelty, this may occur because (1) there is a plethora of metrics out there and frameworks for metrics comparison that show their strengths and weaknesses are relatively recent [Vigo11a]; (2) quality frameworks require further investigation as there are unexplored areas for each of the defined qualities - these areas are uncovered in section 4.1; (3) the validity of existing metrics is low, which calls for a standardized testbed to show how they perform with regard to metrics quality. Setting up a corpus of web pages for benchmarking purposes could be the first step towards this goal. It would work in the same way that the Information Retrieval community does to test the performance of their algorithms [see the Text Retrieval Conference, TREC] - see section 4.2. A side-effect of the lack of validity and reliability of metrics is their lack of credibility. This could partially be tackled by the mentioned benchmarking corpus. However, the credibility problem goes beyond this - see section 4.3. Finally, some other issues such as user-tailored metrics and dealing with dynamic content require special attention from those who aim at conducting research on web accessibility metrics.

4.1 Ensuring Metric Quality

More precisely, there are still many challenges to pursue in investigating accessibility metric quality. How a metric satisfies the validity, reliability, sensitivity, adequacy and complexity qualities remains an open issue that can be addressed through the following questions. Even if all qualities are important, we emphasize that validity and reliability of metrics should be given priority. It does not matter how sensitive or adequate a metric is if we cannot ensure its reliability and, especially, its validity.

4.2 Validity

Studies of "validity with respect to conformance" could focus on the following research questions:

Does validity of the metric change when we change guidelines?
Does validity change when we use a subset of the guidelines?
Does validity depend on the genre of the website?
Is validity dependent on the type of data being provided by the testing tool?
Does validity change when we switch the tool used to collect data? And what if we use data produced by merging results of two or more tools, rather than basing the metric on the data of a single tool?
Are there quick ways to estimate validity of a metric?

The above questions could be addressed in the following way:

By a panel of judges that would systematically evaluate all the pages using the same guidelines used by the tool(s).
By artificially seeding web pages with known accessibility problems (i.e. violations of guidelines), and systematically investigating how these known problems affect the metric scores.
By exploring the impact on validity of manual tests when (1) they are excluded or (2) their effect is estimated.
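
The seeding approach could be sketched as follows. This is only a minimal illustration: the Page model, the failure-rate metric and the helper names are assumptions made for the example and do not correspond to any particular tool or to a metric discussed at the symposium.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Page:
    """Hypothetical page model: applicable checks and how many of them fail."""
    applicable_checks: int
    failed_checks: int


def failure_rate(page: Page) -> float:
    """Illustrative metric: share of applicable checks that fail (0 = best, 1 = worst)."""
    if page.applicable_checks == 0:
        return 0.0
    return page.failed_checks / page.applicable_checks


def seed(page: Page, extra_failures: int) -> Page:
    """Return a copy of the page in which known violations replace passing checks."""
    passing = page.applicable_checks - page.failed_checks
    if not 0 <= extra_failures <= passing:
        raise ValueError("cannot seed more violations than there are passing checks")
    return Page(page.applicable_checks, page.failed_checks + extra_failures)


def seeding_curve(page: Page, max_seeded: int) -> List[float]:
    """Metric score after seeding 0, 1, ..., max_seeded known violations."""
    return [failure_rate(seed(page, k)) for k in range(max_seeded + 1)]


def never_improves(scores: List[float]) -> bool:
    """Check that the score never gets better as known problems are added."""
    return all(a <= b for a, b in zip(scores, scores[1:]))
```

A metric whose seeding curve fails the never_improves check on such pages would be a candidate for closer inspection, since adding known violations should not make a page look more accessible.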

Studies of "validity with respect to accessibility in use" should overcome the evaluator effect [Hornbæk] and lack of agreement of users in their severity ratings [Petrie] and could address the following questions:

Which factors affect this type of validity?
Is it possible to estimate validity of the metric from other information that can be easily gathered?
Is validity with respect to accessibility in use related to validity with respect to conformance?

4.3 Reliability

Some efforts to understand metric reliability could go in the following direction:

How do results produced by different tools vary when applied to the same website?
Study the differences in the metric scores when metrics are fed with data produced by the same tool on the same websites but when applying different guidelines.
The analysis of the effects of page sampling, a process that is necessary when dealing with large websites or highly dynamic ones.
See how reliability changes when merging the data produced by two or more evaluation tools applied to the same website.
The analysis of how reliability of a metric correlates with its validity.
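
One possible way to quantify how far two tools agree is to compare the rankings that a metric produces when fed with each tool's data for the same pages. The sketch below uses Spearman's rank correlation for this purpose; choosing this statistic is an assumption of the example, not a recommendation of this report, and the function names are hypothetical.

```python
from math import sqrt
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """Assign 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when all scores are equal")
    return covariance / (spread_x * spread_y)


def spearman(scores_a: Sequence[float], scores_b: Sequence[float]) -> float:
    """Rank agreement between metric scores obtained from two tools on the same pages."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two score lists of equal length with at least two pages")
    return pearson(ranks(scores_a), ranks(scores_b))
```

Values close to 1 would indicate that both tools lead the metric to rank the pages in the same order, even if the absolute scores differ; values close to 0 or negative would point to poor inter-tool reliability.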

4.4 Other Qualities

4.4.1 Sensitivity

Experiments could be set up to perform sensitivity analysis: given a set of accessibility problems in a test website, they could be systematically turned on or off, and their effects on metric values could be analysed to find out which kinds of problems had the largest effect and under which circumstances. Provided that valid and reliable metrics were used, this could tell us which accessibility barriers have a stronger or weaker impact on conformance or use.
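
A toggling experiment of this kind could look like the sketch below. The problem names, their weights and the capped toy metric are all hypothetical and serve only to show the mechanics; a real study would plug in the metric under investigation.

```python
from typing import Callable, Dict, FrozenSet, Iterable

Metric = Callable[[FrozenSet[str]], float]

# Hypothetical severity weights, chosen only for illustration.
WEIGHTS = {"missing_alt": 0.5, "low_contrast": 0.3, "unlabelled_form": 0.4}


def capped_metric(active: FrozenSet[str]) -> float:
    """Toy metric: sum of the weights of the active problems, capped at 1.0 (0 = best)."""
    return min(1.0, sum(WEIGHTS[p] for p in active))


def toggle_effects(problems: Iterable[str], metric: Metric, others_on: bool) -> Dict[str, float]:
    """Score change caused by switching each problem on.

    The remaining problems are either all switched on (others_on=True)
    or all switched off (others_on=False) while one problem is toggled.
    """
    everything = frozenset(problems)
    effects = {}
    for problem in everything:
        background = everything - {problem} if others_on else frozenset()
        effects[problem] = metric(background | {problem}) - metric(background)
    return effects
```

With this toy metric, a problem toggled on a clean page shows its full weight, whereas the same problem toggled on a page that already contains the other problems shows a smaller effect because the score saturates. This is one example of the "under which circumstances" question raised above.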

4.4.2 Adequacy

Provided that a metric is valid and reliable, research directions about metric adequacy should analyse the suitability and usefulness of its values for users in different scenarios, as well as metric visualization and presentation issues.

4.4.3 Complexity

The most important issue concerning metric complexity lies in its relationship with the rest of the qualities. In this regard we can pose the following questions:

Does greater complexity in a metric ensure more valid and reliable results? If so, could we pursue a compromise between the maximum acceptable complexity of a metric and its minimum acceptable validity?
Can we find proxies (e.g. number of pictures in a web page) to predict the accessibility of a web page? As a side effect we could dramatically reduce the complexity of metrics.
The role that the complexity of a metric plays in its adoption and employment could also be another line to follow.

5. A Corpus for Benchmarking Metrics

One option to have a common playground so that the research community could shed some light on these challenges would be to organise the same kind of competitions as the TREC experiments. Recently, some efforts have been directed towards this goal by the W3C or in the context of the BenToWeb project. There are several issues that need to be tackled.

How do we create test collections?
How do we select our test-participants? (Important in cases where the results for given metrics depend significantly on the tester)
Do we make use of existing web pages?
How do we inject accessibility defects in these web pages?
Which criteria do we use to rank metrics?
How do we isolate the metric from the underlying testing tool?
Which factors should influence metrics (e.g., defects per page for a given criterion, defects repetition due to a single defect on a server-side web page template, WCAG severity level, etc.)?
How do we make these outputs accessible to "non-experts"?
How can we "dovetail" the user experience with metrics used in the wild?
How about comparing the results of user tests with accessibility evaluation tools? This would be very interesting in terms of websites that are already borderline or considered inaccessible.

To start with, we could collect pages that we know are accessible and pages that we know are not (because we injected faults into them or collected them from repositories such as www.fixtheweb.net), and ask participants to apply their metrics to these pages and report how far apart the accessible pages are from the non-accessible ones. Another option would be to use pages from initiatives such as the one promoted by the WAI, "BAD: Before and After Demonstration", where, for educational purposes, the process of transforming a non-accessible page into an accessible one is shown.
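
How far apart a metric places the two groups of pages could be summarised in a single number. The sketch below computes the probability that a randomly chosen non-accessible page receives a worse score than a randomly chosen accessible one (the area under the ROC curve). Using this statistic as a ranking criterion is an assumption of the example, as is the convention that higher scores mean worse accessibility.

```python
from typing import Sequence


def separation(accessible_scores: Sequence[float], inaccessible_scores: Sequence[float]) -> float:
    """Probability that a non-accessible page scores worse (higher) than an accessible one.

    Ties count as half a win. A value of 1.0 means perfect separation,
    0.5 means the metric does no better than chance on this corpus.
    """
    if not accessible_scores or not inaccessible_scores:
        raise ValueError("both groups of pages must be non-empty")
    wins = 0.0
    for bad in inaccessible_scores:
        for good in accessible_scores:
            if bad > good:
                wins += 1.0
            elif bad == good:
                wins += 0.5
    return wins / (len(accessible_scores) * len(inaccessible_scores))
```

Participants' metrics could then be compared by applying each of them to the same corpus and reporting this value, keeping in mind that it only measures separation on the chosen pages and says nothing about other qualities such as adequacy.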

5.1 Credibility issues

Accessibility scores are a great device to grasp the accessibility level of web pages. However, metrics can turn out to be a double-edged sword: while they enhance comprehension, they can also hide relevant information and details on the accessibility of a page. This side effect can lead end users to choose, among the available metrics, the one that yields the most lenient scores. As a result, there is a risk of undermining the credibility of, and trust in, accessibility metrics.

The fact that different evaluation tools yield different results directly affects metric validity and, in particular, metric reliability. The poor reproducibility of evaluation reports and accessibility scores has a side-effect on the perception of individuals in that the web accessibility assessment process can be regarded as not very credible.

5.2 User-tailored metrics

Personalizing metrics is a challenge, as not all success criteria impact all users in the same way. While some have tried to group guidelines according to their impact on particular user groups, user needs can be so specific that the effect of a given barrier is more closely related to the user's individual abilities and cannot be inferred from the fact that the user is identified as having a particular disability. Individual needs may deviate considerably from group guidelines (e.g., a motor-impaired individual having more residual physical abilities than the group guidelines foresee). There are some research actions that could be taken to improve user-tailored metrics:

Users' interaction context could be considered in metrics, encompassing the Assistive Technology (AT) they are using, the specific browser, plug-ins and operating system platform. In this regard, capturing and encapsulating users' context data in a profile would be a priority.
Quantifying the relevance of guidelines: in order to tailor evaluation and measurement to the particular needs of users, accessibility barriers or checkpoint violations should be weighted according to the impact they have on a particular user group or individual.
Reasoning over guidelines. This way, variables that metrics normally require (priorities, number of applied guidelines) can be easily extracted and automatically inferred from violated success criteria (SC).
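
The idea of weighting violations by their relevance to a given user could be sketched as follows. The barrier names, the relevance values and the normalisation by the total number of violations are assumptions made for illustration; they are not taken from any existing user-tailored metric.

```python
from typing import Dict


def tailored_score(violations: Dict[str, int], relevance: Dict[str, float]) -> float:
    """Relevance-weighted share of violations for one user profile (0 = no relevant barriers).

    violations maps a barrier type to the number of times it occurs on the page.
    relevance maps a barrier type to its impact on this user, between 0 and 1.
    Barrier types missing from the profile are treated as irrelevant (weight 0).
    """
    total = sum(violations.values())
    if total == 0:
        return 0.0
    weighted = sum(count * relevance.get(barrier, 0.0) for barrier, count in violations.items())
    return weighted / total


# Hypothetical page and profiles, for illustration only.
page_violations = {"missing_alt": 6, "low_contrast": 3, "keyboard_trap": 1}
screen_reader_user = {"missing_alt": 1.0, "low_contrast": 0.0, "keyboard_trap": 1.0}
low_vision_user = {"missing_alt": 0.2, "low_contrast": 1.0, "keyboard_trap": 0.3}
```

The same page then yields a different score for each profile, which is the behaviour a user-tailored metric is meant to capture. How the relevance values themselves should be obtained remains the open research question described above.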

5.3 Dealing with dynamic content

Measuring something that changes over time can give different results depending on the magnitude of such changes. Modern web pages are dynamic, changing their content over time. These changes are not always a reaction to user interaction but can also be due to other factors such as time or location. Especially in Rich Internet Applications, these updates are frequently provoked by scripting techniques that mutate web contents. Therefore, the mark-up gives few hints for predicting the behaviour of a web document. Normally, the most appropriate way to assess the current instance of a dynamic web document is to retrieve and test its DOM; then its subsequent mutations should be monitored and tested. As expected, different instances of a document caused by updates show inconsistent accessibility evaluation results [Fernandes11]. As a result, if a metric is sensitive enough, it should be able to reflect these updates.
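
Once each instance of a document has been tested and scored, the variation between instances could be summarised as in the sketch below. It assumes that a score is already available for every monitored instance (higher meaning worse) and that testing only every n-th instance is a fair stand-in for testing at intervals; both are simplifications made for the example.

```python
from statistics import mean
from typing import Dict, List, Sequence


def sample_instances(scores: Sequence[float], every: int) -> List[float]:
    """Keep every n-th instance, starting with the first, to mimic testing at intervals."""
    if every < 1:
        raise ValueError("sampling interval must be at least 1")
    return list(scores[::every])


def summarise(scores: Sequence[float]) -> Dict[str, float]:
    """Mean, best, worst and spread of the scores obtained for the tested instances."""
    if not scores:
        raise ValueError("at least one instance must be tested")
    best, worst = min(scores), max(scores)
    return {"mean": mean(scores), "best": best, "worst": worst, "spread": worst - best}
```

Comparing the summary of all instances with the summary of a sample gives a rough idea of what is lost by not testing every update: a short-lived but severe barrier may disappear entirely from the sampled view.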

This area calls for research on the frequency of testing: should pages be tested every time they update, or should they be retrieved at sampling intervals? Additionally, there are some other questions: what would be the accessibility score of a given URL if page updates entail changes in accessibility? Should an average of all instances be

The conformance to WAI-ARIA and the accessibility elements subsumed by HTML5 could also be explored by future accessibility metrics.

6. Conclusions

This research report introduces web accessibility metrics: they have been defined and specified, the benefits of using them have been highlighted and some possible application scenarios have been described. Spurred by the growing number of different metrics that are being released, we present a framework that encompasses the qualities that a good metric should have. As a result, metrics can be benchmarked according to their validity, reliability, sensitivity, adequacy and complexity. We believe this framework can help individuals to make decisions on the adoption of existing metrics according to the qualities required from metrics. In this way, there will not be the need to reinvent the wheel and design new metrics if available metrics already fit one's needs.

A symposium was held in order to check how metrics address the above-mentioned qualities and to keep track of current efforts targeting quality issues of accessibility metrics. The webinar provided a partial, but concrete, snapshot of most of the research activity around this topic. We found that tool reliability is a recurrent topic in this regard, and there is still a long way to go in the realm of methods and examples for metric validity, which are rare. The editors of this research report believe that more efforts should be directed to investigating the validity and reliability of metrics. Employing metrics whose validity and reliability are questionable is a very risky practice that should be avoided. We therefore claim that accessibility metrics should be used and designed responsibly.

One way to hide the inherent complexity of metrics is to provide tools that facilitate their application in an automatic or semi-automatic way. This need for automation comes from the necessity of assessing large volumes of data and websites; that is why large scale analysis of accessibility calls for metrics that can easily be deployed and implemented. Some other efforts are targeting specific quality aspects of the Web such as lexical quality or compliance with DTDs. Finally, an emerging trend aims at measuring accessibility not only in pure compliance terms. Since contextual factors play an important role in determining the quality of user experience, accessibility measurement should be able to consider these factors by collecting and including them in the measurement process or by observing the behaviour and performance of real users in real settings, in the manner of usability testing. This perspective can be understood as a complementary approach to current accessibility measurement practice.

Based on the needs and gaps that hinder current accessibility measurement, we propose a number of research avenues that can help to boost the acceptance and quality of accessibility metrics. Above all, quality issues of metric validity and reliability need urgent action, but there are also some other actions that can help to make metrics more credible and widespread. A common corpus for metrics benchmarking would be a good step in this direction as it could potentially tackle quality and credibility issues at the same time. Dynamic content and user-tailoring aspects can open new research paths that can have a strong impact on the quality of assessment practices, methodologies and tools.

7. References

[AIR] Accessibility Internet Rally (AIR). Available at http://www.knowbility.org/v/air/
[Battistelli11a] M. Battistelli, S. Mirri, L.A. Muratori, P. Salomoni (2011) Measuring accessibility barriers on large scale sets of pages. W3C/WAI RDWG Symposium on Website Accessibility Metrics, paper 2. http://www.w3.org/WAI/RD/2011/metrics/paper2/
[Battistelli11b] M. Battistelli, S. Mirri, L.A. Muratori, P. Salomoni (2011) A metrics to make different DTDs documents evaluations comparable. W3C/WAI RDWG Symposium on Website Accessibility Metrics, paper 4. http://www.w3.org/WAI/RD/2011/metrics/paper4/
[Brajnik04] G. Brajnik (2004) Comparing accessibility evaluation tools: a method for tool effectiveness. Universal Access in the Information Society 3(3-4), 252-263, DOI: 10.1007/s10209-004-0105-y
[Brajnik07] G. Brajnik, R. Lomuscio (2007) SAMBA: a semi-automatic method for measuring barriers of accessibility. ASSETS 2007, 43-50, DOI: 10.1145/1296843.1296853
[Brajnik08] G. Brajnik (2008) Beyond Conformance: The Role of Accessibility Evaluation Methods. WISE Workshops 2008, 63-80, DOI: 10.1007/978-3-540-85200-1_9
[Brajnik11] G. Brajnik (2011) The troubled path of accessibility engineering: an overview of traps to avoid and hurdles to overcome. ACM SIGACCESS Accessibility and Computing Newsletter, Issue 100, June 2011.
[Fischer] D. Fischer, T. Wyatt (2011) The case for a WCAG-based evaluation scheme with a graded rating scale. W3C/WAI RDWG Symposium on Website Accessibility Metrics, paper 7. http://www.w3.org/WAI/RD/2011/metrics/paper7/
[Hornbæk] K. Hornbæk, E. Frøkjær (2008) A study of the evaluator effect in usability testing. Human-Computer Interaction 23 (3), 251-277, DOI: 10.1080/07370020802278205
[JFernandes] J. Fernandes, C. Benavidez (2011) A zero in eChecker equals a 10 in eXaminator: a comparison between two metrics by their scores. W3C/WAI RDWG Symposium on Website Accessibility Metrics, paper 8. http://www.w3.org/WAI/RD/2011/metrics/paper8/
[Lopes] R. Lopes, D. Gomes, L. Carriço. (2010) Web not for all: a large scale study of web accessibility. W4A 2010, article 10, DOI: 10.1145/1805986.1806001
[Naftali] M. Naftali, O. Clúa (2011) Integration of Web Accessibility Metrics into a Semi-Automatic evaluation process. W3C/WAI RDWG Symposium on Website Accessibility Metrics, paper 1. http://www.w3.org/WAI/RD/2011/metrics/paper1/
[NFernandes11a] N. Fernandes, R. Lopes, L. Carriço (2011) A Template-aware Web Accessibility metric. W3C/WAI RDWG Symposium on Website Accessibility Metrics, paper 3. http://www.w3.org/WAI/RD/2011/metrics/paper3/
[NFernandes11b] N. Fernandes, R. Lopes, L. Carriço (2011) On web accessibility evaluation environments. Proceedings of the International Cross-Disciplinary Conference on Web Accessibility, W4A 2011, article 4. DOI: 10.1145/1969289.1969295
[Nietzio] A. Nietzio, M. Eibegger, M. Goodwin, M. Snaprud (2011) Towards a score function for WCAG 2.0 benchmarking. W3C/WAI RDWG Symposium on Website Accessibility Metrics, paper 11. http://www.w3.org/WAI/RD/2011/metrics/paper11/
[Petrie] H. Petrie, O. Kheir (2007) Relationship between accessibility and usability of web sites. CHI 2007, 397-406, DOI: 10.1145/1240624.1240688
[readability] Readability test. http://en.wikipedia.org/wiki/Readability_test
[Rello] L. Rello, R. Baeza-Yates (2011) Lexical Quality as a Measure for Textual Web Accessibility. W3C/WAI RDWG Symposium on Website Accessibility Metrics, paper 5. http://www.w3.org/WAI/RD/2011/metrics/paper5/
[Vigo07] M. Vigo, M. Arrue, G. Brajnik, R. Lomuscio, J. Abascal (2007) Quantitative metrics for measuring web accessibility. W4A 2007, 99-107, DOI: 10.1145/1243441.1243465
[Vigo11a] M. Vigo, G. Brajnik (2011) Automatic web accessibility metrics: where we are and where we can go. Interacting With Computers 23(2), 137-155, DOI: 10.1016/j.intcom.2011.01.001
[Vigo11b] M. Vigo, J. Abascal, A. Aizpurua, M. Arrue (2011) Attaining Metric Validity and Reliability with the Web Accessibility Quantitative Metric. W3C/WAI RDWG Symposium on Website Accessibility Metrics, paper 6. http://www.w3.org/WAI/RD/2011/metrics/paper6/
[Vigo11c] M. Vigo (2011) Context-Tailored Web Accessibility Metrics. W3C/WAI RDWG Symposium on Website Accessibility Metrics, paper 9. http://www.w3.org/WAI/RD/2011/metrics/paper9/
[Sloan] D. Sloan, B. Kelly (2011) Web Accessibility Metrics For A Post Digital World. W3C/WAI RDWG Symposium on Website Accessibility Metrics, paper 10. http://www.w3.org/WAI/RD/2011/metrics/paper10/

8. Symposium Proceedings

Research Report on Web Accessibility Metrics

This document should be cited as follows:

M. Vigo, G. Brajnik, J. O Connor (2012) W3C/WAI Research Report on Web Accessibility Metrics.

The latest version of this document is available at:

http://www.w3.org/TR/accessibility-metrics-report/

A permanent link to this version of the document is:

http://www.w3.org/TR/2012/NOTE-accessibility-metrics-report-201206xx

A BibTex file is provided containing:

@article{w3c_rdwg_webmetrics,
     author = {Markel Vigo and Giorgio Brajnik and Joshue O Connor},
     title = {Research Report on Web Accessibility Metrics},
     journal = {W3C/WAI Research and Development Working Group (RDWG) Notes},
     year = {2012},
     month = {????},
     url = {????},
}

Contributed Extended Abstract Papers

The links provided in this section, including those in the BibTex files, are permanent; see also the W3C URI Persistence Policy.

M. Naftali, O. Clúa. Integration of Web Accessibility Metrics into a Semi-Automatic evaluation process. BibTex file, Slides.

@inproceedings{naftali2011,
     author = {Maia Naftali and Osvaldo Cl\'{u}a},
     title = {Integration of Web Accessibility Metrics into a Semi-Automatic 
       evaluation process},
     booktitle = {W3C/WAI Symposium on Website Accessibility Metrics},
     year = {2011},
     editor = {Markel Vigo and Giorgio Brajnik and Joshue O Connor},
     pages = {article 1},
     url = {http://www.w3.org/WAI/RD/2011/metrics/paper1/},
}

M. Battistelli, S. Mirri, L.A. Muratori, P. Salomoni. Measuring accessibility barriers on large scale sets of pages. BibTex file, Slides.

@inproceedings{battistelli2011a,
     author = {Matteo Battistelli and Silvia Mirri and Ludovico Antonio Muratori 
       and Paola Salomoni},
     title = {Measuring accessibility barriers on large scale sets of pages},
     booktitle = {W3C/WAI Symposium on Website Accessibility Metrics},
     year = {2011},
     editor = {Markel Vigo and Giorgio Brajnik and Joshue O Connor},
     pages = {article 2},
     url = {http://www.w3.org/WAI/RD/2011/metrics/paper2/},
}

N. Fernandes, R. Lopes, L. Carriço. A Template-aware Web Accessibility metric. BibTex file.

@inproceedings{nfernandes2011,
     author = {N\'{a}dia Fernandes and Rui Lopes and Lu\'{i}s Carri\c{c}o},
     title = {A Template-aware Web Accessibility metric},
     booktitle = {W3C/WAI Symposium on Website Accessibility Metrics},
     year = {2011},
     editor = {Markel Vigo and Giorgio Brajnik and Joshue O Connor},
     pages = {article 3},
     url = {http://www.w3.org/WAI/RD/2011/metrics/paper3/},
}

M. Battistelli, S. Mirri, L.A. Muratori, P. Salomoni. A metrics to make different DTDs documents evaluations comparable. BibTex file, Slides.

@inproceedings{battistelli2011b,
     author = {Matteo Battistelli and Silvia Mirri and Ludovico Antonio Muratori 
       and Paola Salomoni},
     title = {A metrics to make different DTDs documents evaluations comparable},
     booktitle = {W3C/WAI Symposium on Website Accessibility Metrics},
     year = {2011},
     editor = {Markel Vigo and Giorgio Brajnik and Joshue O Connor},
     pages = {article 4},
     url = {http://www.w3.org/WAI/RD/2011/metrics/paper4/},
}

L. Rello, R. Baeza-Yates. Lexical Quality as a Measure for Textual Web Accessibility. BibTex file, Slides.

@inproceedings{rello2011,
     author = {Luz Rello and Ricardo Baeza-Yates},
     title = {Lexical Quality as a Measure for Textual Web Accessibility},
     booktitle = {W3C/WAI Symposium on Website Accessibility Metrics},
     year = {2011},
     editor = {Markel Vigo and Giorgio Brajnik and Joshue O Connor},
     pages = {article 5},
     url = {http://www.w3.org/WAI/RD/2011/metrics/paper5/},
}

M. Vigo, J. Abascal, A. Aizpurua, M. Arrue. Attaining Metric Validity and Reliability with the Web Accessibility Quantitative Metric. BibTex file, Slides.

@inproceedings{vigo2011a,
     author = {Markel Vigo and Julio Abascal and Amaia Aizpurua and Myriam Arrue},
     title = {Attaining Metric Validity and Reliability with the Web Accessibility 
       Quantitative Metric},
     booktitle = {W3C/WAI Symposium on Website Accessibility Metrics},
     year = {2011},
     editor = {Markel Vigo and Giorgio Brajnik and Joshue O Connor},
     pages = {article 6},
     url = {http://www.w3.org/WAI/RD/2011/metrics/paper6/},
}

D. Fischer, T. Wyatt. The case for a WCAG-based evaluation scheme with a graded rating scale. BibTex file, Slides.

@inproceedings{fischer2011,
     author = {Detlev Fischer and Tiffany Wyatt},
     title = {The case for a WCAG-based evaluation scheme with a 
       graded rating scale},
     booktitle = {W3C/WAI Symposium on Website Accessibility Metrics},
     year = {2011},
     editor = {Markel Vigo and Giorgio Brajnik and Joshue O Connor},
     pages = {article 7},
     url = {http://www.w3.org/WAI/RD/2011/metrics/paper7/},
}

J. Fernandes, C. Benavidez. A zero in eChecker equals a 10 in eXaminator: a comparison between two metrics by their scores. BibTex file, Slides.

@inproceedings{jfernandes2011,
     author = {Jorge Fernandes and Carlos Benavidez},
     title = {A zero in eChecker equals a 10 in eXaminator: a comparison 
       between two metrics by their scores},
     booktitle = {W3C/WAI Symposium on Website Accessibility Metrics},
     year = {2011},
     editor = {Markel Vigo and Giorgio Brajnik and Joshue O Connor},
     pages = {article 8},
     url = {http://www.w3.org/WAI/RD/2011/metrics/paper8/},
}

M. Vigo. Context-Tailored Web Accessibility Metrics. BibTex file, Slides.

@inproceedings{vigo2011b,
     author = {Markel Vigo},
     title = {Context-Tailored Web Accessibility Metrics},
     booktitle = {W3C/WAI Symposium on Website Accessibility Metrics},
     year = {2011},
     editor = {Markel Vigo and Giorgio Brajnik and Joshue O Connor},
     pages = {article 9},
     url = {http://www.w3.org/WAI/RD/2011/metrics/paper9/},
}

D. Sloan, B. Kelly. Web Accessibility Metrics For A Post Digital World. BibTex file.

@inproceedings{sloan2011,
     author = {David Sloan and Brian Kelly},
     title = {Web Accessibility Metrics For A Post Digital World},
     booktitle = {W3C/WAI Symposium on Website Accessibility Metrics},
     year = {2011},
     editor = {Markel Vigo and Giorgio Brajnik and Joshue O Connor},
     pages = {article 10},
     url = {http://www.w3.org/WAI/RD/2011/metrics/paper10/},
}

A. Nietzio, M. Eibegger, M. Goodwin, M. Snaprud. Towards a score function for WCAG 2.0 benchmarking. BibTex file, Slides.

@inproceedings{nietzio2011,
     author = {Annika Nietzio and Mandana Eibegger and Morten Goodwin 
       and Mikael Snaprud},
     title = {Towards a score function for WCAG 2.0 benchmarking},
     booktitle = {W3C/WAI Symposium on Website Accessibility Metrics},
     year = {2011},
     editor = {Markel Vigo and Giorgio Brajnik and Joshue O Connor},
     pages = {article 11},
     url = {http://www.w3.org/WAI/RD/2011/metrics/paper11/},
}

9. Acknowledgements

Participants of the W3C/WAI Research and Development Working Group (RDWG) involved in the development of this document include: Christos Kouroupetroglou, Giorgio Brajnik, Joshue O Connor, Klaus Miesenberger, Markel Vigo, Peter Thiessen, Shadi Abou-Zahra, Shawn Henry, Simon Harper, Vivienne Conway, and Yeliz Yesilada.

RDWG would also like to thank the chairs and scientific committee members as well as the paper authors of the RDWG online symposium on Website Accessibility Metrics.

This document was developed with support from the WAI-ACT Project.