Responses to comments on the Research Report on Web Accessibility Metrics (working draft of 30 August 2012)

Overview

  1. Contributions
  2. General comments
  3. Comments for Section 1. Introduction
  4. Comments for Section 2. A Framework for Quality of Accessibility Metrics
  5. Comments for Section 3. Current Research
  6. Comments for Section 4. A Research Roadmap for Web Accessibility Metrics
  7. Comments for Section 5. A Corpus for Benchmarking Metrics

Contributions

General comments

Table addressing general comments
ID 1: Diana Kornbrot (Open)
  Suggested change: Automatic tests.
  Comment: Out of scope.
  Resolution: No action taken.

ID 2: Diana Kornbrot (Open)
  Suggested change: Web addresses for protocols for user testing.
  Comment: Out of scope.
  Resolution: No action taken.

ID 3: Diana Kornbrot (Open)
  Suggested change: Links to tools that generate HTML5- and CSS3-conformant pages.
  Comment: Out of scope.
  Resolution: No action taken.

ID 4: Diana Kornbrot (Open)
  Suggested change: Templates conforming to HTML5 and CSS3.
  Comment: Out of scope.
  Resolution: No action taken.

ID 5: Francois Junique (Open)
  Suggested change: [The report is] not always very clearly separating them [conformance vs accessibility], and neither really discussing how they could possibly be combined.
  Comment: Agreed on emphasising these differences more.
  Resolution: We made some changes in section 1.1 and created subsection 5.5 to address this issue.

ID 6: Francois Junique (Open)
  Suggested change: Regarding the accessibility ones [metrics], it apparently doesn't discuss the indicators developed by some teams regarding accessibility per major type of disability (including their severity level and predominance in populations, which are key issues for stakeholders' and policy-makers' decisions).
  Comment: While we understand that some references and metric properties may be missing, the intention of the research note is not to run a comprehensive survey but to propose new research avenues. There are other papers that analyse and compare all the relevant contributions regarding web accessibility metrics; see, for instance, "Automatic web accessibility metrics: where we are and where we can go". (A purely illustrative sketch of the prevalence-weighted idea raised here follows this table.)
  Resolution: No action taken.

ID 7: David Sloan (Open)
  Suggested change: See email.
  Comment: We see the comments as an endorsement of the research note.
  Resolution: No action taken.
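
Comment #6 above mentions indicators broken down per major type of disability, weighted by severity and prevalence in the population. The following minimal Python sketch is only an illustration of what the commenter describes; the disability groups, prevalence figures, scores, and function name are all invented, and the research note does not endorse any particular scheme.

```python
# Hypothetical sketch of the per-disability indicators mentioned in
# comment #6: per-group accessibility scores combined using population
# prevalence as weights. All groups, prevalences, and scores are
# invented for illustration.

def prevalence_weighted_score(group_scores: dict[str, float],
                              prevalence: dict[str, float]) -> float:
    """Average per-disability-group scores, weighted by how common
    each group is in the target population."""
    total = sum(prevalence[g] for g in group_scores)
    return sum(group_scores[g] * prevalence[g] for g in group_scores) / total


group_scores = {"low_vision": 0.6, "blind": 0.3, "motor": 0.8}
prevalence = {"low_vision": 0.04, "blind": 0.01, "motor": 0.02}
print(f"Population-weighted score: "
      f"{prevalence_weighted_score(group_scores, prevalence):.2f}")
```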

Comments for Section 1. Introduction

Table addressing comments for section 1
ID 8: Annika Nietzio (Open)
  Suggested change: In the beginning of this section the concepts "metric" and "indicator" get mixed up. I'd suggest using the term "indicator" to refer to single dimensions that can be assessed objectively (such as number of pictures, violations, etc.). Maybe you mean the same when you refer to "basic metrics". In my opinion, a metric includes the combination of several indicators using different mathematical operations, weighting parameters, etc., exactly as in your example (readability metrics).
  Comment: To us, an indicator is a proxy metric/score/value that predicts some quality. Regarding metrics, there are simple metrics and more complex ones. In any case, we do not think the current text is confusing; we could be more precise, but the consequences are not critical. (A sketch of a metric built from weighted indicators follows this table.)
  Resolution: No action taken.

ID 9: Annika Nietzio (Open)
  Suggested change: The item "the severity of an accessibility barrier" does not fit in the list because it is not an indicator (at least I don't know how it could be measured objectively).
  Comment: Maybe it cannot be measured objectively, but we think it can; another question is whether it generalises to people of similar abilities in similar contexts of use, but it still fits for us.
  Resolution: No action taken.

ID 10: Annika Nietzio (Open)
  Suggested change: The list "different types of data can be produced" mixes "ordinal values" and "conformance levels". These should be distinguished: (a) conformance levels (AAA, AA, A) have a fixed frame of reference (WCAG), and it is possible to determine the conformance level of a single web site; (b) ordinal values (ordinal means "ordered") refer to something like a ranking, i.e. you can compare two web sites and determine which one is better, but not necessarily to what extent one is better than the other. It does not make sense to compute an ordinal value for a single site.
  Comment: We think that the definition given for ordinal values applies perfectly to WAI conformance levels. Regarding "it does not make sense to compute an ordinal value for a single site": conformance levels allow one to locate a page in the AAA, AA, A, 0 ranking (also illustrated in the sketch after this table).
  Resolution: No action taken.

ID 11: Annika Nietzio (Open)
  Suggested change: As a side note: there are other mathematical properties of the results that could be interesting, such as "bounded vs unbounded".
  Comment: Agreed. However, this goes beyond the scope of the research note; we would prefer not to overlap with our Interacting with Computers paper, where we deal with all these issues.
  Resolution: No action taken. See comment and resolution #6.

ID 12: Annika Nietzio (Open)
  Suggested change: Make it more explicit that metrics are not the same as automated testing. Discuss the benefits and disadvantages in the same section.
  Comment: Agreed; we should be more explicit.
  Resolution: (1) We have replaced "tools" with "automated testing" [section 1.1, last paragraph]; and (2) section 1.2 has been updated.
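
The discussion in comments #8 and #10 distinguishes simple indicators from metrics that combine them, and treats WCAG conformance levels as an ordinal scale. The following minimal Python sketch illustrates both points; the indicator names, counts, and weights are invented for illustration and do not come from the research note.

```python
# A minimal, hypothetical sketch of the indicator/metric distinction
# discussed in comment #8, and of conformance levels as an ordinal
# scale (comment #10). Indicator names, counts, and weights are
# invented for illustration.

from enum import IntEnum


class ConformanceLevel(IntEnum):
    """WCAG conformance levels as an ordered (ordinal) scale."""
    NONE = 0
    A = 1
    AA = 2
    AAA = 3


def weighted_failure_rate(indicators: dict[str, tuple[int, int]],
                          weights: dict[str, float]) -> float:
    """Combine per-indicator (violations, applicable cases) counts
    into a single weighted failure rate in [0, 1]: one example of a
    metric built on top of objectively measurable indicators."""
    total_weight = sum(weights[name] for name in indicators)
    score = 0.0
    for name, (violations, applicable) in indicators.items():
        rate = violations / applicable if applicable else 0.0
        score += weights[name] * rate
    return score / total_weight


indicators = {
    "images_without_alt": (4, 20),     # 4 of 20 images lack alt text
    "unlabelled_form_fields": (1, 5),  # 1 of 5 fields has no label
}
weights = {"images_without_alt": 2.0, "unlabelled_form_fields": 1.0}
print(f"Weighted failure rate: "
      f"{weighted_failure_rate(indicators, weights):.2f}")

# Ordinal comparison (comment #10): AA outranks A, but the scale does
# not say by how much.
print(ConformanceLevel.AA > ConformanceLevel.A)  # True
```

As comment #24 later points out, estimating and justifying such weights is itself a significant problem, so a more complex formula of this kind is not automatically a better metric.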

Comments for Section 2. A Framework for Quality of Accessibility Metrics

Table addressing comments for section 2
ID 13: Francois Junique (Open)
  Suggested change: Regarding the characteristics selected, I don't know how the ones selected correspond to those used in the EU for general impact-assessment indicators (RACER): relevant, accepted, credible, unambiguous, easy to interpret, easy to monitor, robust.
  Comment: Our qualities are based on psychometric qualities.
  Resolution: We explain why we analyse our set of qualities by citing O'Donnell and Eggemeier.

ID 14: Francois Junique (Open)
  Suggested change: Probably useful is the facility to combine indicators when aggregating information (e.g. from sub-sites of a portal, or from sites in a geographic region or an activity domain). Also the issue of whether or not indicators can be compared between sites (or groups of sites), and therefore whether an order can possibly be drawn (along some of the indicator dimensions) between sites (or groups of sites).
  Comment: We understand the point, although we are more in favour of aggregating indicators of sampled pages rather than a massive aggregation of all the pages in a site, as seems to be suggested here. The note is not providing solutions; it is about a roadmap, and what is suggested was already partly done in our Interacting with Computers paper. We believe this is related to sampling issues (see the sketch after this table).
  Resolution: We have added a comment on this at the end of section 1.2.

ID 15: Francois Junique (Open)
  Suggested change: One aspect which appears to be missing is a discussion of the minimum number of dimensions a "good" (according to the retained characteristics) indicator should have, both in terms of criteria or accessibility issues covered (e.g. one score-card value is probably much too simplistic for a site) and in terms of statistical representativeness (e.g. extracted from an identification of the type of distribution of the issues diagnosed).
  Comment: Agreed; this is a fair point.
  Resolution: We provide a pointer in section 4.1 to how some qualities are required, desirable or optional depending on the application scenario.

ID 16: Sarah Bourne (Open)
  Suggested change: The evaluation criteria outlined are going to be dependent on what the metrics are going to be used for. For instance, what is "adequate" for a high-level snapshot of the state of accessibility in an enterprise may be almost useless for measuring code quality in a development environment.
  Comment: Completely agree. Again, this is something we covered in our Interacting with Computers paper. We explicitly mention what Sarah says: that the relevance of the qualities depends on the application scenario or use case.
  Resolution: The first paragraph of section 2 addresses this; resolution #15 also addresses this comment.

ID 17: Wayne Dick (Open)
  Suggested change: Though not in your symposium, I think a key parameter is efficacy. Efficacy is not validity. Efficacy measures whether an intervention actually treats the impairment that it is intended to treat. Also, for whom does it work, for whom does it fail, and, hopefully, why.
  Comment: Wayne is talking about the efficacy of users trying to achieve a goal, which is in itself a metric one could use for accessibility. If such tests and measurements are done properly, the resulting metric would be valid (in the context in which it was employed). Thus validity != efficacy, but we would rather refrain from listing all the possible metrics that could be used to measure accessibility.
  Resolution: No action taken.

ID 18: Wayne Dick (Open)
  Suggested change: Scope: testing every site really is not necessary; that is why we have statistics. I think our time would be better spent exploring statistically valid sampling techniques.
  Comment: Using inferential statistics to draw general conclusions requires sampling over pages/resources and people, and perhaps also over the tasks to be accomplished; this is far more complex and laborious than one would like. There are also many ways to sample pages, with different effects on the results (see the ASSETS 2007 paper on sampling issues, and the sketch after this table).
  Resolution: We have added a comment on this at the end of section 1.2.

ID 19: Annika Nietzio (Open)
  Suggested change: The example (picture without alt) seems to question the validity of WCAG. The goal of the guidelines is to describe accessibility for the widest possible range of users. How can the definition of users in "accessibility-in-use" address this issue?
  Comment: Agreed; we now provide some examples.
  Resolution: The first paragraph of section 2.1 has been rewritten.

ID 20: Annika Nietzio (Open)
  Suggested change: It should say: "how changes in a given website are reflected in the metric output". The web site cannot reflect the metric because it is independent of it.
  Comment: The metric gives a value which is a property of that site. If we compare that site with others, then these two or more values "reflect" the accessibility of the sites. Perhaps our use of "reflect" is only a linguistic elaboration of an academic nature.
  Resolution: No action taken.
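
Comments #14 and #18 both point toward estimating site-level scores from a sample of pages rather than evaluating every page. The sketch below is a minimal illustration of that idea, assuming simple random sampling and a normal approximation for the confidence interval; the site, per-page scores, and sample size are fabricated placeholders, and the ASSETS 2007 paper cited above discusses how different sampling choices affect the results.

```python
# Minimal sketch of sample-based aggregation (comments #14 and #18):
# draw a simple random sample of pages and aggregate per-page metric
# scores with an approximate confidence interval. All data here are
# fabricated placeholders.

import random
import statistics


def sample_and_aggregate(page_scores: dict[str, float],
                         sample_size: int,
                         seed: int = 42) -> tuple[float, float]:
    """Estimate a site-level score from a random sample of pages.
    Returns the sample mean and an approximate 95% half-width."""
    rng = random.Random(seed)
    sampled_urls = rng.sample(sorted(page_scores), k=sample_size)
    values = [page_scores[url] for url in sampled_urls]
    mean = statistics.fmean(values)
    stderr = statistics.stdev(values) / (len(values) ** 0.5)
    return mean, 1.96 * stderr  # normal approximation


# Placeholder per-page scores for a hypothetical 1000-page site.
site = {f"https://example.org/page{i}": random.Random(i).random()
        for i in range(1000)}

mean, half_width = sample_and_aggregate(site, sample_size=50)
print(f"Estimated site score: {mean:.2f} +/- {half_width:.2f}")
```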

Comments for Section 3. Current Research

Table addressing comments for section 3
ID 21: Annika Nietzio (Open)
  Suggested change: "counter-example techniques" -> "common failures".
  Comment: Agreed.
  Resolution: Fixed.

Comments for Section 4. A Research Roadmap for Web Accessibility Metrics

Table addressing comments for section 4
ID 22: Annika Nietzio (Open)
  Suggested change: "Conformance" cannot be viewed independently of the requirements to which conformance is claimed. That means that "validity with respect to conformance" is directly related to "validity of the requirements". But validity of requirements (or guidelines) is clearly beyond the scope of this TR. How can this research question be refined?
  Comment: Validity of requirements is not out of scope for the note, although the note itself does not discuss this issue in depth. However, we can mention it as yet another direction to pursue within the topic "validity of metrics".
  Resolution: We have added a bullet point to cover this question in section 4.2.

ID 23: Annika Nietzio (Open)
  Suggested change: Question about the first item: in other parts of this report you say that a tool produces data and the metrics calculate the score from these data. So this research question can be interpreted in two ways: (1) compare the results of the same metric applied to the output of different tools; (2) compare the results of different metrics applied to the same tool output. Both could be interesting.
  Comment: We mention the different reliability modalities raised by Annika, such as inter-tool and inter-metric reliability, without going into detail; more on this can be found in the Interacting with Computers paper on accessibility metrics. (A sketch of an inter-tool reliability check follows this table.)
  Resolution: We included these issues in section 4.3.

ID 24: Annika Nietzio (Open)
  Suggested change: In some parts of the report you say that an easy metric is not necessarily a good metric. This is not the whole truth. Complex metrics (formulae with many unknown parameters, such as weights for disability and severity) also cause many problems in terms of parameter estimation and justification. So in these cases simple might be better.
  Comment: When we say that an easy metric is not necessarily a good metric, we do not mean that a complex one is necessarily good.
  Resolution: No action taken.
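
The resolution to comment #23 mentions inter-tool and inter-metric reliability. One hypothetical way to quantify inter-tool reliability is sketched below: apply the same metric to two tools' outputs for the same pages and compute a rank correlation. The tool scores are fabricated, and Spearman's rho is only one of several plausible agreement statistics.

```python
# Illustrative sketch of one of the reliability modalities mentioned
# in the resolution to comment #23: inter-tool reliability measured
# as the rank correlation between two tools' scores for the same
# metric on the same pages. All scores are fabricated.


def ranks(values: list[float]) -> list[float]:
    """Assign 1-based average ranks to values (ties get the mean rank)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Scores that two hypothetical evaluation tools produced for the same
# metric on the same five pages.
tool_a = [0.91, 0.47, 0.63, 0.22, 0.78]
tool_b = [0.88, 0.51, 0.70, 0.30, 0.69]

print(f"Inter-tool reliability (Spearman rho): "
      f"{spearman(tool_a, tool_b):.2f}")
```

The same function applied to one tool's output scored by two different metrics would give the inter-metric variant that the comment distinguishes.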

Comments for Section 5. A Corpus for Benchmarking Metrics

Table addressing comments for section 5
ID 25: Francois Junique (Open)
  Suggested change: ...I would suggest complementing the competition with some real and large-scale processing (several hundred sites of mixed size) to detect the real difficulties and to decide on appropriate solutions.
  Comment: The note is not providing solutions; it is about a roadmap, and what is suggested here was already partly done in our Interacting with Computers paper.
  Resolution: No action taken.

ID 26: Annika Nietzio (Open)
  Suggested change: A comment on tools: software has bugs, which can of course affect the validity of the results. A benchmarking corpus could be used to improve the quality of the software. It is important to define whether the corpus should consist of labeled or unlabeled examples. And what would the labels be? Binary labels (accessible vs. not accessible) are not sufficient, but on the other hand any more complex definition of labels would be a metric in itself.
  Comment: This is an interesting suggestion (see also the sketch after this table).
  Resolution: These questions have been included in section 5.1.

ID 27: Annika Nietzio (Open)
  Suggested change: It would be helpful to clarify the relationship of "user-tailored metrics" to the concept of "accessibility-in-use" mentioned earlier.
  Comment: Agreed.
  Resolution: We have clarified this in the first paragraph of section 5.3.

ID 28: Annika Nietzio (Open)
  Suggested change: And finally, some input from the discussions during the ICCHP session: a topic that came up several times was the idea of enhancing automated tests by combining them with expert or user input. This should also be mentioned in the roadmap.
  Comment: Agreed.
  Resolution: We have added section 5.6 addressing this issue.
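
Comment #26 asks whether a benchmarking corpus should hold labeled or unlabeled examples, and what the labels should be. The sketch below shows one hypothetical record layout in which unlabeled pages, coarse binary labels, and richer per-barrier annotations can coexist; the field names and severity scale are assumptions, not part of the research note, and, as the comment itself observes, any richer labeling scheme embeds metric-design decisions of its own.

```python
# Hypothetical sketch of a benchmark corpus entry (comment #26) that
# supports unlabeled pages, binary labels, and richer per-barrier
# annotations. All field names and the severity scale are invented.

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CorpusPage:
    url: str
    html: str
    # None = unlabeled; True/False = coarse binary label.
    accessible: Optional[bool] = None
    # Richer labels: barrier type -> severity on an agreed scale.
    # Choosing this scale is itself a metric-design decision.
    barriers: dict[str, int] = field(default_factory=dict)


page = CorpusPage(
    url="https://example.org/",
    html="<html>...</html>",
    accessible=False,
    barriers={"missing-alt-text": 3, "low-contrast": 1},
)
print(page.accessible, page.barriers)
```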