A BibTeX file is provided; see also information on citing and referencing this document .
Copyright © 2012 W3C ® ( MIT , ERCIM , Keio ), All Rights Reserved. W3C liability , trademark and document use rules apply.
Web accessibility metrics are an invaluable tool for researchers, developers, governmental agencies and end users. Accessibility metrics help to indicate the accessibility level of individual websites, and even support large-scale surveys of the accessibility of many websites. Recently, a plethora of metrics has been released to complement the A, AA, and AAA Levels measurement used by the WAI guidelines. However, the validity and reliability of most of these metrics are unknown, and those making use of them are taking the risk of using inappropriate metrics. In order to address these concerns, this note provides a framework that considers validity, reliability, sensitivity, adequacy and complexity as the main qualities that a metric should have. A symposium was organized to observe how current practices are addressing such qualities. We found that metrics addressing validity issues are scarce, although some efforts can be perceived as far as inter-tool reliability is concerned. This is something that the research community should be aware of, as we might be making futile efforts by using metrics whose validity and reliability are unknown. The research realm is perhaps not mature enough, or we do not have the right methods and tools. We therefore try to shed some light on the possible paths that could be taken so that we can reach a maturity point.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This 12 July 2012 First Public Working Draft of Research Report on Web Accessibility Metrics is intended to be published and maintained as a W3C Working Group Note after review and refinement. The note provides an initial consolidated view of the outcomes of the Website Accessibility Metrics Online Symposium held on 5 December 2011.
The Research and Development Working Group ( RDWG ) invites discussion and feedback on this draft document by researchers and practitioners interested in metrics for web accessibility, in particular by participants of the online symposium. Specifically, RDWG is looking for feedback on:
Please send comments on this Research Report on Web Accessibility Metrics document by 31 August 2012 to public-wai-rd-comments@w3.org (publicly visible mailing list archive ).
Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document has been produced by the Research and Development Working Group ( RDWG ), as part of the Web Accessibility Initiative ( WAI ) International Program Office .
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy . The group does not expect this document to become a W3C Recommendation. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy .
The W3C/WAI Web Content Accessibility Guidelines (WCAG) and other WAI guidelines provide discrete conformance levels "A", "AA", and "AAA" to measure the level of accessibility. In many cases more granular scores would help provide a more precise indication of the level of accessibility. However, identifying valid, reliable, sensitive, adequate, and computable metrics that produce such scores is a non-trivial task with several challenges. This research report explores the qualities that such metrics need to demonstrate, based on input from an Online Symposium on Website Accessibility Metrics held on 5 December 2011.
In the web engineering domain, a metric is a procedure for measuring a property of a web page or website. A metric can be the number of links, the size in KB of an HTML file, the number of users that click on a certain link, or the perceived ease of use of a web page. In the realm of web accessibility, amongst others, a metric can measure the following qualities:
In order to measure more abstract qualities, more sophisticated metrics are built upon more basic ones. For instance, readability metrics [ readability ] take into account the number of syllables, words and sentences contained in a document in order to measure the complexity of a text.
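As an illustration of how such compound metrics are built from basic counts, the following minimal Python sketch computes the well-known Flesch Reading Ease score from sentence, word and syllable counts. The naive syllable counter is an approximation introduced here for brevity; it is not part of any cited metric.

```python
import re

def count_syllables(word: str) -> int:
    # Naive heuristic: count groups of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    """Combine sentence, word and syllable counts into one score.

    Higher scores indicate easier text (roughly 0-100).
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

print(flesch_reading_ease("The cat sat on the mat. It was happy."))
```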
Similarly, metrics aiming at measuring web accessibility have been built on specific qualities, which can be inherent in a website (such as images with no alt attribute) or observed from human behavior (e.g., user satisfaction ratings or performance indexes such as number of errors). For instance, the failure-rate metric computes the ratio between the number of accessibility violations of a particular set of criteria over the number of failure points for the same criteria.
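A minimal sketch of the failure-rate computation just described; the numbers are invented for illustration.

```python
def failure_rate(violations: int, failure_points: int) -> float:
    """Ratio of actual violations to the points where the given
    criteria could potentially fail (e.g., every img element is a
    potential failure point for the alt-text criterion)."""
    if failure_points == 0:
        return 0.0  # the criteria are not applicable to this page
    return violations / failure_points

# 3 images lacking alt text out of 12 images on the page:
print(failure_rate(3, 12))  # 0.25
```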
As a result of the computation of accessibility metrics, different types of data can be produced:
Web accessibility can be viewed and defined in different ways [ Brajnik08 ]. One way is to consider whether a web page/website is conformant to a set of requirements such as those defined by WCAG 2.0 or by Section 508 . Even if WCAG 2.0 conformance levels are well specified and, as seen above, they are ordinal values, some other metrics could be defined on the basis of success criteria and their sufficient, advisory and failure techniques. We call these metrics, which are based on whether success criteria of given guidelines are met, conformance-based metrics .
Other metrics can be defined if one assumes that accessibility is a quality that differs from conformance. For example, the US federal procurement policy known as Section 508 defines accessibility as the extent to which "a technology [...] can be used as effectively by people with disabilities as by those without it". Provided that effectiveness can be measured, such metrics could yield results that differ from conformance-based ones. Analogous to the notion of "quality in use" for software, we call these accessibility-in-use metrics to emphasize that they try to measure performance indexes that can be shown by real users when using the website in specific situations. In addition, they do not require the notion of conformance with respect to a set of principles. Traditional usability metrics such as effectiveness, efficiency and satisfaction could be considered accessibility-in-use metrics. Also, any measure of the perceived accessibility of a web page by users is a metric belonging to this second group. Notice that this notion of accessibility covers not only accessibility of the content of web pages, but also accessibility of user agents and features of assistive technologies, and could even address different levels of expertise that users have with these resources.
Most of the existing metrics - see a review in [ Vigo11a ] - are of the former type because they are mainly built upon criteria implemented by automated testing tools, such as the number of violations or their WCAG priority. Moreover, in order to overcome the lack of sensitivity and precision of ordinal metrics, conformance metrics often yield ratio scores. The main reason for the widespread use of these types of metrics is their low cost in terms of time and human resources, since they are based on automated tools. Although no human intervention (experts' audits or user tests) is required in the process, this does not necessarily mean that only fully automated success criteria are to be considered. Some metrics estimate the violation rate of semi-automated success criteria and purely manual ones, like in [ Vigo07 ]; some others adopt an optimistic vs. conservative approach to their violation rate [ Lopes ]. The error rate of these estimations, due to their reliance on automated testing tools, is the major weakness of automated conformance metrics. In fact, these metrics inherit tool shortcomings, such as false positives and false negatives, that affect their outcome [ Brajnik04 ]. A benchmarking survey on automated conformance metrics concluded that existing metrics are quite divergent and most of them do not do a good job in distinguishing accessible pages from non-accessible pages [ Vigo11a ]. On the other hand, there are metrics that combine testing tool metrics and those produced by human review, with the goal of estimating such errors; one example is SAMBA [ Brajnik07 ]. Other metrics do not rely on tools at all; an example is the evaluation done with the AIR method [ AIR ].
There are several scenarios that could benefit from web accessibility metrics:
Several quality factors can be defined for web accessibility metrics; these factors can be used to assess how applicable a metric is in a certain scenario and, potentially, to characterize the risks inherent in the use of a given metric. As discussed in [ Vigo11a ], validity, reliability, sensitivity, adequacy and complexity appear to be the most important factors.
This attribute is related to the extent to which the measurements obtained by a metric reflect the accessibility of the website to which it is applied, and this could depend on the notion of accessibility: conformance vs. accessibility-in-use . The former refers to how a web document meets specific criteria (i.e., principles and guidelines), whereas the latter indicates how the interaction is perceived. These two perspectives are not necessarily the same, which can be illustrated as follows: a picture without alternative text violates a guideline, making a web page non-conformant; however, the lack of alternative text may not be perceived as an obstacle if the goal of the user is to navigate or even purchase an item in an e-commerce website.

As discussed above, most existing conformance metrics are plagued by their reliance on automated testing tools and do not provide means to estimate the error rate of tools. Furthermore, the way the metric itself is defined could lead to other sources of errors, reducing its validity. For example, the failure rate should not be used as a measure of accessibility-in-use; using it as a measure of conformance is also controversial: it is sometimes claimed that it measures how well developers coped with accessibility features rather than providing an estimation of conformance [ Brajnik11 ]. Validity with respect to accessibility-in-use should cope with the evaluator effect [ Hornbæk ] and the lack of validity of users in their severity ratings [ Petrie ].

Validity is by far the most important quality attribute for accessibility metrics. Without it we would not know what a metric really measures. The risk of not being able to characterize the validity of metrics is that potential users of metrics would choose those that appear easy to employ and that provide seemingly plausible results. In a sense, people may therefore choose a metric because it is simple rather than because it is a good metric, with the unforeseen consequence that incorrect claims and decisions could be made regarding web pages and websites. These are important issues as they strike at the heart of our notions of conformance. We are assessing the validity of a user interface without knowing if our method of assessment is actually valid itself.
This attribute is related to the reproducibility and consistency of scores, i.e. the extent to which they are the same when evaluations of the same web pages are carried out in different contexts (different tools, different people, different goals, different time). Reliability of a metric depends on several layers that are interconnected. These range from the underlying tools (what happens if we switch tools?), to underlying guidelines (what happens if we switch guidelines?), to the evaluation process itself (if random choices are made, for example when scanning a large website).
The inherent inconsistency of unreliable metrics limits people's ability to predict metric behavior; it also limits the ability of metrics to be comprehended at a deeper level. However, reliability will not always be necessary. For instance, if we switch guideline sets we should not expect similar results, as a different problem coverage is assumed.
It is worth noting that one of the aims of this research report is to help identify errors, or spot gaps in current metrics. The idea is that we can thereby confidently reject faulty metrics, or improve them in order to halt a process of "devaluation". This devaluation happens in the mind of the end user, in terms of the perceived value of the "ideal" of conformance. This process can be a by-product of poor metrics themselves or come from misunderstanding the output from metrics that are not clear or easy for end users to understand. In other words, if a metric is not stable, it is very difficult to effectively use it as a tool of either analysis or comprehension.
Metric sensitivity is a measure of how changes in any given website are reflected in the metric output. Ideally we would like metrics not to be too sensitive, so that they are robust and do not over-react to small changes in web content. This is especially important when the metric is applied to highly dynamic websites, as we show later in this note.
This is a general quality, encompassing several properties of accessibility metrics, for instance: the type of data used to represent scores, the precision in terms of the resolution of a scale, normalization, and the span covered by actual values of the metric (distribution). These attributes determine if the metric can be suitably deployed in a given scenario. For example, to be able to compare accessibility levels of different websites (as would happen in the large-scale scenario discussed above) metrics should provide normalized values, as otherwise comparisons are not viable. If the distribution of values of the metric is concentrated on a small interval (such as between 0.40 and 0.60, instead of [0, 1]), large changes in accessibility could lead to small changes in the metric; round-off errors could influence the final outcomes.
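A minimal sketch of one way to obtain normalized, comparable values: min-max rescaling of raw scores over the observed sample. This is only one possible normalization, shown here to make the adequacy concern concrete; the raw values are invented.

```python
def min_max_normalize(scores):
    """Rescale raw metric values onto [0, 1] so that websites can be
    compared; a distribution squeezed into a narrow band (e.g.,
    0.40-0.60) is stretched over the full range."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)  # degenerate case: all sites score alike
    return [(s - lo) / (hi - lo) for s in scores]

raw = [0.42, 0.47, 0.55, 0.59]      # concentrated raw scores
print(min_max_normalize(raw))       # [0.0, 0.294..., 0.764..., 1.0]
```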
Depending on the type and quantity of different data and the algorithm that is used to compute a metric, the process can be more or less computationally demanding with respect to certain resources, such as time, processors, bandwidth and memory. The complexity of a metric therefore reflects the computational and human resources it requires, which can prevent stakeholders from embracing accessibility metrics. Some scenarios require metrics to be relatively simple (such as when metrics are used for adaptations of the user interface and must be computed on the fly). However, some metrics may require high bandwidth to crawl large websites, large storage capacity or increased computing power. For those metrics that rely on human judgment, another complexity aspect is related to the workflow process that has to be established to resolve conflicts and synthesize a single value. As a result, these metrics may not suit particular application scenarios, budgets or resources.
The papers that were presented at the symposium cover a broad span of issues, addressing the quality factors we outlined above to different extents. However, they provide new insights and raise new questions that help shape future research avenues (see section 4).
Validity in terms of conformance was tackled by Vigo et al. [ Vigo11b ] by comparing automated accessibility scores with the ones given by a panel of experts, obtaining a strong positive correlation. Inter-tool reliability of metrics was also addressed by comparing the behavior of the WAQM metric assessing 1500 pages with two different tools (EvalAccess and LIFT). A very strong correlation was found when pages were ranked according to their scores; but to obtain the same effect with ratio scores the metric requires some ad-hoc adjustment. Finally, the authors investigated inter-guideline reliability between WCAG 1.0 and WCAG 2.0, finding again a very strong correlation between ordinal values, although this effect fades out when looking at ratio data.
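The distinction between agreement on rankings and agreement on ratio scores can be checked with standard correlation statistics, as in the following sketch; the two score lists are invented stand-ins for the outputs of two evaluation tools.

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical scores for the same five pages from two testing tools.
tool_a = [0.90, 0.72, 0.55, 0.40, 0.18]
tool_b = [0.80, 0.70, 0.52, 0.45, 0.20]

rho, _ = spearmanr(tool_a, tool_b)  # do the tools rank pages alike?
r, _ = pearsonr(tool_a, tool_b)     # do the raw ratio scores agree?
print(f"rank correlation: {rho:.2f}, score correlation: {r:.2f}")
```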
Fernandes and Benavidez [ JFernandes ] addressed metric reliability (UWEM and web@X) by comparing two tools (eChecker and eXaminator) with a different interpretation of success criteria and coverage, assessing the accessibility of about 300 pages. An initial experiment shows there is a moderate positive correlation between those tools. Reliability of metrics very often depends on the reliability of the underlying testing tools, and it is well known that different tools produce different results on the same pages. During the webinar it was noted that this problem could lead to situations where low credibility is attributed to tools and metrics; metrics would make it even more difficult to compare different outcomes and diagnose bad behavior. In addition, stakeholders could be tempted to adopt the metrics that provide the best results on their pages, or those that can be more easily interpreted and explained, regardless of whether they are related to accessibility. However, as we mentioned previously, we should be cautious about when we should expect reliable behavior across tools, guidelines or domains.
The availability of metrics in terms of publicly available algorithms, APIs or tools is critical for their broad adoption. Providing such mechanisms will help facilitate a broader adoption of metrics by stakeholders - especially by those that, even if interested in using them, do not have the resources to operate and articulate them. There are some incipient proposals in this direction that implement a set of metrics: Naftali and Clúa [ Naftali ] presented a platform where failure-rate and UWEM are deployed; however, this entails that human intervention is still required, as the system needs the input of experts to discard false positives. There are some other tools that help to keep track of the accessibility level of websites over time [ Battistelli11a ]. These sorts of tools tend to target the accessibility monitoring of websites within specific geographical locations, normally municipalities or regional governments. The tool support provided by Fernandes et al. [ NFernandes11a ], QualWeb, incorporates a feature within traditional accessibility testing tools to detect templates; the novelty of this approach is that the metric employed uses the accessibility of the template as a baseline. As a result, accessibility is measured from that starting point. If the accessibility problems of the template were repaired, these fixes would automatically spread to all the pages built upon the template. Therefore, the distance from a particular web page to the template (or baseline) can be used to estimate the effort required to fix this instance, which is very valuable for quality assurance.
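The distance-to-baseline idea can be pictured with a toy computation such as the one below. This is not the QualWeb algorithm, just a sketch where the distance is the number of mark-up tags by which a page differs from its template.

```python
from collections import Counter

def template_distance(page_tags, template_tags):
    """Number of tags that would have to be added or removed to turn
    the page back into its template baseline - a crude proxy for the
    repair effort beyond the shared template."""
    extra = Counter(page_tags) - Counter(template_tags)
    missing = Counter(template_tags) - Counter(page_tags)
    return sum(extra.values()) + sum(missing.values())

template = ["html", "head", "body", "nav", "main", "footer"]
page = ["html", "head", "body", "nav", "main", "img", "img", "footer"]
print(template_distance(page, template))  # 2: the page adds two img tags
```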
Large-scale evaluation and measurement is required for websites that contain a great number of pages or when a number of websites have to be evaluated. Managing these large volumes of data cannot be done without the help of automated tools. An example of large websites is provided by Fernandes et al. [ NFernandes11a ]. They present a method for template detection that aims at lessening the computing effort of evaluating large amounts of pages. This is useful for websites that are substantially built on templates, such as on-line stores. In the on-line store example, normally, the only content that changes is the item to be sold and the related information; the layout and internal structure stay the same. One example that contemplates the measurement of the accessibility of a large number of distinct websites is depicted by Battistelli et al. [ Battistelli11a ] using the BIF metric; similarly, AMA is a platform that enables keeping track of a large number of websites, which is used to measure how conformant the websites of specific geographical locations are. Finally, Nietzio et al. [ Nietzio ] present a metric to measure WCAG 2.0 conformance in the context of a platform to keep track of the accessibility of Norwegian municipalities.
Battistelli et al. [ Battistelli11a ] present a metric to quantify the compliance of documents with respect to their DTDs. Instead of measuring this compliance as if it was a binary variable (conformant/non-conformant), compliance is measured as the distance from the current document to the ideal one. Although its relationship with accessibility is not very apparent, code compliance is one of the technical accessibility requirements according to the Italian regulation, and it also impacts those success criteria that require the correct use of standards [see WCAG 2.0 Success Criterion 4.1.1 Parsing ]. Also, this approach could be followed to measure accessibility. For instance, a web page could be improved until it was accessible according to guidelines or until it provided an acceptable experience to end users. The accessibility level of the non-accessible page could be computed in terms of the effort required to build the ideal web page, measured in coding lines, mark-up tags introduced or removed, or time. Another approach that tackles a particular accessibility problem is addressed by Rello and Baeza-Yates [ Rello ], who address the measurement of text legibility. This is something that affects the understandability of a document, a fundamental accessibility principle [see the Understandable principle]. The interesting contribution of this work is its reliance on a quantitative model of spelling errors automatically computed from a large set of pages handled by a search engine. Compliance with the DTD and legibility of a web document can be considered not only accessibility success criteria but also quality issues.
When it comes to innovative ways of measuring, the distance from a given document to a reference model can inspire similar approaches to measure web accessibility. As suggested by [ Battistelli11b ], compliance can be measured by considering the distance between a given document and an ideal (or acceptable) one. In this case the distance can be measured, for instance, in terms of missing hypertext tags or the effort required to accomplish changes. Another example is illustrated by measuring the distance from an instance document to a baseline template using a metric [ NFernandes11a ]. Another novel way of measuring accessibility is to use a grading scale and an arbitration process, as proposed by Fischer and Wyatt [ Fischer ]: the use of a five-point Likert scale aims at going beyond a binary accessible/non-accessible scoring scale. It would be interesting to see, in the future, how the final outcome of an evaluation depends on the original scores given by individual evaluators and what level of agreement exists between evaluators before arbitration takes place.
Vigo [ Vigo11c ] proposes a method by which, depending on the context, the number of checkpoints to be met changes. Nietzio et al. [ Nietzio ] suggest a stepwise method to measure conformance to WCAG 2.0, where aspects of success criteria applicability or tool support are considered. Such a method adapts to the specific testing procedures of WCAG 2.0 success criteria (SC) by providing a set of decision rules: first, the applicability of the SC is analyzed; second, if applicable, the SC is tested; third, if a common failure is not found, the implementation of the sufficient techniques is checked; and finally, tool support is checked for the techniques identified in the previous step. The metric computed as a result of this process is a failure rate that also takes into account the logic underlying necessary, sufficient and counter-example techniques for each SC.
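One possible reading of these decision rules is sketched below; the field names on the SC record are hypothetical, and the handling of undecidable cases (deferral to manual review) is an assumption rather than part of the published method.

```python
from dataclasses import dataclass

@dataclass
class SC:
    # Hypothetical per-criterion test outcomes.
    applicable: bool
    common_failure_found: bool
    sufficient_technique_ok: bool
    tool_supported: bool

def evaluate(sc: SC):
    """Returns True (SC fails), False (SC passes) or None (cannot be
    decided automatically and is left for manual review)."""
    if not sc.applicable:            # step 1: applicability
        return None
    if sc.common_failure_found:      # step 2: test the SC
        return True
    if sc.sufficient_technique_ok:   # step 3: sufficient techniques
        return False
    if not sc.tool_supported:        # step 4: tool support for techniques
        return None
    return True                      # tool-checked techniques fail

def failure_rate(criteria):
    decided = [r for sc in criteria if (r := evaluate(sc)) is not None]
    return sum(decided) / len(decided) if decided else 0.0
```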
Vigo [ Vigo11c ] proposes a method that not only considers guidelines when measuring accessibility conformance, but also considers the specific features of the device (e.g., screen size, keyboard support) as well as the assistive technology operated by the users. Including these contextual characteristics of the interaction could lead to more faithful measurements of the experience. Finally, Sloan and Kelly [ Sloan ] claim that understanding accessibility as conformance to guidelines is risky in those countries (e.g., the UK) where accessibility assessment is not limited to guidelines but also focuses on the delivered service and user experience. Therefore, they encourage moving forward and embracing accessibility in terms of user experience and thinking of conformance of the production process, rather than conformance of a product that constantly changes. This perspective is novel in that it looks beyond the current conformance paradigm and aims to tap more into the user experience, and this is something that is not necessarily defined by current methods of technical validation or document conformance.
The authors of the above papers were asked about some aspects of web accessibility metrics. The first aspect is about the target users of metrics; the goal of this question is to ascertain whether metrics researchers have in mind application scenarios or the profile of the end user who will make decisions based on the scores provided by metrics. Our survey shows that the majority of respondents do not have in mind a specific end user of metrics, or their answers are too generic. However, three papers are focused on web accessibility benchmarking (see [ Nietzio , Battistelli11a , JFernandes ]) and some others could be applied in this domain. This means that this is the application scenario with the broadest acceptance and where the application of metrics is taking off. In the remaining scenarios (quality assurance, information retrieval and adaptive web) there are also potential applications, although the intent to apply metrics in these scenarios is not evident.
Second, we wanted to know whether accessibility metrics researchers are aware of the costs and risks incurred by having incorrect values for metrics. Most respondents consider that the validity and reliability of metrics should be guaranteed, although many contemplate it as future work. There is some tendency towards employing experts in such validations, although most agree that users will have the last word as far as validation is concerned. This is closely related to our last question, about the research community's point of view on measuring accessibility beyond conformance metrics. All the answers we received claimed that measuring accessibility in terms of user experience should be explored more thoroughly.
This research report aims at highlighting current efforts in investigating accessibility metrics as well as uncovering existing challenges. Research on web accessibility metrics is taking off as the benefits of using them become apparent; however, their adoption is far from widespread. In addition to their relative novelty, this may occur because (1) there is a plethora of metrics out there, and frameworks for metrics comparison that show their strengths and weaknesses are relatively recent [ Vigo11a ]; (2) quality frameworks require further investigation, as there are unexplored areas for each of the defined qualities - these areas are uncovered in section 4.1; (3) the low validity of existing metrics calls for a standardized testbed to show how they perform with regard to metric quality. Setting up a corpus of web pages for benchmarking purposes could be the first step towards this goal. It would work in the same way that the Information Retrieval community tests the performance of their algorithms [see the Text Retrieval Conference, TREC ] - see section 4.2. A side-effect of the lack of validity and reliability of metrics is their lack of credibility. This could partially be tackled by the mentioned benchmarking corpus; however, the credibility problem goes beyond that - see section 4.3. Finally, some other issues such as user-tailored metrics and dealing with dynamic content require special attention from those who aim at conducting research on web accessibility metrics.
To be more precise, focusing on accessibility metric quality, there are still many challenges to pursue. The way a metric satisfies the validity, reliability, sensitivity, adequacy and complexity qualities remains open and can be addressed by the following questions. Even if all qualities are important, we emphasize that the validity and reliability of metrics should be given priority. It does not matter how sensitive or adequate a metric is if we cannot ensure its reliability and especially its validity.
Studies of "validity with respect to conformance" could focus on the following research questions:
The above questions could be addressed in the following way:
Studies of "validity with respect to accessibility in use" should overcome the evaluator effect [ Hornbæk ] and lack of agreement of users in their severity ratings [ Petrie ] and could address the following questions:
Some efforts to understand metric reliability could go in the following direction:
Experiments could be set up to perform sensitivity analysis: given a set of accessibility problems in a test website, they could be systematically turned on or off, and their effects on metric values could be analyzed to find out which kinds of problems had the largest effect and under which circumstances. Provided that valid and reliable metrics were used, this could tell us which accessibility barriers would have a more or less strong impact on conformance or use.
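Such an experiment could be driven by a loop like the following sketch; `inject`, `metric` and the problem representation are all assumptions standing in for a concrete fault-injection setup.

```python
def sensitivity_analysis(base_page, problems, inject, metric):
    """Turn each accessibility problem on individually and record how
    much the metric moves from the fault-free baseline. `inject` must
    return a faulty copy of the page; `metric` scores a page."""
    baseline = metric(base_page)
    effects = {p: metric(inject(base_page, p)) - baseline for p in problems}
    # Sort by absolute impact: which barriers move the metric most?
    return sorted(effects.items(), key=lambda kv: abs(kv[1]), reverse=True)
```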
Provided that a metric is valid and reliable, research directions about metric adequacy should analyze the suitability and usefulness of its values for users in different scenarios, as well as metric visualization and presentation issues.
The most important issue about metric complexity relies on its relationship with the rest of the qualities. In this regard we can pose the following questions:
One option to have a common playground, so that the research community could shed some light on these challenges, would be to organize the same kind of competitions as the TREC experiments. Recently, some efforts have been directed towards this goal by the W3C or in the context of the BenToWeb project. There are several issues that need to be tackled.
To start with, pages known to be accessible could be collected, together with pages known not to be (because faults were injected into them, or because they were collected from other repositories such as www.fixtheweb.net ); participants would then be asked to apply their metrics to such pages and report how far apart the accessible pages are from the non-accessible ones. Another option would be to use pages from initiatives such as the one promoted by the WAI, " BAD: Before and After Demonstration ", where, for educational purposes, the process of transforming a non-accessible page into an accessible one is shown.
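A simple way to quantify "how far apart" the two groups end up is a gap statistic over the corpus, as in this sketch; a real benchmark would likely use a proper effect size or overlap measure rather than a plain difference of means.

```python
def score_separation(metric, accessible_pages, faulty_pages):
    """Difference between the mean scores a metric assigns to known
    accessible pages and to known faulty ones; a valid metric should
    keep this gap large and positive."""
    mean_ok = sum(map(metric, accessible_pages)) / len(accessible_pages)
    mean_ko = sum(map(metric, faulty_pages)) / len(faulty_pages)
    return mean_ok - mean_ko
```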
Accessibility scores are a great device to grasp the accessibility level of web pages. However, metrics can turn out to be a double-edged sword: while they enhance comprehension, they can also hide relevant information and details on the accessibility of a page. This side effect can lead end users to choose the most lenient scores among those metrics that are available. As a result, there is a risk of hindering the credibility and trust of accessibility metrics.
The fact that different evaluation tools yield different results directly affects metric validity and, in particular, metric reliability. The poor reproducibility of evaluation reports and accessibility scores has a side-effect on the perception of individuals in that the web accessibility assessment process can be regarded as not very credible.
There is a challenge for the personalization of metrics, as not all success criteria impact all users in the same way. While some have tried to group guidelines according to their impact on certain user groups, user needs can be so specific that the effect of a given barrier is more closely related to the individual's abilities and cannot be inferred from the fact that a particular user is identified as having a particular disability. Individual needs may deviate considerably from group guidelines (e.g., a motor-impaired individual having more residual physical abilities than the group guidelines foresee). There are some research actions that could be taken to improve user-tailored metrics:
Measuring something that changes over time can give different results depending on the magnitude of such changes. Modern web pages are dynamic, changing their content over time. These changes are not always a reaction to user interaction but can also be due to some other factors, such as time or location. Especially in Rich Internet Applications, these updates are frequently provoked by scripting techniques that mutate web contents. Therefore, the mark-up gives few hints to predict the behavior of a web document. Normally, the most appropriate way to assess the current instance of a dynamic web document is to retrieve and test its DOM; then its subsequent mutations should be monitored and tested. As expected, different instances of a document caused by updates show inconsistent accessibility evaluation results [ Fernandes11 ]. As a result, if a metric is sensitive enough, it should be able to reflect these updates. This area calls for research on the frequency of testing; that is, should pages be tested every time they update, or should they be retrieved at sampling intervals? Additionally, there are some other questions: what would be the accessibility score of a given URL if page updates entail changes in accessibility? Should an average of all instances be computed?
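One candidate answer to the sampling and aggregation questions above is sketched here: score the page at fixed intervals and average the results. `fetch_dom` and `metric` are hypothetical hooks, and averaging is only one possible aggregation policy.

```python
import time

def sampled_score(fetch_dom, metric, interval_s=60.0, samples=5):
    """Evaluate successive instances of a dynamic page at sampling
    intervals and aggregate the per-instance scores into one value."""
    scores = []
    for _ in range(samples):
        scores.append(metric(fetch_dom()))  # test the current DOM instance
        time.sleep(interval_s)
    return sum(scores) / len(scores)        # plain average of all instances
```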
The conformance to WAI-ARIA and the accessibility elements subsumed by HTML5 could also be explored by future accessibility metrics.
This research report introduces web accessibility metrics: they have been defined and specified, the benefits of using them have been highlighted and some possible application scenarios have been described. Spurred by the growing number of different metrics that are being released, we present a framework that encompasses the qualities that a good metric should have. As a result, metrics can be benchmarked according to their validity, reliability, sensitivity, adequacy and complexity. We believe this framework can help individuals to make decisions on the adoption of existing metrics according to the qualities required from them. In this way, there will be no need to reinvent the wheel and design new metrics if available ones already fit one's needs.
A symposium was held in order to check how metrics address the above-mentioned qualities and to keep track of current efforts targeting quality issues of accessibility metrics. The webinar provided a partial, but concrete, snapshot of most of the research activity around this topic. We found that tool reliability is a recurrent topic in this regard, and there is still a long way to go in the realm of methods and examples for metric validity, which are rare. The editors of this research report believe that more efforts should be directed to investigating the validity and reliability of metrics. Employing metrics whose validity and reliability are questionable is a very risky practice that should be avoided. We therefore claim that accessibility metrics should be used and designed responsibly.

One way to hide the inherent complexity of metrics is to provide tools that facilitate their application in an automated or semi-automated way. This need for automation comes from the necessity of assessing large volumes of data and websites; that is why large-scale analysis of accessibility calls for metrics that can easily be deployed and implemented. Some other efforts are targeting specific quality aspects of the Web, such as lexical quality or compliance to DTDs. Finally, an emerging trend aims at measuring accessibility not only in pure compliance terms. Since contextual factors play an important role in determining the quality of the user experience, accessibility measurement should be able to consider these factors by collecting and including them in the measurement process, or by observing the behavior and performance of real users in real settings, à la usability testing. This perspective can be understood as a complementary approach to current accessibility measurement practice.
Based on the needs and gaps that hinder current accessibility measurement we propose a number of research avenues that can help to boost the acceptance and quality of accessibility metrics. Mostly, quality issues of metric validity and reliability need urgent action but there are also some other actions that can help to make metrics more credible and widespread. A common corpus for metrics benchmarking would be a good step in this direction as it could potentially tackle quality and credibility issues at the same time. Dynamic content and user-tailoring aspects can open new research paths that can have strong impact on the quality of assessment practices, methodologies and tools.
This document should be cited as follows:
M. Vigo, G. Brajnik, J. O Connor, eds. Research Report on Web Accessibility Metrics. W3C WAI Research and Development Working Group (RDWG) Notes. (2012) Available at: http://www.w3.org/TR/accessibility-metrics-report
The latest version of this document is available at:
http://www.w3.org/TR/accessibility-metrics-report/
A permanent link to this version of the document is:
http://www.w3.org/TR/2012/NOTE-accessibility-metrics-report-20120712/
A BibTeX file is provided containing:
@incollection{accessibility-metrics-report_FPWD, author = {W3C WAI Research and Development Working Group (RDWG)}, title = {Research Report on Web Accessibility Metrics}, booktitle = {W3C WAI Symposium on Website Accessibility Metrics}, publisher = {W3C Web Accessibility Initiative (WAI)}, year = {2012}, month = {July}, editor = {Markel Vigo and Giorgio Brajnik and Joshue O Connor}, series = {W3C WAI Research and Development Working Group (RDWG) Notes}, type = {Research Report}, edition = {First Public Working Draft}, url = {http://www.w3.org/TR/accessibility-metrics-report}, }
The links provided in this section, including those in the BibTeX files, are permanent; see also the W3C URI Persistence Policy .
@proceedings{accessibility-metrics-proceedings, title = {W3C WAI Symposium on Website Accessibility Metrics}, year = {2011}, editor = {W3C WAI Research and Development Working Group (RDWG)}, series = {W3C WAI Research and Development Working Group (RDWG) Symposia}, publisher = {W3C Web Accessibility Initiative (WAI)}, url = {http://www.w3.org/WAI/RD/2011/metrics/}, }
@inproceedings{naftali2011, author = {Maia Naftali and Osvaldo Cl\'{u}a}, title = {Integration of Web Accessibility Metrics into a Semi-Automatic evaluation process}, booktitle = {W3C WAI Symposium on Website Accessibility Metrics}, year = {2011}, editor = {Markel Vigo and Giorgio Brajnik and Joshue O Connor}, pages = {article 1}, url = {http://www.w3.org/WAI/RD/2011/metrics/paper1/}, }
@inproceedings{battistelli2011a, author = {Matteo Battistelli and Silvia Mirri and Ludovico Antonio Muratori and Paola Salomoni}, title = {Measuring accessibility barriers on large scale sets of pages}, booktitle = {W3C WAI Symposium on Website Accessibility Metrics}, year = {2011}, editor = {Markel Vigo and Giorgio Brajnik and Joshue O Connor}, pages = {article 2}, url = {http://www.w3.org/WAI/RD/2011/metrics/paper2/}, }
@inproceedings{nfernandes2011, author = {N\'{a}dia Fernandes and Rui Lopes and Lu\'{i}s Carri\c{c}o}, title = {A Template-aware Web Accessibility metric}, booktitle = {W3C WAI Symposium on Website Accessibility Metrics}, year = {2011}, editor = {Markel Vigo and Giorgio Brajnik and Joshue O Connor}, pages = {article 3}, url = {http://www.w3.org/WAI/RD/2011/metrics/paper3/}, }
@inproceedings{battistelli2011b, author = {Matteo Battistelli and Silvia Mirri and Ludovico Antonio Muratori and Paola Salomoni}, title = {A metrics to make different DTDs documents evaluations comparable}, booktitle = {W3C WAI Symposium on Website Accessibility Metrics}, year = {2011}, editor = {Markel Vigo and Giorgio Brajnik and Joshue O Connor}, pages = {article 4}, url = {http://www.w3.org/WAI/RD/2011/metrics/paper4/}, }
@inproceedings{rello2011, author = {Luz Rello and Ricardo Baeza-Yates}, title = {Lexical Quality as a Measure for Textual Web Accessibility}, booktitle = {W3C WAI Symposium on Website Accessibility Metrics}, year = {2011}, editor = {Markel Vigo and Giorgio Brajnik and Joshue O Connor}, pages = {article 5}, url = {http://www.w3.org/WAI/RD/2011/metrics/paper5/}, }
@inproceedings{vigo2011a, author = {Markel Vigo and Julio Abascal and Amaia Aizpurua and Myriam Arrue}, title = {Attaining Metric Validity and Reliability with the Web Accessibility Quantitative Metric}, booktitle = {W3C WAI Symposium on Website Accessibility Metrics}, year = {2011}, editor = {Markel Vigo and Giorgio Brajnik and Joshue O Connor}, pages = {article 6}, url = {http://www.w3.org/WAI/RD/2011/metrics/paper6/}, }
@inproceedings{fischer2011, author = {Detlev Fischer and Tiffany Wyatt}, title = {The case for a WCAG-based evaluation scheme with a graded rating scale}, booktitle = {W3C WAI Symposium on Website Accessibility Metrics}, year = {2011}, editor = {Markel Vigo and Giorgio Brajnik and Joshue O Connor}, pages = {article 7}, url = {http://www.w3.org/WAI/RD/2011/metrics/paper7/}, }
@inproceedings{jfernandes2011, author = {Jorge Fernandes and Carlos Benavidez}, title = {A zero in eChecker equals a 10 in eXaminator: a comparison between two metrics by their scores}, booktitle = {W3C WAI Symposium on Website Accessibility Metrics}, year = {2011}, editor = {Markel Vigo and Giorgio Brajnik and Joshue O Connor}, pages = {article 8}, url = {http://www.w3.org/WAI/RD/2011/metrics/paper8/}, }
@inproceedings{vigo2011b, author = {Markel Vigo}, title = {Context-Tailored Web Accessibility Metrics}, booktitle = {W3C WAI Symposium on Website Accessibility Metrics}, year = {2011}, editor = {Markel Vigo and Giorgio Brajnik and Joshue O Connor}, pages = {article 9}, url = {http://www.w3.org/WAI/RD/2011/metrics/paper9/}, }
@inproceedings{sloan2011, author = {David Sloan and Brian Kelly}, title = {Web Accessibility Metrics For A Post Digital World}, booktitle = {W3C WAI Symposium on Website Accessibility Metrics}, year = {2011}, editor = {Markel Vigo and Giorgio Brajnik and Joshue O Connor}, pages = {article 10}, url = {http://www.w3.org/WAI/RD/2011/metrics/paper10/}, }
@inproceedings{nietzio2011, author = {Annika Nietzio and Mandana Eibegger and Morten Goodwin and Mikael Snaprud}, title = {Towards a score function for WCAG 2.0 benchmarking}, booktitle = {W3C WAI Symposium on Website Accessibility Metrics}, year = {2011}, editor = {Markel Vigo and Giorgio Brajnik and Joshue O Connor}, pages = {article 11}, url = {http://www.w3.org/WAI/RD/2011/metrics/paper11/}, }
Participants of the W3C WAI Research and Development Working Group (RDWG) involved in the development of this document include: Christos Kouroupetroglou, Giorgio Brajnik, Joshue O Connor, Klaus Miesenberger, Markel Vigo, Peter Thiessen, Shadi Abou-Zahra, Shawn Henry, Simon Harper, Vivienne Conway, and Yeliz Yesilada.
RDWG would also like to thank the chairs and scientific committee members as well as the paper authors of the RDWG online symposium on Website Accessibility Metrics .
This document was developed with support from the WAI-ACT Project .