Editorial
note:
This
section
is
a
rough
draft.
It
will
be
edited
to
align
with
How
People
with
Disabilities
Use
the
Web
once
that
document
is
complete.
This
draft
is
included
now
to
provide
general
background
for
sections
2
and
3
of
this
document.
Comprehension
of
media
may
be
affected
by
loss
of
visual
function,
loss
of
audio
function,
cognitive
issues,
or
a
combination
of
all
three.
Cognitive
disabilities
may
affect
access
to
and/or
comprehension
of
media.
Physical
disabilities
such
as
dexterity
impairment,
loss
of
limbs,
or
loss
of
use
of
limbs
may
affect
access
to
media.
Once
richer
forms
of
media,
such
as
virtual
reality,
become
more
commonplace,
tactile
issues
may
come
into
play.
Control
of
the
media
player
can
be
an
important
issue,
e.g.,
for
people
with
physical
disabilities; however,
this
is
typically
not
addressed
by
the
media
formats
themselves,
but
is
a
requirement
of
the
technology
used
to
build
the
player.
2.1
Blindness
People
who
are
blind
cannot
access
information
if
it
is
presented
only
in
the
visual
mode.
They
require
information
in
an
alternative
representation,
which
typically
means
the
audio
mode,
although
information
can
also
be
presented
as
text.
It
is
important
to
remember
that
not only is the main video inaccessible, but so is any
other
visible
ancillary
information
such
as
stock
tickers,
status
indicators,
or
other
on-screen
graphics,
as
well
as
any
visual
controls
needed
to
operate
the
content.
Since
people
who
are
blind
use
a
screen
reader
and/or
refreshable
braille
display,
these
assistive
technologies
(ATs)
need
to
work
hand-in-hand
with
the
access
mechanism
provided
for
the
media
content.
2.2
Low
vision
People
with
low
vision
can
use
some
visual
information.
Depending
on
their
visual
ability
they
might
have
specific
issues
such
as
difficulty
discriminating
foreground
information
from
background
information,
or
discriminating
colors.
Glare
caused
by
excessive
scattering
in
the
eye
can
be
a
significant
challenge,
especially
for
very
bright
content
or
surroundings.
They
may
be
unable
to
react
quickly
to
transient
information,
and
may
have
a
narrow
angle
of
view
and
so
may
not
detect
key
information
presented
temporarily
where
they
are
not
looking,
or
in
text
that
is
moving
or
scrolling.
A
person
using
a
low-vision aid will likely use screen magnification software.
This
means
that
they
will
only
be
viewing
a
portion
of
the
screen,
and
so
must
manage
tracking
media
content
via
their
AT.
They
may
have
difficulty
reading
when
text
is
too
small,
has
poor
background
contrast
(too
high
or
too
low),
or
when
outlined
or
other
fancy
font
types
or
effects
are
used.
If
the
font
is
an
image,
it
is
likely
to
appear
grainy
when
magnified.
They
may
be
using
an
AT
that
adjusts
all
the
colors
of
the
screen,
such
as
inverting
the
colors,
so
the
media
content
must
be
viewable
through
the
AT.
Users
with
low
vision
will
often
benefit
from
the
same
text
streams
and
instructions
that
are
sometimes
hidden
or
displayed
off
screen
for
users
of
screen
readers
or
refreshable
Braille.
2.3
Atypical
color
perception
A
significant
percentage
of
the
population
has
atypical
color
perception,
and
may
not
be
able
to
discriminate
between
different
colors,
or
may
miss
key
information
when
coded
with
color
only.
They
might
have
difficulty
discriminating
foreground
information
from
background
information,
or
discriminating
colors.
Such
issues
can
be
minimized
when
the
user
has
the
ability
to
customize
the
color
and
contrast
of
text
content.
2.4
Deafness
People
who
are
deaf
generally
cannot
use
audio.
Thus,
an
alternative
representation
is
required,
typically
through
synchronized
captions
and/or
sign
translation.
2.5
Hard
of
hearing
People
who
are
hard
of
hearing
may
be
able
to
use
some
audio
material,
but
might
not
be
able
to
discriminate
certain
types
of
sound,
and
may
miss
any
information
presented
as
audio
only
if
it
contains
frequencies
they
can't
hear,
or
is
masked
by
background
noise
or
distortion.
They
may
miss
audio
which
is
too
quiet,
or
of
poor
quality.
Speech
may
be
challenging
if
it
is
too
fast
and
cannot
be
played
back
more
slowly.
Information
presented
using
multichannel
audio
(e.g.,
stereo)
may
not
be
perceived
by
people
who
are
deaf
in
one
ear.
2.6
Deaf-blind
Individuals
who
are
deaf-blind
have
a
combination
of
conditions
that
may
result
in
one
of
the
following:
blindness
and
deafness;
blindness
and
difficulty
in
hearing;
low
vision
and
deafness;
or
low
vision
and
difficulty
in
hearing.
Depending
on
their
combination
of
conditions,
individuals
who
are
deaf-blind
may
need
captions
that
can
be
enlarged,
changed
to
high-contrast
colors,
or
otherwise
styled;
or
they
may
need
captions
and/or
described
video
that
can
be
presented
with
AT
(e.g.,
a
refreshable
braille
display).
They
may
need
synchronized
captions
and/or
described
video,
or
they
may
need
a
non-time-based
transcript
which
they
can
read
at
their
own
pace.
2.7
Physical
impairment
People
with
physical
disabilities
such
as
poor
dexterity,
loss
of
limbs,
or
loss
of
use
of
limbs
may
use
the
keyboard
alone
rather
than
the
combination
of
a
pointing
device
plus
keyboard
to
interact
with
content
and
controls,
or
may
use
a
switch
with
an
on-screen
keyboard,
or
other
assistive
technology.
The
player
itself
must
be
usable
via
the
keyboard
and
pointing
devices.
The
user
must
have
full
access
to
all
player
controls,
including
methods
for
selecting
alternative
content.
2.8
Cognitive
and
neurological
disabilities
Cognitive
and
neurological
disabilities
include
a
wide
range
of
conditions
that
may
include
intellectual
disabilities
(called
learning
disabilities
in
some
regions),
autism-spectrum
disorders,
memory
impairments,
mental-health
disabilities,
attention-deficit
disorders,
audio-
and/or
visual-perceptive
disorders,
dyslexia
and
dyscalculia
(called
learning
disabilities
in
some
regions),
or
seizure
disorders.
Necessary
accessibility
supports
vary
widely
for
these
different
conditions.
Individuals
with
some
conditions
may
process
information
aurally
better
than
by
reading
text;
therefore,
information
that
is
presented
as
text
embedded
in
a
video
should
also
be
available
as
audio
descriptions.
Individuals
with
other
conditions
may
need
to
reduce
distractions
or
flashing
in
presentations
of
video.
Some
conditions
such
as
autism-spectrum
disorders
may
have
multi-system
effects
and
individuals
may
need
a
combination
of
different
accommodations.
Overall,
the
media
experience
for
people
on
the
autism
spectrum
should
be
customizable
and
well
designed
so
as
to
not
be
overwhelming.
Care
must
be
taken
to
present
a
media
experience
that
focuses
on
the
purpose
of
the
content
and
provides
alternative
content
in
a
clear,
concise
manner.
3.
Alternative
Content
Technologies
A
number
of
alternative
content
types
have
been
developed
to
help
users
with
sensory
disabilities
gain
access
to
audio-visual
content.
This
section
lists
them,
explains
generally
what
they
are,
and
provides
a
number
of
requirements
on
each
that
need
to
be
satisfied
with
technology
developed
in
HTML5
around
the
media
elements.
3.1
Described
video
Described
video
contains
descriptive
narration
of
key
visual
elements
designed
to
make
visual
media
accessible
to
people
who
are
blind
or
visually
impaired.
The
descriptions
include
actions,
costumes,
gestures,
scene
changes, or
any
other
important
visual
information
that
someone
who
cannot
see
the
screen
might
ordinarily
miss.
Descriptions
are
traditionally
audio
recordings
timed
and
recorded
to
fit
into
natural
pauses
in
the
program,
although
they
may
also
briefly
obscure
the
main
audio
track.
(See
the
section
on
extended
descriptions
for
an
alternative
approach.)
The
descriptions
are
usually
read
by
a
narrator
with
a
voice
that
cannot
be
easily
confused
with
other
voices
in
the
primary
audio
track.
They
are
authored
to
convey
objective
information
(e.g.,
a
yellow
flower)
rather
than
subjective
judgments
(e.g.,
a
beautiful
flower).
As
with
captions,
descriptions
can
be
open
or
closed.
-
Open
descriptions
are
merged
with
the
program-audio
track
and
cannot
be
turned
off
by
the
viewer.
-
Closed
descriptions
can
be
turned
on
and
off
by
the
viewer.
They
can
be
recorded
as
a
separate
track
containing
descriptions
only,
timed
to
play
at
specific
spots
in
the
timeline
and
played
in
parallel
with
the
program-audio
track.
-
Some
descriptions
can
be
delivered
as
a
separate
audio
channel
mixed
in
at
the
player.
-
Other
options
include
a
computer-generated
‘text
to
speech’
track,
also
known
as
text
video
descriptions.
This
is
described
in
the
next
subsection.
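For illustration only: a closed description track delivered as a separate audio file can be slaved to the main video using the HTML5 media API. The following sketch (file names and the drift threshold are assumptions, not part of this document) also gives the description recording its own volume control, anticipating requirements [DV-5], [DV-6], and [DV-11] below.

```html
<!-- Sketch: closed descriptions as a separately hosted audio file,
     slaved to the main video (file names are hypothetical). -->
<video id="program" src="lecture.webm" controls></video>
<audio id="desc" src="lecture-descriptions.ogg"></audio>
<label>Description volume
  <input id="descVol" type="range" min="0" max="1" step="0.05" value="1">
</label>
<script>
  var video = document.getElementById('program');
  var desc  = document.getElementById('desc');

  // The media resource is the timebase master: the description track
  // starts, stops, and seeks whenever the main video does.
  video.addEventListener('play', function () {
    desc.currentTime = video.currentTime;
    desc.play();
  });
  video.addEventListener('pause', function () { desc.pause(); });
  video.addEventListener('seeked', function () {
    desc.currentTime = video.currentTime;
  });

  // Correct small drift between the two clocks during playback.
  video.addEventListener('timeupdate', function () {
    if (Math.abs(desc.currentTime - video.currentTime) > 0.3) {
      desc.currentTime = video.currentTime;
    }
  });

  // Independent volume for the description track ([DV-6]).
  document.getElementById('descVol').addEventListener('input', function (e) {
    desc.volume = Number(e.target.value);
  });
</script>
```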
Described
video
provides
benefits
that
reach
beyond
blind
or
visually
impaired
viewers;
e.g.,
students
grappling
with
difficult
materials
or
concepts.
Descriptions
can
be
used
to
give
supplemental
information
about
what
is
on
screen—the
structure
of
lengthy
mathematical
equations
or
the
intricacies
of
a
painting,
for
example.
Described
video
is
available
on
some
television
programs
and
in
many
movie
theaters
in
the
U.S.
and
other
countries.
Regulations
in
the
U.S.
and
Europe
are
increasingly
focusing
on
description,
especially
for
television,
reflecting
its
priority
with
citizens
who
have
visual
impairments.
The
technology
needed
to
deliver
and
render
basic
video
descriptions
is
in
fact
relatively
straightforward,
being
an
extension
of
common
audio-processing
solutions.
Playback
products
must
support
multi-audio
channels
required
for
description,
and
any
product
dealing
with
broadcast
TV
content
must
provide
adequate
support
for
descriptions.
Descriptions
can
also
provide
text
that
can
be
indexed
and
searched.
Systems
supporting
described
video
that
are
not
open
descriptions
must:
[DV-1]
Provide
an
indication
that
descriptions
are
available,
and
are
active/non-active.
[DV-2]
Render
descriptions
in
a
time-synchronized
manner,
using
the
media
resource
as
the
timebase
master.
[DV-3]
Support
multiple
description
tracks
(e.g.,
discrete
tracks
containing
different
levels
of
detail).
[DV-4]
Support
recordings
of
real
human
speech
as
a
track
of
the
media
resource,
or
as
an
external
file.
[DV-5]
Allow
the
author
to
independently
adjust
the
volumes
of
the
audio
description
and
original
soundtracks.
[DV-6]
Allow
the
user
to
independently
adjust
the
volumes
of
the
audio
description
and
original
soundtracks,
with
the
user's
settings
overriding
the
author's.
[DV-7]
Permit
smooth
changes
in
volume
rather
than
stepped
changes.
The
degree
and
speed
of
volume
change
should
be
under
user
control.
[DV-8]
Allow
the
author
to
provide
fade
and
pan
controls
to
be
accurately
synchronized
with
the
original
soundtrack.
[DV-9]
Allow
the
author
to
use
a
codec
which
is
optimized
for
voice
only,
rather
than
requiring
the
same
codec
as
the
original
soundtrack.
[DV-10]
Allow
the
user
to
select
from
among
different
languages
of
descriptions,
if
available,
even
if
they
are
different
from
the
language
of
the
main
soundtrack.
[DV-11]
Support
the
simultaneous
playback
of
both
the
described
and
non-described
audio
tracks
so
that
one
may
be
directed
at
separate
outputs
(e.g.,
a
speaker
and
headphones).
[DV-12]
Provide
a
means
to
prevent
descriptions
from
carrying
over
from
one
program
or
channel
when
the
user
switches
to
a
different
program
or
channel.
[DV-13]
Allow
the
user
to
relocate
the
description
track
within
the
audio
field,
with
the
user
setting
overriding
the
author
setting.
The
setting
should
be
re-adjustable
as
the
media
plays.
[DV-14]
Support
metadata,
such
as
copyright
information,
usage
rights,
language,
etc.
3.2
Text
video
description
Described
video
that
uses
text
for
the
description
source
rather
than
a
recorded
voice
creates
specific
requirements.
Text
video
descriptions
(TVDs)
are
delivered
to
the
client
as
text
and
rendered
locally
by
assistive
technology
such
as
a
screen
reader
or
a
braille
device.
This
can
have
advantages
for
screen-reader
users
who
want
full
control
of
the
preferred
voice
and
speaking
rate,
or
other
options
to
control
the
speech
synthesis.
Text
video
descriptions
are
provided
as
text
files
containing
start
times
for
each
description
cue.
Since
the
duration
that
a
screen
reader
takes
to
read
out
a
description
cannot
be
determined
during
authoring
of
the
cues,
it
is
difficult
to
ensure
they
don't
obscure
the
main
audio
or
other
description
cues.
There are at least three likely reasons for this:
-
An
author
of
text
video
descriptions
does
not
have
a
screen
reader.
This
means
s/he
cannot
check
if
the
description
fits
within
the
time
frame.
Even
if
s/he
has
a
screen
reader,
a
user's
screen
reader
will
be
set
to
a
different
reading
speed
and
may
take
longer
to
read
the
same
sentence.
-
Some
screen-reader
users
(e.g.,
those
who
are
elderly
or
have
learning
disabilities)
may
slow
down
the
speech
rate.
-
A
visually
complicated
scene
(e.g.,
figures
on
a
blackboard
in
an
online
physics
class)
may
require
more
description
time
than
is
available
in
the
program-audio
track.
People
with
low vision
may
also
benefit
from
having
access
to
text
video
descriptions.
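As a minimal sketch of how such cues might reach a screen reader, the snippet below loads a descriptions track without rendering it and copies each active cue into an ARIA live region, leaving voice and speaking rate entirely to the user's own AT. The track file, and the visually-hidden class, are assumptions about one possible realization, not requirements of this document.

```html
<!-- Sketch: text video description cues handed to the user's screen
     reader via a live region (track file hypothetical; the
     "visually-hidden" class is assumed to clip the region offscreen). -->
<video id="v" src="physics-class.webm" controls>
  <track id="tvd" kind="descriptions" src="descriptions.vtt" srclang="en">
</video>
<div id="announcer" aria-live="assertive" class="visually-hidden"></div>
<script>
  var track = document.getElementById('tvd').track;
  track.mode = 'hidden';   // load cues and fire events without rendering

  track.addEventListener('cuechange', function () {
    var cue = track.activeCues[0];
    if (cue) {
      // The user's own screen reader voices the text, at the user's
      // preferred voice and speaking rate.
      document.getElementById('announcer').textContent = cue.text;
    }
  });
</script>
```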
Systems
supporting
text
video
descriptions
must:
[TVD-1]
Support
presentation
of
text
video
descriptions
through a screen reader, braille device, and/or modified print, with playback speed control, voice control, and synchronization points within the video.
[TVD-2]
TVDs
need
to
be
provided
in
a
format
that
contains
the
following
information:
-
start
time,
text
per
description
cue
(the
duration
is
determined
dynamically,
though
an
end
time
could
provide
a
cut
point)
-
possibly
a
speech-synthesis
markup
to
improve
quality
of
the
description
(existing
speech
synthesis
markup languages include SSML and the CSS3 Speech Module)
-
accompanying
metadata
providing
labeling
for
speakers,
language,
etc.
and
-
visual
style
markup
(see the section on Captioning).
[TVD-3]
Where
possible,
provide
a
text
or
separate
audio
track
privately
to
those
that
need
it
in
a
mixed-viewing
situation,
e.g.,
through
headphones.
[TVD-4]
Where
possible,
provide
options
for
authors
and
users
to
deal
with
the
overflow
case:
continue
reading,
stop
reading,
and
pause
the
video.
(One
solution
from
a
user's
point
of
view
may
be
to
pause
the
video
and
finish
reading
the
TVD,
for
example.)
User
preference
should
override
authored
option.
[TVD-5]
Support
the
control
over
speech-synthesis
playback
speed,
volume
and
voice,
and
provide
synchronization
points
with
the
video.
3.3
Extended
video
descriptions
Video
descriptions
are
usually
provided
as
recorded
speech,
timed
to
play
in
the
natural
pauses
in
dialog
or
narration.
In
some
types
of
material,
however,
there
is
not
enough
time
to
present
sufficient
descriptions.
To
meet
such
cases,
the
concept
of
extended
description
was
developed.
Extended
descriptions
work
by
pausing
the
video
and
program
audio
at
key
moments,
playing
a
longer
description
than
would
normally
be
permitted,
and
then
resuming
playback
when
the
description
is
finished
playing.
This
will
naturally
extend
the
timeline
of
the
entire
presentation.
This
procedure
has
not
been
possible
in
broadcast
television;
however,
hard-disk
recording
and
on-demand
Internet
systems
can
make
this
a
practical
possibility.
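A minimal sketch of this pause-and-resume behavior on an on-demand system, using the Web Speech API as one possible way to voice the description (file names are hypothetical):

```html
<!-- Sketch: extended descriptions that pause the program while a long
     description is spoken, then resume ([EVD-2], [EVD-3] below). -->
<video id="v" src="documentary.webm" controls>
  <track id="evd" kind="descriptions" src="extended-descriptions.vtt" srclang="en">
</video>
<script>
  var video = document.getElementById('v');
  var track = document.getElementById('evd').track;
  track.mode = 'hidden';

  track.addEventListener('cuechange', function () {
    var cue = track.activeCues[0];
    if (!cue) { return; }
    video.pause();                            // halt video and program audio
    var u = new SpeechSynthesisUtterance(cue.text);
    u.onend = function () { video.play(); };  // resume when finished
    speechSynthesis.speak(u);
  });
</script>
```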
Extended
video
description
(EVD)
has
been
reported
to
have
benefits
for
cognitive
disabilities;
for
example,
it
might
benefit
people
with
Asperger
Syndrome
and
other
Autistic
Spectrum
Disorders,
in
that
it
can
make
connections
between
cause
and
effect,
point
out
what
is
important
to
look
at,
or
explain
moods
that
might
otherwise
be
missed.
Systems
supporting
extended
audio
descriptions
must:
[EVD-1]
Support
detailed
user
control
as
specified
in
[TVD-4]
for
extended
video
descriptions.
[EVD-2]
Support
automatically
pausing
the
video
and
main
audio
tracks
in
order
to
play
a
lengthy
description.
[EVD-3]
Support
resuming
playback
of
video
and
main
audio
tracks
when
the
description
is
finished.
Because
the
user
is
the
ultimate
arbiter
of
the
rate
at
which
TTS
playback
occurs,
it
is
not
feasible
for
an
author
to
guarantee
that
any
texted
audio
description
can
be
played
within
the
natural
pauses
in
dialog
or
narration
of
the
primary
audio
resource.
Therefore,
all
texted
descriptions
must
be
treated
as
extended
text
descriptions
potentially
requiring
the
pausing
and
resumption
of
primary
resource
playback.
3.4
Clean
audio
A
relatively
recent
development
in
television
accessibility
is
the
concept
of
clean
audio,
which
takes
advantage
of
the
increased
adoption
of
multichannel
audio.
This
is
primarily
aimed
at
audiences
who
are
hard
of
hearing,
and
consists
of
isolating
the
audio
channel
containing
the
spoken
dialog
and
important
non-speech
information
that
can
then
be
amplified
or
otherwise
modified,
while
other
channels
containing
music
or
ambient
sounds
are
attenuated.
Using
the
isolated
audio
track
may
make
it
possible
to
apply
more
sophisticated
audio
processing
such
as
pre-emphasis
filters,
pitch-shifting,
and
so
on
to
tailor
the
audio
to
the
user's
needs,
since
hearing
loss
is
typically
frequency-dependent,
and
the
user
may
have
usable
hearing
in
some
bands
yet
none
at
all
in
others.
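As a sketch of what such processing could look like with separately delivered tracks, the snippet below boosts a dialog channel, attenuates an ambience channel, and inserts an example pre-emphasis filter via the Web Audio API. File names, gain values, and filter settings are illustrative assumptions; keeping the elements in sync would reuse the pattern shown in the described-video sketch above.

```html
<!-- Sketch: "clean audio" mixing at the player. Dialog is boosted and
     band-emphasized; ambience is attenuated. All values are examples. -->
<video id="v" src="drama-video-only.webm" controls></video>
<audio id="dialog" src="drama-dialog.ogg"></audio>
<audio id="ambience" src="drama-ambience.ogg"></audio>
<script>
  var ctx = new AudioContext();
  var dialog = ctx.createMediaElementSource(document.getElementById('dialog'));
  var ambience = ctx.createMediaElementSource(document.getElementById('ambience'));

  var dialogGain = ctx.createGain();
  dialogGain.gain.value = 1.5;      // amplify speech ([CA-3])
  var ambienceGain = ctx.createGain();
  ambienceGain.gain.value = 0.3;    // attenuate music/ambient sound

  // Example pre-emphasis: lift a band where this user has usable hearing.
  var emphasis = ctx.createBiquadFilter();
  emphasis.type = 'peaking';
  emphasis.frequency.value = 2000;  // Hz - a user-specific assumption
  emphasis.gain.value = 6;          // dB

  dialog.connect(emphasis);
  emphasis.connect(dialogGain);
  dialogGain.connect(ctx.destination);
  ambience.connect(ambienceGain);
  ambienceGain.connect(ctx.destination);
</script>
```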
Systems
supporting
clean
audio
and
multiple
audio
tracks
must:
[CA-1]
Support
clean
audio
as
a
separate,
alternative
audio
track
from
other
audio-based
alternative
media
resources,
including
the
primary
audio
resource.
[CA-2]
Support
the
synchronization
of
multitrack
audio
either
within
the
same
file
or
from
separate
files
-
preferably
both.
[CA-3]
Support
separate
volume
control
of
the
different
audio
tracks.
[CA-4]
Support
pre-emphasis
filters,
pitch-shifting,
and
other
audio-processing
algorithms.
3.5
Content
navigation
by
content
structure
Most
people
are
familiar
with
fast
forward
and
rewind
in
media
content.
However,
because
they
progress
through
content
based
only
on
time,
fast
forward
and
rewind
are
ineffective
particularly
when
the
content
is
being
used
for
purposes
other
than
entertainment.
People
with
disabilities
are
also
particularly
disadvantaged
if
forced
to
rely
solely
on
time-based
fast
forward
and
rewind
to
study
content.
Fortunately,
most
content
is
structured,
and
appropriate
markup
can
expose
this
structure
to
forward
and
rewind
controls:
-
Books
generally
have
chapters
and
perhaps
subsections
within
those
chapters.
They
also
have
structures
such
as
page
numbers,
side-bars,
tables,
footnotes,
tables
of
contents,
glossaries,
etc.
-
Short
music
selections
tend
to
have
verses
and
repeating
choruses.
-
Larger
classical-music
works
have
movements
which
can
be
divided
into
components
such
as
exposition,
development
and
recapitulation,
or
theme
and
variations.
-
Operas,
theatrical
plays,
and
movies
have
acts
and
scenes
within
those
acts.
-
Television
programs
generally
have
clear
divisions;
e.g.,
newscasts
have
individual
stories
usually
wrapped
within
larger structures
called
news,
weather,
or
sports.
-
A
lecturer
may
first
lay
out
a
topic,
then
consider
a
series
of
approaches
or
illustrative
examples,
and
finally
draw
a
conclusion.
This
is,
of
course,
a
DOM
hierarchical
view
of
content.
However,
effective
DOM
-based
navigation
of
a
multi-level
hierarchy
will
require
an
additional
control
not
typically
available
using
current
media
players.
This
real-time
mechanism,
which
we
are
calling
a
"granularity-level
control,"
will
allow
the
user
to
adjust
the
level
of
granularity
applied
to
"next"
and
"previous"
controls.
This
is
necessary
because
next
and
previous
are
too
cumbersome
if
accessing
every
DOM
node
in
a
complex
hierarchy,
but
unsatisfactorily
broad
and
coarse
if
set
to
only
the
top
level
of
the
hierarchy.
Allowing
the
user
to
adjust
the
DOM
granularity
level
that
next
and
previous
apply
to
has
proven
very
effective—hence
the
real-time
adjustable
granularity
level
control.
Two
examples
of
granularity
levels
1.
In
a
news
broadcast,
the
most
global
level
(analogous
to
<h1>)
might
be
the
category
called
"news,
weather,
and
sports."
The
second
level
(analogous
to
<h2>)
would
identify
each
individual
news
(or
sports)
story.
With
the
granularity
control
set
to
level
1,
"next"
and
"previous"
would
cycle
among
news,
weather,
and
sports.
Set
at
level
2,
it
would
cycle
among
individual
news
(or
sports)
stories.
2.
In
a
bilingual
audiobook-plus-e-text
production
of
Dante
Alighieri's
"La
Divina
Commedia,"
the
user
would
choose
whether
to
listen
to
the
original
medieval
Italian
or
its
modern-language
translation—possibly
toggling
between
them.
Meanwhile,
both
the
original
and
translated
texts
might
appear
on
screen,
with
both
the
original
and
translated
text
highlighted,
line
by
line,
in
sync
with
the
audio
narration.
-
The
most
global
(<h1>)
level
would
be
each
individual
book—
"Inferno,"
"Purgatorio,"
and
"Paradiso."
-
The
second
(<h2>)
level
would
be
each
individual
canto.
-
The
third
(<h3>)
level
would
be
each
individual
verso.
-
The
fourth
(<h4>)
level
would
be
each
individual
line
of
poetry.
With
granularity
set
at
level
1,
"next"
and
"previous"
would
cycle
among
the
three
books
of
"La
Divina
Commedia."
Set
at
level
2,
they
would
cycle
among
its
cantos,
at
level
3
among
its
versos,
and
at
level
4
among
the
individual
lines
of
poetry
text.
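A granularity-level control of this kind might be backed by a flat list of navigation points tagged with their hierarchy level, as in the following sketch (times and titles are illustrative, mirroring the example above):

```js
// Sketch: a granularity-level control over a flat navigation track whose
// points are tagged with hierarchy levels (times/titles illustrative).
var navPoints = [
  { level: 1, time: 0,     title: 'Inferno' },
  { level: 2, time: 0,     title: 'Canto I' },
  { level: 2, time: 410,   title: 'Canto II' },
  { level: 1, time: 14800, title: 'Purgatorio' }
  // ...
];
var granularity = 1;   // adjustable by the user in real time

function seekNext(video) {
  for (var i = 0; i < navPoints.length; i++) {
    var p = navPoints[i];
    // "Next" honors the current granularity: finer-grained points are skipped.
    if (p.level <= granularity && p.time > video.currentTime) {
      video.currentTime = p.time;
      return p.title;    // also announce the title to AT
    }
  }
  return null;           // no further point at this granularity
}
```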
Navigating
ancillary
content
There
is
a
kind
of
structure,
particularly
in
longer
media
resources,
which
requires
special
navigational
consideration.
While
present
in
the
media
resource,
it
does
not
fit
in
the
natural
beginning-to-end
progression
of
the
resource.
Its
consumption
tends
to
interrupt
this
natural
beginning-to-end
progression.
A
familiar
example
is
a
footnote
or
sidebar
in
a
book.
One
must
pause
reading
the
text
narrative
to
read
a
footnote
or
sidebar.
Yet
these
structures
are
important
and
might
require
their
own
alternative
media
renditions.
We
have
chosen
to
call
such
structures
"ancillary
content
structures."
Commercials,
news
briefs,
weather
updates,
etc.,
are
familiar
examples
from
television
programming.
While
so
prevalent
that
most
of
us
may
be
inured
to
them,
they
do
function
to
interrupt
the
primary
television
program.
Users
will
want
the
ability
to
navigate
past
these
ancillary
structures—or
perhaps
directly
to
them.
E-text-plus-audio
productions
of
titles
such
as
"La
Divina
Commedia,"
described
above,
may
well
include
reproductions
of
famous
frescoes
or
paintings
interspersed
throughout
the
text,
though
these
are
not
properly
part
of
the
text/content.
Such
illustrations
must
be
programmatically
discoverable
by
users.
They
also
need
to
be
described.
However,
the
user
needs
the
option
of
choosing
when
to
pause
for
that
interrupting
description.
One
current
HTML5
media-based
example
of
ancillary
content
is
the
Mozilla
Popcorn
JavaScript
library
and
API
which
can
be
further
explored
with
the
following
three
resources:
Mozilla
PopcornOpenVideoAPI
documentation
Mozilla’s
Popcorn
Project
Adds
Extra
Flavor
to
Web
Video
blog
post
Popcorn.js
script
library
Additional
note
Media
in
HTML5
will
be
used
heavily
and
broadly.
These
accessibility
user
requirements
will
often
find
broad
applicability.
Just
as
the
structures
introduced
particularly
by
nonfiction
titles
make
books
more
usable,
media
is
more
usable
when
its
inherent
structure
is
exposed
by
markup.
Markup-based
access
to
structure
is
critical
for
persons
with
disabilities
who
cannot
infer
structure
from
purely
presentational
cues.
Structural
navigation
has
proven
highly
effective
in
various
programs
of
electronic
book
publication
for
persons
with
print
disabilities.
Nowadays,
these
programs
are
based
on
the
ANSI/NISO
Z39.86
specifications.
Z39.86
structural
navigation
is
also
supported
by
e-publishing
industry
specifications.
The
user
can
navigate
along
the
timebase
using
a
continuous
scale,
and
by
relative
time
units
within
rendered
audio
and
animations
(including
video
and
animated
images)
that
last
three
or
more
seconds
at
their
default
playback
rate.
(UAAG
2.0
2.11.6)
The
user
can
navigate
by
semantic
structure
within
the
time-based
media,
such
as
by
chapters
or
scenes,
if
present
in
the
media
(UAAG
2.0
2.11.7).
Systems
supporting
content
navigation
must:
[CN-1]
Provide
a
means
to
structure
media
resources
so
that
users
can
navigate
them
by
semantic
content
structure,
e.g.,
through
adding
a
track
to
the
video
that
contains
navigation
markers
(in
table-of-contents
style).
This
means
must
allow
authors
to
identify
ancillary
content
structures,
which
may
be
a
hierarchical
structure.
Support
keeping
all
media
representations
synchronized
when
users
navigate.
[CN-2]
The
navigation
track
should
provide
for
hierarchical
structures
with
titles
for
the
sections.
[CN-3]
Support
both
global
navigation
by
the
larger
structural
elements
of
a
media
work,
and
also
the
most
localized
atomic
structures
of
that
work,
even
though
authors
may
not
have
marked-up
all
levels
of
navigational
granularity.
[CN-4]
Support
third-party
provided
structural
navigation
markup.
[CN-5]
Keep
all
content
representations
in
sync,
so
that
moving
to
any
particular
structural
element
in
media
content
also
moves
to
the
corresponding
point
in
all
provided
alternative
media
representations
(captions,
described
video,
transcripts,
etc.)
associated
with
that
work.
[CN-6]
Support
direct
access
to
any
structural
element,
possibly
through
URIs.
[CN-7]
Support
pausing
primary
content
traversal
to
provide
access
to
such
ancillary
content
in
line.
[CN-8]
Support
skipping
of
ancillary
content
in
order
to
not
interrupt
content
flow.
[CN-9]
Support
access
to
each
ancillary
content
item,
including
with
"next"
and
"previous"
controls,
apart
from
accessing
the
primary
content
of
the
title.
[CN-10]
Support
that
in
bilingual
texts
both
the
original
and
translated
texts
can
appear
on
screen,
with
both
the
original
and
translated
text
highlighted,
line
by
line,
in
sync
with
the
audio
narration.
3.6
Captioning
For
people
who
are
deaf
or
hard-of-hearing,
captioning
is
a
prime
alternative
representation
of
audio.
Captions
are
in
the
same
language
as
the
main
audio
track
and,
in
contrast
to
foreign-language
subtitles,
render
a
transcription
of
dialog
or
narration
as
well
as
important
non-speech
information,
such
as
sound
effects,
music,
and
laughter.
Historically,
captions
have
been
either
closed
or
open.
Closed
captions
have
been
transmitted
as
data
along
with
the
video
but
were
not
visible
until
the
user
elected
to
turn
them
on,
usually
by
invoking
an
on-screen
control
or
menu
selection.
Open
captions
have
always
been
visible;
they
had
been
merged
with
the
video
track
and
could
not
be
turned
off.
Ideally,
captions
should
be
a
verbatim
representation
of
the
audio;
however,
captions
are
sometimes
edited
for
various
reasons—
for
example,
for
reading
speed
or
for
language
level.
In
general,
consumers
of
captions
have
expressed
that
the
text
should
represent
exactly
what
is
in
the
audio
track.
If
edited
captions
are
provided,
then
they
should
be
clearly
marked
as
such,
and
the
full
verbatim
version
should
also
be
available
as
an
option.
The
timing
of
caption
text
can
coincide
with
the
mouth
movement
of
the
speaker
(where
visible),
but
this
is
not
strictly
necessary.
For
timing
purposes,
captions
may
sometimes
precede
or
extend
slightly
after
the
audio
they
represent.
Captioning
should
also
use
adequate
means
to
distinguish
between
speakers
as
turn-taking
occurs
during
conversation;
this
has
in
the
past
been
done
by
positioning
the
text
near
the
speaker,
by
associating
different
colors
to
different
speakers,
or
by
putting
the
name
and
a
colon
in
front
of
the
text
line
of
a
speaker.
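As one concrete illustration, the WebVTT format (one caption format in this space, not mandated by this document) expresses speaker identification with voice spans and placement with cue settings; note also that the second cue below starts exactly where the first ends, the gapless timing discussed in [CC-3] below:

```
WEBVTT

00:00:01.000 --> 00:00:04.000 line:70% align:start
<v Anna>I thought you'd left already.

00:00:04.000 --> 00:00:06.500 line:70% align:end
<v Ben>Not without you. [door slams]
```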
Captions
are
useful
to
a
wide
array
of
users
in
addition
to
their
originally
intended
audiences.
Gyms,
bars,
and
restaurants
regularly
employ
captions
as
a
way
for
patrons
to
watch
television
while
in
those
establishments.
People
learning
to
read
or
learning
the
language
of
the
country
where
they
live
as
a
second
language
also
benefit
from
captions:
research
has
shown
that
captions
help
reinforce
vocabulary
and
language.
Captions
can
also
provide
a
powerful
search
capability,
allowing
users
and
search
engines
to
search
the
caption
text
to
locate
a
specific
video
or
an
exact
point
in
a
video.
Formats
for
captions,
subtitles
or
foreign-language
subtitles
must:
[CC-1]
Render
text
in
a
time-synchronized
manner,
using
the
media
resource
as
the
timebase
master.
Note
Most
of
the
time,
the
main
audio
track
would
be
the
best
candidate
for
the
timebase.
Where
a
video
without
audio,
but
with
a
text
track,
is
available,
the
video
track
becomes
the
timebase
master.
Also,
there
may
be
situations
where
an
explicit
timing
track
is
available.
[CC-2]
Allow
the
author
to
specify
erasures,
i.e.,
times
when
no
text
is
displayed
on
the
screen
(no
text
cues
are
active).
Note
This
should
be
possible
both
within
media
resources
and
caption
formats.
[CC-3]
Allow
the
author
to
assign
timestamps
so
that
one
caption/subtitle
follows
another,
with
no
perceivable
gap
in
between.
Note
This
means
that
caption
cues
should
be
able
to
either
let
the
start
time
of
the
subsequent
cue
be
determined
by
the
duration
of
the
cue
or
have
the
end
time
be
implied
by
the
start
of
the
next
cue.
For
overlapping
captions,
explicit
start
and
end
times
are
then
required.
[CC-4]
Be
available
in
a
text
encoding.
Note
This
means
that
determined
character
encodings
should
be
supported
-
which
could
be
either
by
making
the
character
encoding
explicit
or
by
enforcing
a
single
default
one
such
as
UTF-8.
[CC-5]
Support
positioning
in
all
parts
of
the
screen
-
either
inside
the
media
viewport
or
possibly
in
a
determined
space
next
to
the
media
viewport.
This
is
particularly
important
when
multiple
captions
are
on
screen
at
the
same
time
and
relate
to
different
speakers,
or
when
in-picture
text
is
avoided.
Note
The
minimum
requirement
is
a
bounding
box
(with
an
optional
background)
into
which
text
is
flowed,
and
that
probably
needs
to
be
pixel
aligned.
The
absolute
position
of
text
within
the
bounding
box
is
less
critical,
although
it
is
important
to
be
able
to
avoid
bad
word-breaks
and
have
adequate
white
space
around
letters
and
so
on.
There
is
more
on
this
in
a
separate
requirement.
The
caption
format
could
provide
a
min-width/min-height
for
its
bounding
box,
which
typically
is
calculated
from
the
bottom
of
the
video
viewport,
but
can
be
placed
elsewhere
by
the
web
page,
with
the
web
page
being
able
to
make
that
box
larger
and
scale
the
text
relatively,
too.
The
positions
inside
the
box
should
probably
be
into
regions,
such
as
top,
right,
bottom,
left,
center.
[CC-6]
Support
the
display
of
multiple
regions
of
text
simultaneously.
Note
This
typically
relates
to
multiple
text
cues
that
are
defined
on
overlapping
times.
If
the
cues'
rendering
target
are
made
out
to
different
spatial
regions,
they
can
be
displayed
simultaneously.
[CC-7]
Display
multiple
rows
of
text
when
rendered
as
text
in
a
right-to-left
or
left-to-right
language.
Note
Internationalization
is
important
not
just
for
subtitles,
as
captions
can
be
used
in
all
languages.
[CC-8]
Allow
the
author
to
specify
line
breaks.
[CC-9]
Permit
a
range
of
font
faces
and
sizes.
[CC-10]
Render
a
background
in
a
range
of
colors,
supporting
a
full
range
of
opacity
levels.
[CC-11]
Render
text
in
a
range
of
colors.
Note
The
user
should
have
final
control
over
rendering
styles
like
color
and
fonts;
e.g.,
through
user
preferences.
[CC-12]
Enable
rendering
of
text
with
a
thicker
outline
or
a
drop
shadow
to
allow
for
better
contrast
with
the
background.
[CC-13]
Where
a
background
is
used,
it
is
preferable
to
keep
the
caption
background
visible
even
in
times
where
no
text
is
displayed,
such
that
it
minimizes
distraction.
However,
where
captions
are
infrequent
the
background
should
be
allowed
to
disappear
to
enable
the
user
to
see
as
much
of
the
underlying
video
as
possible.
Note
It
may
be
technically
possible
to
have
cues
without
text.
[CC-14]
Allow
the
use
of
mixed
display
styles—
e.g.,
mixing
paint-on
captions
with
pop-on
captions—
within
a
single
caption
cue
or
in
the
caption
stream
as
a
whole.
Pop-on
captions
are
usually
one
or
two
lines
of
captions
that
appear
on
screen
and
remain
visible
for
one
to
several
seconds
before
they
disappear.
Paint-on
captions
are
individual
characters
that
are
"painted
on"
from
left
to
right,
not
popped
onto
the
screen
all
at
once,
and
usually
are
verbatim.
Another
often-used
caption
style
in
live
captioning
is
roll-up
-
here,
cue
text
follows
double
chevrons
("greater
than"
symbols),
which are
used
to
indicate
different
speaker
identifications.
Each
sentence
"rolls
up"
to
about
three
lines.
The
top
line
of
the
three
disappears
as
a
new
bottom
line
is
added,
allowing
the
continuous
rolling
up
of
new
lines
of
captions.
Note
Similarly,
in
karaoke,
individual
characters
are
often
"painted
on".
[CC-15]
Support
positioning
such
that
the
lowest
line
of
captions
appears
at
least
1/12
of
the
total
screen
height
above
the
bottom
of
the
screen,
when
rendered
as
text
in
a
right-to-left
or
left-to-right
language.
[CC-16]
Use
conventions
that
include
inserting
left-to-right
and
right-to-left
segments
within
a
vertical
run
(e.g.
Tate-chu-yoko
in
Japanese),
when
rendered
as
text
in
a
top-to-bottom
oriented
language.
[CC-17]
Represent
content
of
different
natural
languages.
In
some
cases
a
few
foreign
words
form
part
of
the
original
soundtrack,
and
thus
need
to
be
in
the
same
caption
resource.
Also
allow
for
separate
caption
files
for
different
languages
and
on-the-fly
switching
between
them.
This
is
also
a
requirement
for
subtitles.
Note
Caption/subtitle
files
that
are
alternatives
in
different
languages
are
probably
best
provided
in
different
caption
resources
and
are
user
selectable.
Realistically,
having
no
more
than
2
languages
present
at
the
same
time
on
screen
is
probably
the
limit.
[CC-18]
Represent
content
of
at
least
those
specific
natural
languages
that
may
be
represented
with
[Unicode
3.2],
including
common
typographical
conventions
of
that
language
(e.g.,
through
the
use
of
furigana
and
other
forms
of
ruby
text).
[CC-19]
Present
the
full
range
of
typographical
glyphs,
layout
and
punctuation
marks
normally
associated
with
the
natural
language's
print-writing
system.
[CC-20]
Permit
in-line
mark-up
for
foreign
words
or
phrases.
Note
Italics
markup
may
be
sufficient
for
a
human
user,
but
it
is
important
to
be
able
to
mark
up
languages
so
that
the
text
can
be
rendered
correctly,
since
the
same
Unicode
can
be
shared
between
languages
and
rendered
differently
in
different
contexts.
This
is
mainly
an
I18n
localization
issue.
It
is
also
important
for
audio
rendering,
to
get
correct
pronunciation.
[CC-21]
Permit
the
distinction
between
different
speakers.
Further,
systems
that
support
captions
must:
[CC-22]
Support
captions
that
are
provided
inside
media
resources
as
tracks,
or
in
external
files.
Note
It
is
desirable
to
expose
the
same
API
to
both.
[CC-23]
Ascertain
that
captions
are
displayed
in
sync
with
the
media
resource.
[CC-24]
Support
user
activation/deactivation
of
caption
tracks.
Note
This
requires
a
menu
of
some
sort
that
displays
the
available
tracks
for
activation/deactivation.
[CC-25]
Support
edited
and
verbatim
captions,
if
available.
Note
Edited
and
verbatim
captions
can
be
provided
in
two
different
caption
resources.
There
is
a
need
to
expose
to
the
user
how
they
differ,
similar
to
how
there
can
be
caption
tracks
in
different
languages.
[CC-26]
Support
multiple
tracks
of
foreign-language
subtitles
in
different
languages.
Note
These
different-language
"tracks"
can
be
provided
in
different
resources.
[CC-27]
Support
live-captioning
functionality.
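The menu called for in the note to [CC-24] can be sketched with the HTML5 TextTrack API: enumerate the available caption and subtitle tracks, and toggle their mode between showing and disabled. The select element is an assumed piece of page UI, not a specified control.

```html
<select id="captionMenu"></select>
<script>
  // Sketch: build a caption/subtitle menu from the available text tracks
  // and let the user switch between them ([CC-24], [CC-26]).
  var video = document.querySelector('video');
  var menu = document.getElementById('captionMenu');

  for (var i = 0; i < video.textTracks.length; i++) {
    var t = video.textTracks[i];
    if (t.kind === 'captions' || t.kind === 'subtitles') {
      var option = document.createElement('option');
      option.value = i;
      option.textContent = t.label + ' (' + t.language + ')';
      menu.appendChild(option);
    }
  }

  menu.addEventListener('change', function () {
    for (var i = 0; i < video.textTracks.length; i++) {
      var t = video.textTracks[i];
      if (t.kind === 'captions' || t.kind === 'subtitles') {
        t.mode = (i === Number(menu.value)) ? 'showing' : 'disabled';
      }
    }
  });
</script>
```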
3.7
Enhanced
captions/subtitles
Enhanced
captions
are
timed
text
cues
that
have
been
enriched
with
further
information
-
examples
are
glossary
definitions
for
acronyms
and
other
initialisms,
foreign
terms
(for
example,
Latin),
jargon
or
descriptions
for
other
difficult
language.
They
may
be
age-graded,
so
that
multiple
caption
tracks
are
supplied,
or
the
glossary
function
may
be
added
dynamically
through
machine
lookup.
Glossary
information
can
be
added
in
the
normal
time
allotted
for
the
cue
(e.g.,
as
a
callout
or
other
overlay),
or
it
might
take
the
form
of
a
hyperlink
that,
when
activated,
pauses
the
main
content
and
allows
access
to
more
complete
explanatory
material.
Such
extensions
can
provide
important
additional
information
to
the
content
that
will
enable
or
improve
the
understanding
of
the
main
content
for
users
of
assistive
technology.
Enhanced
text
cues
will
be
particularly
useful
for
those
with
restricted
reading
skills,
to
subtitle
users,
and
to
caption
users.
Users
may
often
come
across
keywords
in
text
cues
that
lend
themselves
to
further
in-depth
information
or
hyperlinks,
such
as
an
e-mail
contact
or
phone
number
for
a
person,
a
strange
term
that
needs
a
Wikipedia
link
to
a
definition,
or
an
idiom
that
needs
comments
to
explain
it
to
a
foreign-language
speaker.
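A minimal sketch of the hyperlink variant: a glossary control inside a rendered cue that pauses the timeline while the user reads, then resumes on request. The overlay element and markup are assumptions about one possible player design.

```html
<script>
  // Sketch: an enhanced caption whose glossary control pauses the media
  // while the user reads the explanation, then resumes on request.
  // "captionOverlay" is an assumed caption-rendering element in the UI.
  var video = document.querySelector('video');
  var overlay = document.getElementById('captionOverlay');

  function showEnhancedCue(cueText, glossaryHtml) {
    overlay.innerHTML = cueText + ' <button class="gloss">?</button>';
    overlay.querySelector('.gloss').addEventListener('click', function () {
      video.pause();   // halt the timeline while the user reads
      overlay.innerHTML = glossaryHtml +
        ' <button class="resume">Resume</button>';
      overlay.querySelector('.resume').addEventListener('click', function () {
        video.play();
      });
    });
  }
</script>
```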
Systems
that
support
enhanced
captions
must:
[ECC-1]
Support
metadata
markup
for
(sections
of)
timed
text
cues.
Note
Such
"metadata"
markup
can
be
realized
through
a
@title
attribute
on
a
<span>
of
the
text,
or
a
hyperlink
to
another
location
where
a
term
is
explained,
an
<abbr>
element,
an
<acronym>
element,
a
<dfn>
element,
or
through
RDFa
or
microdata.
[ECC-2]
Support
hyperlinks
and
other
activation
mechanisms
for
supplementary
data
for
(sections
of)
caption
text.
Note
This
can
be
realized
through
inclusion
of
links
or
buttons
into
timed
text
cues,
where
additional
overlays
could
be
created
or
a
different
page
be
loaded.
One
needs
to
deal
here
with
the
need
to
pause
the
media
timeline
for
reading
of
the
additional
information.
[ECC-3]
Support
text
cues
that
may
be
longer
than
the
time
available
until
the
next
text
cue
and
thus
provide
overlapping
text
cues
-
in
this
case,
a
feature
should
be
provided
to
decide
if
overlap
is
ok
or
should
be
cut
or
the
media
resource
be
paused
while
the
caption
is
displayed.
Timing
would
be
provided
by
the
author,
but
with
the
user
being
able
to
override
it.
Note
This
feature
is
analogous
to
extended
video
descriptions
-
where
timing
for
a
text
cue
is
longer
than
the
available
time
for
the
cue,
it
may
be
necessary
to
halt
the
media
to
allow
for
more
time
to
read
back
on
the
text
and
its
additional
material.
In
this
case,
the
pause
is
dependent
on
the
user's
reading
speed,
so
this
may
imply
user
control
or
timeouts.
[ECC-4]
It
needs
to
be
possible
to
define
timed
text
cues
that
are
allowed
to
overlap
with
each
other
in
time
and
be
present
on
screen
at
the
same
time
(e.g.,
those
that
come
from
speech
of
different
speakers),
and
cues that are not allowed to overlap and thus cause media playback to pause
to
allow
users
to
catch
up
with
their
reading.
This
could
be
realized
through
a
hint
on
the
text
cue
or
even
for
a
whole
track.
[ECC-5]
Allow
users
to
define
the
reading
speed
and
thus
define
how
long
each
text
cue
requires,
and
whether
media
playback
needs
to
pause
sometimes
to
let
them
catch
up
on
their
reading.
Note
This
can
be
a
setting
in
the
UA,
which
will
define
user-interface
behavior.
3.8
Sign
translation
Sign
language
shares
the
same
concept
as
captioning:
it
presents
both
speech
and
non-speech
information
in
an
alternative
format.
Note
that
due
to
the
wide
regional
variation
in
signing
systems
(e.g.,
American
Sign
Language
vs
British
Sign
Language),
sign
translation
may
not
be
appropriate
for
content
with
a
global
audience
unless
localized
variants
can
be
made
available.
Signing
can
be
open,
mixed
with
the
video
and
offered
as
an
entirely
alternative stream,
or
closed
(using
some
form
of
picture-in-picture
or
alpha-blending
technology).
It
is
possible
to
use
quite
low
bit
rates
for
much
of
the
signing
track,
but
it
is
important
that
facial,
arm,
hand
and
other
body
gestures
be
delivered
at
sufficient
resolution
to
support
legibility.
Animated
avatars
may
not
currently
be
sufficient
as
a
substitute
for
human
signers,
although
research
continues
in
this
area
and
it
may
become
practical
at
some
point
in
the
future.
Acknowledging
that
not
all
devices
will
be
capable
of
handling
multiple
video
streams,
this
is
a
SHOULD
requirement
for
browsers
where
hardware
is
capable
of
support.
Strong
authoring
guidance
for
content
creators
will
mitigate
situations
where
user-agents
are
unable
to
support
multiple
video
streams
(WCAG)
-
for
example,
on
mobile
devices
that
cannot
support
multiple
streams,
authors
should
be
encouraged
to
offer
two
versions
of
the
media
stream,
including
one
with
signed
captions
burned
into
the
media.
Selecting
from
multiple
tracks
for
different
sign
languages
should
be
achieved
in
the
same
fashion
that
multiple
caption/subtitle
files
are
handled.
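Where a device can decode two streams, a parallel sign-language video can be slaved to the main resource with the same synchronization pattern used for separate description audio; a minimal sketch (file names hypothetical, drift correction omitted for brevity):

```html
<!-- Sketch: a sign-language track as parallel video, slaved to the main
     resource ([SL-2], [SL-3] below). -->
<video id="main" src="lecture.webm" controls></video>
<video id="signing" src="lecture-asl.webm" muted></video>
<script>
  var main = document.getElementById('main');
  var signing = document.getElementById('signing');

  main.addEventListener('play', function () {
    signing.currentTime = main.currentTime;
    signing.play();
  });
  main.addEventListener('pause', function () { signing.pause(); });
  main.addEventListener('seeked', function () {
    signing.currentTime = main.currentTime;
  });
</script>
```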
Systems
supporting
sign
language
must:
[SL-1]
Support
sign-language
video
either
as
a
track
as
part
of
a
media
resource
or
as
an
external
file.
[SL-2]
Support
the
synchronized
playback
of
the
sign-language
video
with
the
media
resource.
[SL-3]
Support
the
display
of
sign-language
video
either
as
picture-in-picture
or
alpha-blended
overlay,
as
parallel
video,
or
as
the
main
video
with
the
original
video
as
picture-in-picture
or
alpha-blended
overlay.
Parallel
video
here
means
two
discrete
videos
playing
in
sync
with
each
other.
It
is
preferable
to
have
one
discrete
<video>
element
contain
all
pieces
for
sync
purposes
rather
than
specifying
multiple
<video>
elements
intended
to
work
in
sync.
[SL-4]
Support
multiple
sign-language
tracks
in
several
sign
languages.
[SL-5]
Support
the
interactive
activation/deactivation
of
a
sign-language
track
by
the
user.
3.9
Transcripts
While
synchronized
captions
are
generally
preferable
for
people
with
hearing
impairments,
for
some
users
they
are
not
viable
–
those
who
are
deaf-blind,
for
example,
or
those
with
cognitive
or
reading
impairments
that
make
it
impossible
to
follow
synchronized
captions.
And
even
with
ordinary
captions,
it
is
possible
to
miss
some
information
as
the
captions
and
the
video
require
two
separate
loci
of
attention.
The
full
transcript
supports
different
user
needs
and
is
not
a
replacement
for
captioning.
A
transcript
can
be
presented
simultaneously
with
the
media
material,
which
can
assist
slower
readers
or
those
who
need
more
time
to
reference
context,
but
it
should
also
be
made
available
independently
of
the
media.
A
full
text
transcript
should
include
information
that
would
be
in
both
the
caption
and
video
description,
so
that
it
is
a
complete
representation
of
the
material,
as
well
as
containing
any
interactive
options.
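One way a user agent or page script could derive a static, AT-readable transcript is to load the timed text tracks without rendering them and copy the cues into the page in timeline order. The sketch below uses a caption track only; a full transcript would merge description cues as well. File names and the container element are assumptions.

```html
<video id="v" src="talk.webm" controls>
  <track id="caps" kind="captions" src="talk-captions.vtt" srclang="en" default>
</video>
<div id="transcript"></div>
<script>
  // Sketch: copy loaded cues into the page in timeline order so the user
  // can read at their own pace, independent of playback ([T-1], [T-2]).
  var trackEl = document.getElementById('caps');
  trackEl.track.mode = 'hidden';   // force the cues to load without rendering

  trackEl.addEventListener('load', function () {
    var cues = trackEl.track.cues;
    var container = document.getElementById('transcript');
    for (var i = 0; i < cues.length; i++) {
      var p = document.createElement('p');
      p.textContent = cues[i].text;
      container.appendChild(p);
    }
  });
</script>
```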
Systems
supporting
transcripts
must:
[T-1]
Support
the
provisioning
of
a
full
text
transcript
for
the
media
asset
in
a
separate
but
linked
resource,
where
the
linkage
is
programmatically
accessible
to
AT.
[T-2]
Support
the
provisioning
of
both
scrolling
and
static
display
of
a
full
text
transcript
with
the
media
resource,
e.g.,
in
an
area
next
to
the
video
or
underneath
the
video,
which
is
also
AT
accessible.
[T-3]
Allow
the
user
to
customize
the
visual
rendering
of
the
full
text
transcript,
e.g.,
font,
font
size,
foreground
and
background
color,
line,
letter,
and
word
spacing.
4.
System
Requirements
4.2
Granularity
level
control
for
structural
navigation
As
explained
in
"Content
Navigation"
above,
a
real-time
control
mechanism
must
be
provided
for
adjusting
the
granularity
of
the
specific
structural
navigation
points "next" and "previous."
Users
must
be
able
to
set
the
range/scope
of
next
and
previous
in
real
time.
[CNS-1]
All
identified
structures,
including
ancillary
content
as
defined
in
"Content
Navigation"
above,
must
be
accessible
with
the
use
of
"next"
and
"previous,"
as
refined
by
the
granularity
control.
[CNS-2]
Users
must
be
able
to
discover,
skip,
play-in-line,
or
directly
access
ancillary
content
structures.
[CNS-3]
Users
need
to
be
able
to
access
the
granularity
control
using
any
input
mode,
e.g.,
keyboard,
speech,
pointer,
etc.
[CNS-4]
Producers
and
authors
may
optionally
provide
additional
access
options
to
identified
structures,
such
as
direct
access
to
any
node
in
a
table
of
contents.
4.3
Time-scale
modification
While
all
devices
may
not
support
the
capability,
a
standard
control
API
must
support
the
ability
to
speed
up
or
slow
down
content
presentation
without
altering
audio
pitch.
Note
While
perhaps
unfamiliar
to
some,
this
feature
has
been
present
on
many
devices,
especially
audiobook
players,
for
some
20
years
now.
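With today's HTML5 media API this maps naturally onto the playbackRate and preservesPitch properties; a minimal sketch of the control points required below (the clamping range comes from [TSM-1]):

```js
// Sketch: time-scale modification hooks over the HTML5 media API.
var video = document.querySelector('video');

function setRate(rate) {
  // Clamp to the 50%-250% range required by [TSM-1] below.
  video.playbackRate = Math.min(2.5, Math.max(0.5, rate));
  // Most engines preserve pitch by default; setting the property (where
  // implemented) makes the [TSM-2] behavior explicit.
  if ('preservesPitch' in video) { video.preservesPitch = true; }
}

function resetRate() { video.playbackRate = 1.0; }   // [TSM-4]: back to 100%
```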
The
user
can
adjust
the
playback
rate
of
prerecorded
time-based
media
content,
such
that
all
of
the
following
are
true
(UAAG
2.0
2.11.4):
[TSM-1]
The
user
can
adjust
the
playback
rate
of
the
time-based
media
tracks
to
between
50%
and
250%
of
real
time.
[TSM-2]
Speech
whose
playback
rate
has
been
adjusted
by
the
user
maintains
pitch
in
order
to
limit
degradation
of
the
speech
quality.
[TSM-3]
All
provided
alternative
media
tracks
remain
synchronized
across
this
required
range
of
playback
rates.
[TSM-4]
The
user
agent
provides
a
function
that
resets
the
playback
rate
to
normal
(100%).
[TSM-5]
The
user
can
stop,
pause,
and
resume
rendered
audio
and
animation
content
(including
video
and
animated
images)
that
last
three
or
more
seconds
at
their
default
playback
rate.
(UAAG
2.0
2.11.5)
4.4
Production
practice
and
resulting
requirements
One
of
the
biggest
challenges
to
date
has
been
the
lack
of
a
universal
system
for
media
access.
In
response
to
user
requirements
various
countries
and
groups
have
defined
systems
to
provide
accessibility,
especially
captioning
for
television.
However,
these
systems
are
typically
not
compatible.
In
some
cases
the
formats
can
be
inter-converted,
but
some
formats
—
for
example
DVD
sub-pictures
—
are
image
based
and
are
difficult
to
convert
to
text.
Caption
formats
are
often
geared
towards
delivery
of
the
media,
for
example
as
part
of
a
television
broadcast.
They
are
not
well
suited
to
the
production
phases
of
media
creation.
Media
creators
have
developed
their
own
internal
formats
which
are
more
amenable
to
the
editing
phase,
but
to
date
there
has
been
no
common
format
that
allows
interchange
of
this
data.
Any
media-based solution
should
attempt
to
reduce
as
far
as
possible
layers
of
translation
between
production
and
delivery.
In
general
captioners
use
a
proprietary
workstation
to
prepare
caption
files;
these
can
often
export
to
various
standard
broadcast
ingest
formats,
but
in
general
files
are
not
inter-convertible.
Most
video
editing
suites
are
not
set
up
to
preserve
captioning,
and
so
this
has
typically
to
be
added
after
the
final
edit
is
decided
on;
furthermore
since
this
work
is
often
outsourced,
the
copyright
holder
may
not
hold
the
final
editable
version
of
the
captions.
Thus
when
programming
is
later
re-purposed,
e.g.
a
shorter
edit
is
made,
or
a
‘director’s
cut’
produced,
the
captioning
may
have
to
be
redone
in
its
entirety.
Similarly,
and
particularly
for
news
footage,
parts
of
the
media
may
go
to
web
before
the
final
TV
edit
is
made,
and
thus
the
captions
that
are
produced
for
the
final
TV
edit
are
not
available
for
the
web
version.
It
is
important
when
purchasing
or
commissioning
media,
that
captioning
and
described
video
are taken into account and made an equal
priority
in
terms
of
ownership,
rights
of
use,
etc.,
as
the
video
and
audio
itself.
This
is
primarily
an
authoring
requirement.
It
is
understood
that
a
common
time-stamp
format
must
be
declared
in
HTML5,
so
that
authoring
tools
can
conform
to
a
required
output.
Systems
supporting
accessibility
needs
for
media
must:
[PP-1]
Support
existing
production
practice
for
alternative
content
resources,
in
particular
allow
for
the
association
of
separate
alternative
content
resources
to
media
resources.
Browsers
cannot
support
all
forms
of
time-stamp
formats
out
there,
just
as
they
cannot
support
all
forms
of
image
formats
(etc.).
This
necessitates
a
clear
and
unambiguous
declared
format,
so
that
existing
authoring
tools
can
be
configured
to
export
finished
files
in
the
required
format.
[PP-2]
Support
the
association
of
authoring
and
rights
metadata
with
alternative
content
resources,
including
copyright
and
usage
information.
[PP-3]
Support
the
simple
replacement
of
alternative
content
resources
even
after
publishing.
This
is
again
dependent
on
authoring
practice
-
if
the
content
creator
delivers
a
final
media
file
that
contains
related
accessibility
content
inside
the
media
wrapper
(for
example
an
MP4
file),
then
it
will
require
an
appropriate
third-party
authoring
tool
to
make
changes
to
that
file
-
it
cannot
be
demanded
of
the
browser
to
do
so.
[PP-4]
Typically,
alternative
content
resources
are
created
by
different
entities
to
the
ones
that
create
the
media
content.
They
may
even
be
in
different
countries
and
not
be
allowed
to
re-publish
the
other
one's
content.
It
is
important
to
be
able
to
host
these
resources
separately,
associate
them
together
through
the
web
page
author,
and
eventually
play
them
back
synchronously
to
the
user.
4.5
Discovery
and
activation/deactivation
of
available
alternative
content
by
the
user
As
described
above,
individuals
need
a
variety
of
media
(alternative
content)
in
order
to
perceive
and
understand
the
content.
The
author
or
some
web
mechanism
provides
the
alternative
content.
This
alternative
content
may
be
part
of
the
original
content,
embedded
within
the
media
container
as
'fallback
content',
or
linked
from
the
original
content.
The
user
is
faced
with
discovering
the
availability
of
alternative
content.
Alternative
content
must
be
both
discoverable
by
the
user,
and
accessible
in
device
agnostic
ways.
The
development
of
APIs
and
user-agent
controls
should
adhere
to
the
following
UAAG
guidance:
The
user
agent
can
facilitate
the
discovery
of
alternative
content
by
following
these
criteria:
[DAC-1]
The
user
has
the
ability
to
have
indicators
rendered
along
with
rendered
elements
that
have
alternative
content
(e.g.,
visual
icons
rendered
in
proximity
of
content
which
has
short
text
alternatives,
long
descriptions,
or
captions).
In
cases
where
the
alternative
content
has
different
dimensions
than
the
original
content,
the
user
has
the
option
to
specify
how
the
layout/reflow
of
the
document
should
be
handled.
(UAAG
2.0
1.8.7).
[DAC-2]
The
user
has
a
global
option
to
specify
which
types
of
alternative
content to render by default
and,
in
cases
where
the
alternative
content
has
different
dimensions
than
the
original
content,
how
the
layout/reflow
of
the
document
should
be
handled.
(UAAG
2.0
1.8.7).
[DAC-3]
The
user
can
browse
the
alternatives
and
switch
between
them.
[DAC-4]
Synchronized
alternatives
for
time-based
media
(e.g.,
captions,
descriptions,
sign
language)
can
be
rendered
at
the
same
time
as
their
associated
audio
tracks
and
visual
tracks
(UAAG
2.0
2.11.4).
[DAC-5]
Non-synchronized
alternatives
(e.g.,
short
text
alternatives,
long
descriptions)
can
be
rendered
as
replacements
for
the
original
rendered
content
(UAAG
2.0
1.1.3).
[DAC-6]
Provide
the
user
with
the
global
option
to
configure
a
cascade
of
types
of
alternatives
to
render
by
default,
in
case
a
preferred
alternative
content
type
is
unavailable
(UAAG
2.0
1.1.4).
[DAC-7]
During
time-based
media
playback,
the
user
can
determine
which
tracks
are
available
and
select
or
deselect
tracks.
These
selections
may
override
global
default
settings
for
captions,
descriptions,
etc.
(UAAG
2.0
4.9.8)
[DAC-8]
Provide
the
user
with
the
option
to
load
time-based
media
content
such
that
the
first
frame
is
displayed
(if
video),
but
the
content
is
not
played
until
explicit
user
request.
(UAAG
2.0
2.11.2)
[DAC-9]
Provide
the
user
with
the
option
to
record
alternative
content
along
with
the
primary
content
on
devices
where
recording
is
available.
Note
This
feature
can
be
user
configurable
to
allow
maximum
flexibility
in
trading
off
the
anticipated
future
need
for
the
description
against
the
amount
of
extra
data
storage
required.
A
flexible
solution
giving
maximum
control
to
the
user
would
be
to
provide
a
global
setting
with
the
following
options:
-
Always
record
the
alternative
content
(the
best
default
option,
since
a
resource
recorded
by
one
user
may
later
be
accessed
by
another
different
user
who
may
have
different
and
unanticipated
requirements);
-
Record
the
alternative
content
only
if
it
is
active
at
the
time
of
recording;
-
Ask
at
recording
time
whether
to
record
the
alternative
content;
-
Never
record
the
alternative
content.
4.6
Requirements
on
making
properties
available
to
the
accessibility
interface
Often
forgotten
in
media
systems,
especially
with
the
newer
forms
of
packaging
such
as
DVD
menus
and
on-screen
program
guides,
is
the
fact
that
the
user
needs
to
actually
get
to
the
content,
control
its
playback,
and
turn
on
any
required
accessibility
options.
For
user
agents
supporting
accessibility
APIs
implemented
for
a
platform,
any
media
controls
need
to
be
connected
to
that
API.
On
self-contained
products
that
do
not
support
assistive
technology,
any
menus
in
the
content
need
to
provide
information
in
alternative
formats
(e.g.,
talking
menus).
Products
with
a
separate
remote
control,
or
that
are
self-contained
boxes,
should
ensure
the
physical
design
does
not
block
access,
and
should
make
accessibility
controls,
such
as
the
closed-caption
toggle,
as
prominent
as
the
volume
or
channel
controls.
[API-1]
The
existence
of
alternative-content
tracks
for
a
media
resource
must
be
exposed
to
the
user
agent.
[API-2]
Since
authors
will
need
access
to
the
alternative
content
tracks,
the
structure
needs
to
be
exposed
to
authors
as
well,
which
requires
a
dynamic
interface.
[API-3]
Accessibility
APIs
need
to
gain
access
to
alternative
content
tracks
no
matter
whether
those
content
tracks
come
from
within
a
resource
or
are
combined
through
markup
on
the
page.
4.7
Requirements
on
the
use
of
the
viewport
The
video
viewport
plays
a
particularly
important
role
with
respect
to
alternative-content
technologies.
Mostly
it
provides
a
bounding
box
for
many
of
the
visually
represented
alternative-content
technologies
(e.g.,
captions,
hierarchical
navigation
points,
sign
language),
although
some
alternative
content
does
not
rely
on
a
viewport
(e.g.,
full
transcripts,
descriptive
video).
One
key
principle
to
remember
when
designing
player
‘skins’
is
that
the
lower-third
of
the
video
may
be
needed
for
caption
text.
Caption
consumers
rely
on
being
able
to
make
fast
eye
movements
between
the
captions
and
the
video
content.
If
the
captions
are
in
a
non-standard
place,
this
may
cause
viewers
to
miss
information.
The
use
of
this
area
for
things
such
as
transport
controls,
while
appealing
aesthetically,
may
lead
to
accessibility
conflicts.
[VP-1]
It
must
be
possible
to
deal
with
three
different
cases
for
the
relation
between
the
viewport
size,
the
position
of
media
and
of
alternative
content:
-
the
alternative
content's
extent
is
specified
in
relation
to
the
media
viewport
(e.g.,
picture-in-picture
video,
lower-third
captions)
-
the
alternative
content
has
its
own
independent
extent,
but
is
positioned
in
relation
to
the
media
viewport
(e.g.,
captions
above
the
audio,
sign-language
video
above
the
audio,
navigation
points
below
the
controls)
-
the
alternative
content
has
its
own
independent
extent
and
doesn't
need
to
be
rendered
in
any
relation
to
the
media
viewport
(e.g.,
text
transcripts)
If
alternative
content
has
a
different
height
or
width
than
the
media
content,
then
the
user
agent
will
reflow
the
(HTML)
viewport.
(UAAG
2.0
1.8.7).
Note
This
may
create
a
need
to
provide
an
author
hint
to
the
web
page
when
embedding
alternative
content
in
order
to
instruct
the
web
page
how
to
render
the
content:
to
scale
with
the
media
resource,
scale
independently,
or
provide
a
position
hint
in
relation
to
the
media.
On
small
devices
where
the
video
takes
up
the
full
viewport,
only
limited
rendering
choices
may
be
possible,
such
that
the
UA
may
need
to
override
author
preferences.
[VP-2]
The
user
can
change
the
following
characteristics
of
visually
rendered
text
content,
overriding
those
specified
by
the
author
or
user-agent
defaults
(UAAG
2.0
1.4.1).
(Note:
this
should
include
captions
and
any
text
rendered
in
relation
to
media
elements,
so
as
to
be
able
to
magnify
and
simplify
rendered
text):
-
text
scale
(i.e.,
the
general
size
of
text),
-
font
family,
-
text
color
(i.e.,
foreground
and
background)
-
letter
spacing
(tracking
and
kerning)
-
line
spacing
(or
line
height),
and
-
word
spacing.
Note
This
should
be
achievable
through
UA
configuration
or
even
through
something
like
a
Greasemonkey
script
or
user
CSS
which
can
override
styles
dynamically
in
the
browser.
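One concrete route for such overrides in current engines is a user stylesheet targeting the ::cue pseudo-element; a sketch (values are examples, and only a limited subset of CSS properties applies inside ::cue):

```css
/* Sketch: user-stylesheet overrides for caption rendering via ::cue. */
video::cue {
  font-family: sans-serif;
  font-size: 150%;
  color: yellow;
  background-color: rgba(0, 0, 0, 0.9);
}
```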
[VP-3]
Provide
the
user
with
the
ability
to
adjust
the
size
of
the
time-based
media
up
to
the
full
height
or
width
of
the
containing
viewport,
with
the
ability
to
preserve
aspect
ratio
and
to
adjust
the
size
of
the
playback
viewport
to
avoid
cropping,
within
the
scaling
limitations
imposed
by
the
media
itself.
(UAAG
2.0
1.8.9)
Note
This
can
be
achieved
by
simply
zooming
into
the
web
page,
which
will
automatically
rescale
the
layout
and
reflow
the
content.
[VP-4]
Provide
the
user
with
the
ability
to
control
the
contrast
and
brightness
of
the
content
within
the
playback
viewport.
(UAAG
2.0
2.11.8)
Note
This
is
a
user-agent
device
requirement
and
should
already
be
addressed
in
the
UAAG.
In
live
content,
it
may
even
be
possible
to
adjust
camera
settings
to
achieve
this
requirement.
It
is
also
a
"
SHOULD
"
level
requirement,
since
it
does
not
account
for
limitations
of
various
devices.
[VP-5]
Captions
and
subtitles
traditionally
occupy
the
lower
third
of
the
video,
where
controls
are
also
usually
rendered.
The
user
agent
must
avoid
overlapping
of
overlay
content
and
controls
on
media
resources.
This
must
also
happen
if,
for
example,
the
controls
are
only
visible
on
demand.
Note
If
there
are
several
types
of
overlapping
overlays,
the
controls
should
stay
on
the
bottom
edge
of
the
viewport
and
the
others
should
be
moved
above
this
area,
all
stacked
above
each
other.
4.8
Requirements
on
the
parallel
use
of
alternative content on potentially multiple secondary screens and other devices
Multiple
secondary
user
devices
must
be
directly
addressable.
This
functionality
is
increasingly
also
known
by
the
new
term,
"Second
Screen,"
even
though
there
may
be
more
than
two
screens
in
any
given
viewing
environment,
and
even
though
not
all
secondary
devices
are
video
displays.
It
must
be
assumed
that
many
users
will
have
at
least
one
additional
display
device
(such
as
a
tablet),
and/or
at
least
one
additional
audio
output
device
(such
as
a
Bluetooth
headset)
attached
to
a
primary
video
display
device,
an
individual
computer,
or
locally
addressable
on
a
LAN.
It
must
be
possible
to
configure
certain
types
of
media
for
presentation
on
specific
devices,
and
these
configuration
settings
must
be
readily
overwritable
on
a
case-by-case
basis
by
users.
(A
request
to
the
UAAG
on
clarifications
to
a
number
of
these
points
was
made,
and
a
detailed
response
was
provided.
The
response
requires
review
and
integration
into
this
document,
but
can
be
found
today
in
the
22
July
2010
message
on
this
topic
).
Systems
supporting
multiple
secondary
devices
for
accessibility
must:
[SD-1]
Support
a
platform-accessibility
architecture
relevant
to
the
operating
environment.
(UAAG
2.0
4.1.1)
[SD-2]
Ensure
accessibility
of
all
user-interface
components
including
the
user
interface,
rendered
content,
and
alternative
content;
make
available
the
name,
role,
state,
value,
and
description
via
a
platform-accessibility
architecture.
(UAAG
2.0
4.1.2)
[SD-3]
If
a
feature
is
not
supported
by
the
accessibility
architecture(s),
provide
an
equivalent
feature
that
does
support
the
accessibility
architecture(s).
Document
the
equivalent
feature
in
the
conformance
claim.
(UAAG
2.0
4.1.3)
[SD-4]
If
the
user
agent
implements
one
or
more
DOMs,
they
must
be
made
programmatically
available
to
assistive
technologies.
(UAAG
2.0
4.1.4)
This
assumes
the
video
element
will
write
to
the
DOM.
[SD-5]
If
the
user
can
modify
the
state
or
value
of
a
piece
of
content
through
the
user
interface
(e.g.,
by
checking
a
box
or
editing
a
text
area),
the
same
degree
of
write
access
is
available
programmatically
(UAAG
2.0
4.1.5).
[SD-6]
If
any
of
the
following
properties
are
supported
by
the
accessibility-platform
architecture,
make
the
properties
available
to
the
accessibility-platform
architecture
(UAAG
2.0
4.1.6):
-
the
bounding
dimensions
and
coordinates
of
rendered
graphical
objects;
-
font
family;
-
font
size;
-
text
foreground
color;
-
text
background
color;
-
change
state/value
notifications.
[SD-7]
Ensure
that
programmatic
exchanges
between
APIs
proceed
at
a
rate
such
that
users
do
not
perceive
a
delay.
(UAAG
2.0
4.1.7).