Editorial note: This section is a rough draft. It will be edited to align with
People with Disabilities Use the Web once that document is complete.
This draft is included now to provide general background
for sections 2 and 3 of this document.
Comprehension of media may be affected by loss of visual function, loss
of audio function, cognitive issues, or a combination of all three.
Cognitive disabilities may affect access to and/or
comprehension of media. Physical disabilities such as dexterity impairment,
loss of limbs, or loss of use of limbs may affect access to media. Once richer
forms of media, such as virtual reality, become more commonplace, tactile
issues may come into play. Control of the media player can be an important
issue, e.g., for people with physical disabilities, however this is typically not addressed
by the media formats themselves, but is a requirement of the technology used
to build the player.
2.1 Blindness
People who are blind cannot access information if it is presented only
in the visual mode. They require information in an alternative representation,
which typically means the audio mode, although information can also be
presented as text. It is important to remember that not only the main video
is inaccessible, but any other visible ancillary information such as stock
tickers, status indicators, or other on-screen graphics, as well as any
visual controls needed to operate the content. Since people who are blind
use a screen reader and/or refreshable braille display, these assistive
technologies (ATs) need to work hand-in-hand with the access mechanism
provided for the media content.
2.2 Low vision
People with low vision can use some visual information.
Depending on their visual
ability they might have specific issues such as difficulty discriminating
foreground information from background information, or discriminating colors.
Glare caused by excessive scattering in the eye can be a significant challenge,
especially for very bright content or surroundings. They may be unable
to react quickly to transient information, and may have a narrow angle
of view and so may not detect key information presented temporarily where
they are not looking, or in text that is moving or scrolling. A person
will likely use screen magnification software. This means that they will only be viewing
a portion of the screen, and so must manage tracking media content via
their AT. They may have difficulty reading when text is too small, has
poor background contrast (too high or too low), or when outlined or other fancy font types or
effects are used. If the font is an image, it is likely to appear grainy when magnified.
They may be using an AT that adjusts all the colors of
the screen, such as inverting the colors, so the media content must be
viewable through the AT. Users with low vision will often benefit from the same
text streams and instructions that are sometimes hidden or displayed off screen for
users of screen readers or refreshable braille displays.
2.3 Atypical color perception
A significant percentage of the population has atypical color perception,
and may not be able to discriminate between different colors, or may miss
key information when it is coded with color only. They might have difficulty discriminating
foreground information from background information, or discriminating colors. Such issues
can be minimized when the user has the ability to customize the color and contrast of text content.
2.4 Deafness
People who are deaf generally cannot use audio. Thus, an alternative representation
is required, typically through synchronized captions and/or sign translation.
2.5 Hard of hearing
People who are hard of hearing may be able to use some audio material,
but might not be able to discriminate certain types of sound, and may miss
any information presented as audio only if it contains frequencies they
can't hear, or is masked by background noise or distortion. They may miss
audio which is too quiet, or of poor quality. Speech may be challenging
if it is too fast and cannot be played back more slowly. Information presented
using multichannel audio (e.g., stereo) may not be perceived by people
who are deaf in one ear.
2.6 Deaf-blind
Individuals who are deaf-blind have a combination of conditions that may
result in one of the following: blindness and deafness; blindness and difficulty
in hearing; low vision and deafness; or low vision and difficulty in hearing.
Depending on their combination of conditions, individuals who are deaf-blind
may need captions that can be enlarged, changed to high-contrast colors,
or otherwise styled; or they may need captions and/or described video that
can be presented with AT (e.g., a refreshable braille display). They may
need synchronized captions and/or described video, or they may need a non-time-based
transcript which they can read at their own pace.
2.7 Physical impairment
People with physical disabilities such as poor dexterity, loss of limbs, or
loss of use of limbs may use the keyboard alone rather than the combination
of a pointing device plus keyboard to interact with content and controls,
or may use a switch with an on-screen keyboard, or other assistive technology.
The player itself must be usable via the keyboard and pointing
devices. The user must have full access to all player controls, including
methods for selecting alternative content.
2.8 Cognitive and neurological disabilities
Cognitive and neurological disabilities include a wide range of conditions
that may include intellectual disabilities (called learning disabilities
in some regions), autism-spectrum disorders, memory impairments, mental-health
disabilities, attention-deficit disorders, audio- and/or visual-perceptive
disorders, dyslexia and dyscalculia (called learning disabilities in some
regions), or seizure disorders. Necessary accessibility supports vary widely
for these different conditions. Individuals with some conditions may process
information aurally better than by reading text; therefore, information
that is presented as text embedded in a video should also be available
as audio descriptions. Individuals with other conditions may need to reduce
distractions or flashing in presentations of video. Some conditions, such
as autism-spectrum disorders, may have multi-system effects, and individuals
may need a combination of different accommodations.
Overall, the media experience for people on the autism spectrum should
be customizable and well designed so as to not be overwhelming. Care
must be taken to present a media experience that focuses on the purpose
of the content and provides alternative content in a clear, concise manner.
3. Alternative Content Technologies
A number of alternative content types have been developed to help users
with sensory disabilities gain access to audio-visual content. This section
lists them, explains generally what they are, and provides a number of requirements
on each that need to be satisfied by the technology developed around the HTML5
media elements.
3.1 Described video
Described video contains descriptive narration of key visual elements
designed to make visual media accessible to people who are blind or visually
impaired. The descriptions include actions, costumes, gestures, scene changes,
or any other important visual information that someone who cannot see the
screen might ordinarily miss. Descriptions are traditionally audio recordings
timed and recorded to fit into natural pauses in the program, although
they may also briefly obscure the main audio track. (See the section on
extended descriptions for an alternative approach.) The descriptions are
usually read by a narrator with a voice that cannot be easily confused
with other voices in the primary audio track. They are authored to convey
objective information (e.g., a yellow flower) rather than subjective judgments
(e.g., a beautiful flower).
As with captions, descriptions can be open or closed.
Open descriptions are merged with the program-audio
track and cannot be turned off by the viewer.
Closed descriptions can be turned on and off by the
viewer. They can be recorded as a separate track containing descriptions
only, timed to play at specific spots in the timeline and played in parallel
with the program-audio track.
- Some descriptions can be delivered as a separate audio channel
mixed in at the player.
- Other options include a computer-generated ‘text to speech’
track, also known as text video descriptions. This is described
in the next subsection.
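As a rough illustration of the first delivery option above (a separate description channel mixed at the player), the following sketch keeps a description-only audio resource synchronized with the main video and lets the user toggle it and balance the two volumes. The element id and file name are assumptions for illustration, not part of any specification.

```ts
// Sketch: "closed" descriptions delivered as a separate audio resource and
// mixed at the player. Ids and file names are illustrative assumptions.
const video = document.querySelector<HTMLVideoElement>("#program")!;
const descriptions = new Audio("descriptions-only.mp3"); // descriptions track only

// Toggle descriptions on/off without disturbing program audio.
function setDescriptionsEnabled(on: boolean): void {
  descriptions.muted = !on;
}

// Keep the description track locked to the media resource's timebase, cf. [DV-2] below.
video.addEventListener("play", () => {
  descriptions.currentTime = video.currentTime;
  void descriptions.play();
});
video.addEventListener("pause", () => descriptions.pause());
video.addEventListener("seeked", () => {
  descriptions.currentTime = video.currentTime;
});

// Independent volume control for program vs. description audio, cf. [DV-5]/[DV-6] below.
function setVolumes(program: number, description: number): void {
  video.volume = program;
  descriptions.volume = description;
}
```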
Described video provides benefits that reach beyond blind or visually
impaired viewers; e.g., students grappling with difficult materials or
concepts. Descriptions can be used to give supplemental information about
what is on screen—the structure of lengthy mathematical equations or the
intricacies of a painting, for example.
Described video is available on some television programs and in many movie
theaters in the U.S. and other countries. Regulations in the U.S. and Europe
are increasingly focusing on description, especially for television, reflecting
its priority with citizens who have visual impairments. The technology
needed to deliver and render basic video descriptions is in fact relatively
straightforward, being an extension of common audio-processing solutions.
Playback products must support multi-audio channels required for description,
and any product dealing with broadcast TV content must provide adequate
support for descriptions. Descriptions can also provide text that can be
indexed and searched.
Systems supporting described video, other than open descriptions, must:
[DV-1] Provide an indication that descriptions are available, and are active/non-active.
[DV-2] Render descriptions in a time-synchronized manner, using
the media resource as the timebase master.
[DV-3] Support multiple description tracks (e.g., discrete tracks
containing different levels of detail).
[DV-4] Support recordings of real human speech as a track of the
media resource, or as an external file.
[DV-5] Allow the author to independently adjust the volumes of the
audio description and original soundtracks.
[DV-6] Allow the user to independently adjust the volumes of the
audio description and original soundtracks, with the user's settings
overriding the author's.
[DV-7] Permit smooth changes in volume rather than stepped changes.
The degree and speed of volume change should be under user control.
[DV-8] Allow the author to provide fade and pan controls to be accurately
synchronized with the original soundtrack.
[DV-9] Allow the author to use a codec which is optimized for voice
only, rather than requiring the same codec as the original soundtrack.
[DV-10] Allow the user to select from among different languages
of descriptions, if available, even if they are different from the
language of the main soundtrack.
[DV-11] Support the simultaneous playback of both the described
and non-described audio tracks so that one may be directed at separate
outputs (e.g., a speaker and headphones).
[DV-12] Provide a means to prevent descriptions from carrying over
from one program or channel when the user switches to a different program or channel.
[DV-13] Allow the user to relocate the description track within
the audio field, with the user setting overriding the author setting.
The setting should be re-adjustable as the media plays.
[DV-14] Support metadata, such as copyright information and usage rights.
3.2 Text video description
Described video that uses text for the description source rather than
a recorded voice creates specific requirements.
Text video descriptions (TVDs) are delivered to the client as text and
rendered locally by assistive technology such as a screen reader or a braille
device. This can have advantages for screen-reader users who want full
control of the preferred voice and speaking rate, or other options to control
the speech synthesis.
Text video descriptions are provided as text files containing start times
for each description cue. Since the duration that a screen reader takes
to read out a description cannot be determined during authoring of the
cues, it is difficult to ensure they don't obscure the main audio or other
description cues. There are at least three reasons for this:
- An author of text video descriptions typically does not have a screen reader,
and so cannot check whether the description fits within the time frame. Even
with a screen reader, a given user's screen reader may be set to a different
reading speed and may take longer to read the same text.
- Some screen-reader users (e.g., those who are elderly or have learning
disabilities) may slow down the speech rate.
- A visually complicated scene (e.g., figures on a blackboard in an
online physics class) may require more description time than is available
in the program-audio track.
People with low vision may also benefit from having access to text video descriptions.
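To make the local-rendering model concrete, here is a minimal sketch, assuming the page carries a <track kind="descriptions"> on the video element. The browser's speech synthesis stands in for the user's screen reader; ids and the rate value are illustrative.

```ts
// Sketch: speaking text video description cues as they become active.
const video = document.querySelector<HTMLVideoElement>("#lecture")!;
const track = Array.from(video.textTracks).find(t => t.kind === "descriptions");

if (track) {
  track.mode = "hidden"; // fire cue events without visual rendering
  track.addEventListener("cuechange", () => {
    const cue = track.activeCues?.[0] as VTTCue | undefined;
    if (cue) {
      const utterance = new SpeechSynthesisUtterance(cue.text);
      utterance.rate = 1.2; // user-controlled speaking rate, cf. [TVD-5]
      speechSynthesis.speak(utterance);
    }
  });
}
```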
Systems supporting text video descriptions must:
[TVD-1] Support presentation of text video descriptions through
a screen reader, braille device and/or modified print with playback speed control,
voice control and synchronization points within the video.
[TVD-2] TVDs need to be provided in a format that contains the following information:
- start time, text per description cue (the duration is determined
dynamically, though an end time could provide a cut point)
- possibly a speech-synthesis markup to improve quality of
the description (existing speech synthesis markups include SSML and CSS 3 Speech Module)
- accompanying metadata providing labeling for speakers, language, and so on
- visual style markup (see section on Captioning).
[TVD-3] Where possible, provide a text or separate audio track privately
to those that need it in a mixed-viewing situation, e.g., through headphones.
[TVD-4] Where possible, provide options for authors and users to
deal with the overflow case: continue reading, stop reading, and pause
the video. (One solution from a user's point of view may be to pause
the video and finish reading the TVD, for example.) User preference
should override authored option.
[TVD-5] Support the control over speech-synthesis playback speed,
volume and voice, and provide synchronization points with the video.
3.3 Extended video descriptions
Video descriptions are usually provided as recorded speech, timed to play
in the natural pauses in dialog or narration. In some types of material,
however, there is not enough time to present sufficient descriptions. To
meet such cases, the concept of extended description was developed. Extended
descriptions work by pausing the video and program audio at key moments,
playing a longer description than would normally be permitted, and then
resuming playback when the description is finished playing. This will naturally
extend the timeline of the entire presentation. This procedure has not
been possible in broadcast television; however, hard-disk recording and
on-demand Internet systems can make this a practical possibility.
Extended video description (EVD) has been reported to have benefits for
people with cognitive disabilities; for example, it might benefit people with Asperger
Syndrome and other Autistic Spectrum Disorders, in that it can make connections
between cause and effect, point out what is important to look at, or explain
moods that might otherwise be missed.
Systems supporting extended audio descriptions must:
[EVD-1] Support detailed user control, as specified in [TVD-4], for
extended video descriptions.
[EVD-2] Support automatically pausing the video and main audio tracks
in order to play a lengthy description.
[EVD-3] Support resuming playback of video and main audio tracks
when the description is finished.
Because the user is the ultimate arbiter of the rate at which TTS
playback occurs, it is not feasible for an author to guarantee that any
texted audio description can be played within the natural pauses in
dialog or narration of the primary audio resource. Therefore, all texted
descriptions must be treated as extended text descriptions potentially
requiring the pausing and resumption of primary resource playback.
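A minimal sketch of this pause-and-resume behavior follows, with the browser's speech synthesis standing in for the user's configured TTS; the function name is an invention for illustration.

```ts
// Sketch: treat a texted description as an extended description by halting
// the media while it is spoken [EVD-2] and resuming afterwards [EVD-3].
function speakAsExtendedDescription(video: HTMLVideoElement, text: string): void {
  const wasPlaying = !video.paused;
  video.pause(); // halt video and main audio
  const utterance = new SpeechSynthesisUtterance(text);
  utterance.onend = () => {
    if (wasPlaying) void video.play(); // resume once the description finishes
  };
  speechSynthesis.speak(utterance);
}
```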
3.4 Clean audio
A relatively recent development in television accessibility is the concept
of clean audio, which takes advantage of the increased adoption of multichannel
audio. This is primarily aimed at audiences who are hard of hearing, and
consists of isolating the audio channel containing the spoken dialog and
important non-speech information that can then be amplified or otherwise
modified, while other channels containing music or ambient sounds are attenuated.
Using the isolated audio track may make it possible to apply more sophisticated
audio processing such as pre-emphasis filters, pitch-shifting, and so on
to tailor the audio to the user's needs, since hearing loss is typically
frequency-dependent, and the user may have usable hearing in some bands
yet none at all in others.
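One possible realization, assuming the dialog and ambience channels arrive as two separate media elements, is to mix them with the Web Audio API; the gain and filter values below are illustrative only, not recommendations.

```ts
// Sketch: clean-audio mixing with separate gain per channel and an optional
// pre-emphasis filter for frequency-dependent hearing loss.
const ctx = new AudioContext();
const dialogEl = document.querySelector<HTMLAudioElement>("#dialog")!;
const ambienceEl = document.querySelector<HTMLAudioElement>("#ambience")!;

const dialog = ctx.createMediaElementSource(dialogEl);
const ambience = ctx.createMediaElementSource(ambienceEl);

const dialogGain = ctx.createGain();
dialogGain.gain.value = 1.5;   // amplify speech and key non-speech sounds
const ambienceGain = ctx.createGain();
ambienceGain.gain.value = 0.3; // attenuate music and ambient sound

// Pre-emphasis: boost the bands where the user still has usable hearing.
const preEmphasis = ctx.createBiquadFilter();
preEmphasis.type = "highshelf";
preEmphasis.frequency.value = 2000; // Hz, illustrative
preEmphasis.gain.value = 6;         // dB, illustrative

dialog.connect(preEmphasis).connect(dialogGain).connect(ctx.destination);
ambience.connect(ambienceGain).connect(ctx.destination);
```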
Systems supporting clean audio and multiple audio tracks must:
[CA-1] Support clean
audio as a separate, alternative audio track from other audio-based
alternative media resources, including the primary audio resource.
[CA-2] Support the synchronization of multitrack audio either within
the same file or from separate files - preferably both.
[CA-3] Support separate volume control of the different audio tracks.
[CA-4] Support pre-emphasis filters, pitch-shifting, and other audio-processing techniques.
3.6 Captioning
For people who are deaf or hard of hearing, captioning is a prime alternative
representation of audio. Captions are in the same language as the main
audio track and, in contrast to foreign-language subtitles, render a transcription
of dialog or narration as well as important non-speech information, such
as sound effects, music, and laughter. Historically, captions have been
either closed or open. Closed captions have been transmitted as data along
with the video but were not visible until the user elected to turn them
on, usually by invoking an on-screen control or menu selection. Open captions
have always been visible; they had been merged with the video track and
could not be turned off.
Ideally, captions should be a verbatim representation of the audio; however,
captions are sometimes edited for various reasons, for example for reading
speed or for language level. In general, consumers of captions have expressed
that the text should represent exactly what is in the audio track. If edited
captions are provided, then they should be clearly marked as such, and
the full verbatim version should also be available as an option.
The timing of caption text can coincide with the mouth movement of the
speaker (where visible), but this is not strictly necessary. For timing
purposes, captions may sometimes precede or extend slightly after the audio
they represent. Captioning should also use adequate means to distinguish
between speakers as turn-taking occurs during conversation; this has in
the past been done by positioning the text near the speaker, by associating
different colors to different speakers, or by putting the name and a colon
in front of the text line of a speaker.
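The sketch below illustrates time-synchronized rendering with the name-and-colon speaker convention just described, using the media resource as the timebase master; the cue data, interface, and element ids are invented for illustration.

```ts
// Sketch: render whichever caption cues are active at the current media time.
interface CaptionCue { start: number; end: number; speaker?: string; text: string; }

const cues: CaptionCue[] = [
  { start: 1.0, end: 3.5, speaker: "Anna", text: "Did you hear that?" },
  { start: 3.5, end: 5.0, text: "[door slams]" }, // non-speech information
];

const video = document.querySelector<HTMLVideoElement>("#program")!;
const captionBox = document.querySelector<HTMLElement>("#captions")!;

video.addEventListener("timeupdate", () => {
  const t = video.currentTime; // the media resource is the timebase master
  const active = cues.filter(c => t >= c.start && t < c.end);
  captionBox.textContent = active
    .map(c => (c.speaker ? `${c.speaker}: ${c.text}` : c.text))
    .join("\n");
});
```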
Captions are useful to a wide array of users in addition to their originally
intended audiences. Gyms, bars, and restaurants regularly employ captions
as a way for patrons to watch television while in those establishments.
People learning to read or learning the language of the country where they
live as a second language also benefit from captions: research has shown
that captions help reinforce vocabulary and language. Captions can also
provide a powerful search capability, allowing users and search engines
to search the caption text to locate a specific video or an exact point
in a video.
Formats for captions, subtitles or foreign-language subtitles must:
[CC-1] Render text in a time-synchronized manner,
using the media resource as the timebase master.
Most of the time, the main audio track would be the best candidate
for the timebase. Where a video without audio, but with a text track,
is available, the video track becomes the timebase master. Also, there
may be situations where an explicit timing track is available.
[CC-2] Allow the author to specify erasures, i.e.,
times when no text is displayed on the screen (no text cues are active).
This should be possible both within media resources and caption files.
[CC-3] Allow the author to assign timestamps so
that one caption/subtitle follows another, with no perceivable gap in between.
This means that caption cues should be able to either let the
start time of the subsequent cue be determined by the duration of the
cue or have the end time be implied by the start of the next cue. For
overlapping captions, explicit start and end times are then required.
[CC-4] Be available in a text encoding.
This means that a determinable character encoding should be supported,
either by making the character encoding explicit or
by enforcing a single default such as UTF-8.
[CC-5] Support positioning in all parts of the
screen - either inside the media viewport or in a determined
space next to the media viewport. This is particularly important when
multiple captions are on screen at the same time and relate to different
speakers, or when in-picture text is avoided.
The minimum requirement is a bounding box (with an optional background)
into which text is flowed, and that probably needs to be pixel aligned.
The absolute position of text within the bounding box is less critical,
although it is important to be able to avoid bad word-breaks and have
adequate white space around letters and so on. There is more on this
in a separate requirement.
The caption format could provide a min-width/min-height for its bounding
box, which typically is calculated from the bottom of the video viewport,
but can be placed elsewhere by the web page, with the web page being
able to make that box larger and scale the text relatively, too.
Positions inside the box should probably map to regions such as top,
right, bottom, left, and center.
[CC-6] Support the display of multiple regions
of text simultaneously.
This typically relates to multiple text cues that are defined
on overlapping times. If the cues target different
spatial regions, they can be displayed simultaneously.
[CC-7] Display multiple rows of text when rendered
as text in a right-to-left or left-to-right language.
Internationalization is important not just for subtitles, as captions
can be used in all languages.
[CC-8] Allow the author to specify line breaks.
[CC-9] Permit a range of font faces and sizes.
[CC-10] Render a background in a range of colors,
supporting a full range of opacity levels.
[CC-11] Render text in a range of colors.
The user should have final control over rendering styles like
color and fonts; e.g., through user preferences.
[CC-12] Enable rendering of text with a thicker
outline or a drop shadow to allow for better contrast with the background.
[CC-13] Where a background is used, it is preferable
to keep the caption background visible even in times where no text
is displayed, such that it minimizes distraction. However, where captions
are infrequent the background should be allowed to disappear to enable
the user to see as much of the underlying video as possible.
It may be technically possible to have cues without text.
[CC-14] Allow the use of mixed display styles,
e.g., mixing paint-on captions with pop-on captions, within a single
caption cue or in the caption stream as a whole. Pop-on captions are
usually one or two lines of captions that appear on screen and remain
visible for one to several seconds before they disappear. Paint-on
captions are individual characters that are "painted on" from
left to right, not popped onto the screen all at once, and usually
are verbatim. Another often-used caption style in live captioning is
roll-up: cue text follows double chevrons ("greater than" symbols)
that indicate different speaker identifications, and each sentence "rolls
up" to about three lines. The top line of the three disappears
as a new bottom line is added, allowing the continuous rolling up of
new lines of captions.
Similarly, in karaoke, individual characters are often "painted on".
[CC-15] Support positioning such that the lowest
line of captions appears at least 1/12 of the total screen height above
the bottom of the screen, when rendered as text in a right-to-left
or left-to-right language.
[CC-16] Use conventions that include inserting
left-to-right and right-to-left segments within a vertical run (e.g.,
tate-chu-yoko in Japanese), when rendered as text in a top-to-bottom language.
[CC-17] Represent content of different natural
languages. In some cases a few foreign words form
part of the original soundtrack, and thus need to be in the same caption
resource. Also allow for separate caption files for different languages
and on-the-fly switching between them. This is also a requirement for
foreign-language subtitles.
Caption/subtitle files that are alternatives in different languages
are probably best provided in different caption resources and are user
selectable. Realistically, having no more than 2 languages present at
the same time on screen is probably the limit.
[CC-18] Represent content of at least those specific
natural languages that may be represented with [Unicode 3.2], including
common typographical conventions of that language (e.g., through the
use of furigana and other forms of ruby text).
[CC-19] Present the full range of typographical
glyphs, layout and punctuation marks normally associated with the natural
language's print-writing system.
[CC-20] Permit in-line mark-up for foreign words or phrases.
Italics markup may be sufficient for a human user, but it is important
to be able to mark up languages so that the text can be rendered correctly,
since the same Unicode characters can be shared between languages and rendered differently
in different contexts. This is mainly a localization issue. It is also important
for audio rendering, to get correct pronunciation.
[CC-21] Permit the distinction between different speakers.
Further, systems that support captions must:
[CC-22] Support captions that are provided inside
media resources as tracks, or in external files.
It is desirable to expose the same API to both.
[CC-23] Ensure that captions are displayed in
sync with the media resource.
[CC-24] Support user activation/deactivation of caption tracks.
This requires a menu of some sort that displays the available
tracks for activation/deactivation.
[CC-25] Support edited and verbatim captions, if available.
Edited and verbatim captions can be provided in two different
caption resources. There is a need to expose to the user how they differ,
similar to how there can be caption tracks in different languages.
[CC-26] Support multiple tracks of foreign-language
subtitles in different languages.
These different-language "tracks" can be provided in different caption resources.
[CC-27] Support live-captioning functionality.
3.7 Enhanced captions/subtitles
Enhanced captions are timed text cues that have been enriched with further
information - examples are glossary definitions for acronyms and other
initialisms, foreign terms (for example, Latin), jargon, or descriptions
for other difficult language. They may be age-graded, so that multiple
caption tracks are supplied, or the glossary function may be added dynamically
through machine lookup.
Glossary information can be added in the normal time allotted for the
cue (e.g., as a callout or other overlay), or it might take the form of
a hyperlink that, when activated, pauses the main content and allows access
to more complete explanatory material.
Such extensions can provide important additional information
that will enable or improve the understanding of the main content for users of assistive
technology. Enhanced text cues will be particularly useful for those with restricted
reading skills, to subtitle users, and to caption users. Users may often
come across keywords in text cues that lend themselves to further in-depth
information or hyperlinks, such as an e-mail contact or phone number for
a person, a strange term that needs a link to a definition, or
an idiom that needs comments to explain it to a foreign-language speaker.
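As a rough sketch of how glossary enhancement might work, the snippet below turns known glossary terms in a cue into activatable buttons that pause the media timeline; the glossary contents, ids, and the alert() stand-in for a proper overlay are all assumptions.

```ts
// Sketch: enhance a caption cue with glossary lookups.
const glossary = new Map<string, string>([
  ["HTML5", "The fifth major revision of the HTML markup language."],
]);

function renderEnhancedCue(video: HTMLVideoElement, box: HTMLElement, text: string): void {
  box.textContent = "";
  for (const word of text.split(" ")) {
    const definition = glossary.get(word);
    if (definition) {
      const btn = document.createElement("button");
      btn.textContent = word;
      btn.title = definition; // simple metadata markup, cf. @title in [ECC-1]
      btn.addEventListener("click", () => {
        video.pause();      // pause the timeline while reading, cf. [ECC-2]
        alert(definition);  // stand-in for a richer overlay
      });
      box.append(btn, " ");
    } else {
      box.append(word + " ");
    }
  }
}
```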
Systems that support enhanced captions must:
[ECC-1] Support metadata markup for (sections of)
timed text cues.
Such "metadata" markup can be realized through a @title
attribute on a <span> of the text, or a hyperlink to another location
where a term is explained, an <abbr> element, an <acronym> element,
a <dfn> element, or through RDFa or microdata.
[ECC-2] Support hyperlinks and other activation
mechanisms for supplementary data for (sections of) caption text.
This can be realized through inclusion of links or
buttons into timed text cues, where additional overlays could be created
or a different page be loaded. One needs to deal here with the need to
pause the media timeline for reading of the additional information.
[ECC-3] Support text cues that may be longer than
the time available until the next text cue and thus provide overlapping
text cues - in this case, a feature should be provided to decide whether
the overlap is acceptable, whether the cue should be cut short, or whether
the media resource should be paused while the caption is displayed.
Timing would be provided by the author, but with the user being able to override it.
This feature is analogous to extended video descriptions - where
timing for a text cue is longer than the available time for the cue,
it may be necessary to halt the media to allow for more time to read
back on the text and its additional material. In this case, the pause
is dependent on the user's reading speed, so this may imply user control
of the pause behavior.
[ECC-4] It needs to be possible to define timed
text cues that are allowed to overlap with each other in time and be
present on screen at the same time (e.g., those that come from speech
of different speakers), and others that are not allowed to overlap and
thus cause media playback to pause so that users can catch up on their reading.
[ECC-5] Allow users to define the reading speed
and thus define how long each text cue requires, and whether media
playback needs to pause sometimes to let them catch up on their reading.
This can be a setting in the UA, which will define user-interface requirements.
3.8 Sign translation
Sign language shares the same concept as captioning: it presents both
speech and non-speech information in an alternative format. Note that due
to the wide regional variation in signing systems (e.g., American Sign
Language vs British Sign Language), sign translation may not be appropriate
for content with a global audience unless localized variants can be made available.
Signing can be open, mixed with the video and offered as an entirely alternative
stream, or closed (using some form of picture-in-picture or alpha-blending
technology). It is possible to use quite low bit rates for much of the
signing track, but it is important that facial, arm, hand and other body
gestures be delivered at sufficient resolution to support legibility. Animated
avatars may not currently be sufficient as a substitute for human signers,
although research continues in this area and it may become practical at
some point in the future.
Acknowledging that not all devices will be capable of handling multiple
video streams, this is a SHOULD requirement for browsers where hardware
is capable of support. Strong authoring guidance for content creators will
mitigate situations where user-agents are unable to support multiple video
streams (WCAG) - for example, on mobile devices that cannot support multiple
streams, authors should be encouraged to offer two versions of the media
stream, including one with signed captions burned into the media.
Selecting from multiple tracks for different sign languages should be
achieved in the same fashion that multiple caption/subtitle files are handled.
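A minimal sketch of synchronized parallel playback of a sign-language video, with user activation and deactivation, might look as follows; element ids are assumptions.

```ts
// Sketch: keep a sign-language video in sync with the media resource [SL-2]
// and let the user toggle it [SL-5].
const main = document.querySelector<HTMLVideoElement>("#main")!;
const signing = document.querySelector<HTMLVideoElement>("#signing")!;
signing.muted = true; // only the main resource carries audio

main.addEventListener("play", () => void signing.play());
main.addEventListener("pause", () => signing.pause());
main.addEventListener("seeked", () => {
  signing.currentTime = main.currentTime;
});

function setSigningVisible(on: boolean): void {
  signing.hidden = !on;
  if (!on) signing.pause();
}
```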
Systems supporting sign language must:
[SL-1] Support sign-language video either as a track as part of
a media resource or as an external file.
[SL-2] Support the synchronized playback of the sign-language video
with the media resource.
[SL-3] Support the display of sign-language video either as picture-in-picture
or alpha-blended overlay, as parallel video, or as the main video with
the original video as picture-in-picture or alpha-blended overlay.
Parallel video here means two discrete videos playing in sync with
each other. It is preferable to have one discrete <video> element
contain all pieces for sync purposes rather than specifying multiple <video> elements
intended to work in sync.
[SL-4] Support multiple sign-language tracks in several sign languages.
[SL-5] Support the interactive activation/deactivation of a sign-language
track by the user.
3.9 Transcripts
While synchronized captions are generally preferable for people with hearing
impairments, for some users they are not viable – those who are deaf-blind,
for example, or those with cognitive or reading impairments that make it
impossible to follow synchronized captions. And even with ordinary captions,
it is possible to miss some information as the captions and the video require
two separate loci of attention. The full transcript supports different
user needs and is not a replacement for captioning. A transcript can
be presented simultaneously with the media material, which can assist slower
readers or those who need more time to reference context, but it should
also be made available independently of the media.
A full text transcript should include information that would be in both
the caption and video description, so that it is a complete representation
of the material, as well as containing any interactive options.
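One possible programmatic association between a media resource and a separately hosted transcript is sketched below; the URL and element ids are illustrative assumptions.

```ts
// Sketch: load a transcript from its own linked resource and expose the
// association to AT, cf. [T-1]/[T-2] below.
const video = document.querySelector<HTMLVideoElement>("#program")!;
const transcriptArea = document.getElementById("transcript")!;

// Make the linkage programmatically discoverable by AT.
video.setAttribute("aria-describedby", "transcript");

// Fetch the transcript so it can also be read independently of the media.
fetch("/transcripts/program.html")
  .then(response => response.text())
  .then(html => { transcriptArea.innerHTML = html; });
```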
Systems supporting transcripts must:
[T-1] Support the provisioning of a full text transcript for the
media asset in a separate but linked resource, where the linkage is
programmatically accessible to AT.
[T-2] Support the provisioning of both scrolling and static display
of a full text transcript with the media resource, e.g., in an area next
to the video or underneath the video, which is also accessible to AT.
[T-3] Allow the user to customize the visual rendering of the full
text transcript, e.g., font, font size, foreground and background color, line, letter, and word spacing.
4. System Requirements
4.3 Time-scale modification
While all devices may not support the capability, a standard control API
must support the ability to speed up or slow down content presentation
without altering audio pitch.
While perhaps unfamiliar to some, this feature has been present
on many devices, especially audiobook players, for some 20 years now.
The user can adjust the playback rate of prerecorded time-based media
content, such that all of the following are true (UAAG 2.0 2.11.4):
[TSM-1] The user can adjust the playback rate of the time-based
media tracks to between 50% and 250% of real time.
[TSM-2] Speech whose playback rate has been adjusted by the user
maintains pitch in order to limit degradation of the speech quality.
[TSM-3] All provided alternative media tracks remain synchronized
across this required range of playback rates.
[TSM-4] The user agent provides a function that resets the playback
rate to normal (100%).
[TSM-5] The user can stop, pause, and resume rendered audio and
animation content (including video and animated images) that last three
or more seconds at their default playback rate. (UAAG 2.0 2.11.5)
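With the HTML5 media API, a sketch of this behavior might use playbackRate together with preservesPitch, where the browser supports the latter; the clamp range mirrors [TSM-1].

```ts
// Sketch: time-scale modification without altering audio pitch.
const media = document.querySelector<HTMLVideoElement>("#program")!;

function setPlaybackRate(rate: number): void {
  media.playbackRate = Math.min(2.5, Math.max(0.5, rate)); // 50%-250% [TSM-1]
  media.preservesPitch = true; // keep speech pitch constant [TSM-2]
}

function resetPlaybackRate(): void {
  media.playbackRate = 1.0; // back to real time [TSM-4]
}
```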
4.4 Production practice and resulting requirements
One of the biggest challenges to date has been the lack of a universal system
for media access. In response to user requirements, various countries and
groups have defined systems to provide accessibility, especially captioning
for television. However, these systems are typically not compatible. In
some cases the formats can be inter-converted, but some formats, for example
DVD sub-pictures, are image-based and are difficult to convert to text.
Caption formats are often geared towards delivery of the media, for example
as part of a television broadcast. They are not well suited to the production
phases of media creation. Media creators have developed their own internal
formats which are more amenable to the editing phase, but to date there
has been no common format that allows interchange of this data.
Any media-based solution should attempt to reduce, as far as possible, the layers
of translation between production and delivery.
In general captioners use a proprietary workstation to prepare caption
files; these can often export to various standard broadcast ingest formats,
but in general files are not inter-convertible. Most video editing suites
are not set up to preserve captioning, and so this typically has to be
added after the final edit is decided on; furthermore, since this work is
often outsourced, the copyright holder may not hold the final editable
version of the captions. Thus when programming is later re-purposed, e.g.,
a shorter edit is made or a ‘director's cut’ produced, the captioning may
have to be redone in its entirety. Similarly, and particularly for news
footage, parts of the media may go to web before the final TV edit is made,
and thus the captions that are produced for the final TV edit are not available
for the web version.
It is important, when purchasing or commissioning media, that captioning
and described video be taken into account and given equal priority, in terms
of ownership, rights of use, etc., with the video and audio itself.
This is primarily an authoring requirement. It is understood that a
common time-stamp format must be declared in HTML5, so that authoring tools
can conform to a required output.
Systems supporting accessibility needs for media must:
[PP-1] Support existing production practice for alternative content
resources, in particular allow for the association of separate alternative
content resources to media resources. Browsers cannot support all forms
of time-stamp formats out there, just as they cannot support all forms
of image formats (etc.). This necessitates a clear and unambiguous
declared format, so that existing authoring tools can be configured
to export finished files in the required format.
[PP-2] Support the association of authoring and rights metadata
with alternative content resources, including copyright and usage information.
[PP-3] Support the simple replacement of alternative content resources
even after publishing. This is again dependent on authoring practice
- if the content creator delivers a final media file that contains
related accessibility content inside the media wrapper (for example
an MP4 file), then it will require an appropriate third-party authoring
tool to make changes to that file - it cannot be demanded of the browser
to do so.
[PP-4] Typically, alternative content resources are created by different
entities than the ones that create the media content. They may even be
in different countries and not be allowed to re-publish the other one's
content. It is important to be able to host these resources separately,
associate them together through the web page author, and eventually
play them back synchronously to the user.
4.5 Discovery and activation/deactivation of available alternative content
by the user
As described above, individuals need a variety of media (alternative content)
in order to perceive and understand the content. The author or some web
mechanism provides the alternative content. This alternative content may
be part of the original content, embedded within the media container as
'fallback content', or linked from the original content. The user is faced
with discovering the availability of alternative content.
Alternative content must be both discoverable by the user, and accessible
in device agnostic ways. The development of APIs and user-agent controls
should adhere to the following UAAG guidance:
The user agent can facilitate the discovery of alternative content by
following these criteria:
[DAC-1] The user has the ability to have indicators rendered along
with rendered elements that have alternative content (e.g., visual
icons rendered in proximity of content which has short text alternatives,
long descriptions, or captions). In cases where the alternative content
has different dimensions than the original content, the user has the
option to specify how the layout/reflow of the document should be handled.
(UAAG 2.0 1.8.7).
[DAC-2] The user has a global option to specify which types of alternative
content to render by default and, in cases where the alternative content has
different dimensions than the original content, how the layout/reflow
of the document should be handled. (UAAG 2.0 1.8.7).
[DAC-3] The user can browse the alternatives and switch between them.
[DAC-4] Synchronized alternatives for time-based media (e.g., captions,
descriptions, sign language) can be rendered at the same time as their
associated audio tracks and visual tracks (UAAG 2.0 2.11.4).
[DAC-5] Non-synchronized alternatives (e.g., short text alternatives,
long descriptions) can be rendered as replacements for the original
rendered content (UAAG 2.0 1.1.3).
[DAC-6] Provide the user with the global option to configure a cascade
of types of alternatives to render by default, in case a preferred
alternative content type is unavailable (UAAG 2.0 1.1.4).
[DAC-7] During time-based media playback, the user can determine
which tracks are available and select or deselect tracks (see the sketch
after this list). These selections may override global default settings
for captions, descriptions, etc.
[DAC-8] Provide the user with the option to load time-based media
content such that the first frame is displayed (if video), but the
content is not played until explicit user request. (UAAG 2.0 2.11.2)
[DAC-9] Provide the user with the option to record alternative content
along with the primary content on devices where recording is available.
This feature can be user configurable to allow maximum flexibility in trading off the
anticipated future need for the description against the amount of extra data storage required. A flexible
solution giving maximum control to the user would be to provide a global setting with the following options:
- Always record the alternative content (the best default option, since a resource recorded by one user may later be
accessed by another different user who may have different and unanticipated requirements);
- Record the alternative content only if it is active at the time of recording;
- Ask at recording time whether to record the alternative content;
- Never record the alternative content.
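Returning to [DAC-7], a track-selection menu built from a media element's text tracks might be sketched as follows; element ids are assumptions, and audio and video tracks would be handled analogously where supported.

```ts
// Sketch: list the available text tracks and let the user toggle each one.
const video = document.querySelector<HTMLVideoElement>("#program")!;
const menu = document.querySelector<HTMLElement>("#track-menu")!;

for (const track of Array.from(video.textTracks)) {
  const checkbox = document.createElement("input");
  checkbox.type = "checkbox";
  checkbox.checked = track.mode === "showing";
  checkbox.addEventListener("change", () => {
    track.mode = checkbox.checked ? "showing" : "disabled";
  });
  const item = document.createElement("label");
  item.append(checkbox, ` ${track.kind}: ${track.label || track.language || "untitled"}`);
  menu.append(item);
}
```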
4.6 Requirements on making properties available to the accessibility interface
Often forgotten in media systems, especially with the newer forms of packaging
such as DVD menus and on-screen program guides, is the fact that the user
needs to actually get to the content, control its playback, and turn on
any required accessibility options. For user agents supporting accessibility
APIs implemented for a platform, any media controls need to be connected
to that API.
On self-contained products that do not support assistive technology, any
menus in the content need to provide information in alternative formats
(e.g., talking menus). Products with a separate remote control, or that
are self-contained boxes, should ensure the physical design does not block
access, and should make accessibility controls, such as the closed-caption
toggle, as prominent as the volume or channel controls.
[API-1] The existence of alternative-content tracks for a media
resource must be exposed to the user agent.
[API-2] Since authors will need access to the alternative content
tracks, the structure needs to be exposed to authors as well, which
requires a dynamic interface.
[API-3] Accessibility APIs need to gain access to alternative content
tracks no matter whether those content tracks come from within a resource
or are combined through markup on the page.
4.7 Requirements on the use of the viewport
The video viewport plays a particularly important role with respect to
alternative-content technologies. Mostly it provides a bounding box for
many of the visually represented alternative-content technologies (e.g.,
captions, hierarchical navigation points, sign language), although some
alternative content does not rely on a viewport (e.g., full transcripts).
One key principle to remember when designing player ‘skins’ is that the
lower-third of the video may be needed for caption text. Caption consumers
rely on being able to make fast eye movements between the captions and
the video content. If the captions are in a non-standard place, this may
cause viewers to miss information. The use of this area for things such
as transport controls, while appealing aesthetically, may lead to accessibility problems.
[VP-1] It must be possible to deal with three different
cases for the relation between the viewport size, the position of media
and of alternative content:
- the alternative content's extent is specified in relation
to the media viewport (e.g., picture-in-picture video, lower-third captions)
- the alternative content has its own independent extent,
but is positioned in relation to the media viewport (e.g., captions
above the audio, sign-language video above the audio, navigation
points below the controls)
- the alternative content has its own independent extent and
doesn't need to be rendered in any relation to the media viewport
(e.g., text transcripts)
If alternative content has a different height or width than the media
content, then the user agent will reflow the (HTML) viewport. (UAAG 2.0)
This may create a need to provide an author hint to the web page
when embedding alternative content in order to instruct the web page how
to render the content: to scale with the media resource, scale independently,
or provide a position hint in relation to the media. On small devices
where the video takes up the full viewport, only limited rendering choices
may be possible, such that the UA may need to override author preferences.
[VP-2] The user can change the following characteristics
of visually rendered text content, overriding those specified by the
author or user-agent defaults (UAAG 2.0 1.4.1). (Note: this should
include captions and any text rendered in relation to media elements,
so as to be able to magnify and simplify rendered text):
- text scale (i.e., the general size of text),
- font family
- text color (i.e., foreground and background)
- letter spacing (tracking and kerning)
- line spacing (or line height), and
- word spacing.
This should be achievable through UA configuration, or even through
something like a Greasemonkey script or user CSS which can override styles
dynamically in the browser; a sketch follows at the end of this section.
[VP-3] Provide the user with the ability to adjust
the size of the time-based media up to the full height or width of
the containing viewport, with the ability to preserve aspect ratio
and to adjust the size of the playback viewport to avoid cropping,
within the scaling limitations imposed by the media itself. (UAAG 2.0)
This can be achieved by simply zooming into the web page, which
will automatically rescale the layout and reflow the content.
[VP-4] Provide the user with the ability to control
the contrast and brightness of the content within the playback viewport.
(UAAG 2.0 2.11.8)
This is a user-agent device requirement and should already be
addressed in the UAAG. In live content, it may even be possible to adjust
camera settings to achieve this requirement. It is also a "SHOULD" level
requirement, since it does not account for limitations of various devices.
[VP-5] Captions and subtitles traditionally occupy
the lower third of the video, where controls are also usually
rendered. The user agent must avoid overlapping of overlay content
and controls on media resources. This must also happen if, for example,
the controls are only visible on demand.
If there are several types of overlapping overlays, the controls
should stay on the bottom edge of the viewport and the others should
be moved above this area, all stacked above each other.
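As anticipated in the note to [VP-2] above, the following sketch injects a user style sheet at runtime that overrides author caption styling via the ::cue pseudo-element; the style values are examples, not recommendations.

```ts
// Sketch: user override of caption text styling for <track>-rendered cues.
const userStyle = document.createElement("style");
userStyle.textContent = `
  video::cue {
    font-family: sans-serif;
    font-size: 150%;
    color: yellow;
    background-color: rgba(0, 0, 0, 0.9);
  }
`;
document.head.append(userStyle);
```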
4.8 Requirements on secondary screens and other devices
Multiple secondary user devices must be directly addressable. This
functionality is increasingly also known by the new term, "Second Screen,"
even though there may be more than two screens in any given viewing
environment, and even though not all secondary devices are video displays. It
must be assumed that many users will have at least one additional display
device (such as a tablet), and/or at least one additional audio output device
(such as a Bluetooth headset) attached to a primary video display device, an
individual computer, or locally addressable on a LAN. It
must be possible to configure certain types of media for presentation on
specific devices, and these configuration settings must be readily overridable
on a case-by-case basis by users.
Systems supporting secondary devices must:
[SD-1] Support a platform-accessibility architecture relevant to
the operating environment. (UAAG 2.0 4.1.1)
[SD-2] Ensure accessibility of all user-interface components including
the user interface, rendered content, and alternative content; make
available the name, role, state, value, and description via a platform-accessibility
architecture. (UAAG 2.0 4.1.2)
[SD-3] If a feature is not supported by the accessibility architecture(s),
provide an equivalent feature that does support the accessibility architecture(s).
Document the equivalent feature in the conformance claim. (UAAG 2.0 4.1.3)
[SD-4] If the user agent implements one or more DOMs, they must
be made programmatically available to assistive technologies. (UAAG
2.0 4.1.4) This assumes the video element will write to the DOM.
[SD-5] If the user can modify the state or value of a piece of content
through the user interface (e.g., by checking a box or editing a text
area), the same degree of write access is available programmatically
(UAAG 2.0 4.1.5).
[SD-6] If any of the following properties are supported by the accessibility-platform
architecture, make the properties available to the accessibility-platform
architecture (UAAG 2.0 4.1.6):
- the bounding dimensions and coordinates of rendered graphical objects;
- font family;
- font size;
- text foreground color;
- text background color;
- change state/value notifications.
[SD-7] Ensure that programmatic exchanges between APIs proceed at
a rate such that users do not perceive a delay. (UAAG 2.0 4.1.7).
The following people contributed to the development of this document.
A.1 Participants in the PFWG and HTML Accessibility Task Force at the time of publication
- Jim Allan (Invited Expert, Texas School for the Blind)
- Christy Blew (Invited Expert, University of Illinois)
- David Bolter (Mozilla Foundation)
- Judy Brewer (W3C)
- Sally Cain (RNIB)
- Eric Carlson (Apple, Inc.)
- Wendy Chisholm (Microsoft Corporation)
- Michael Cooper (W3C/MIT)
- Paul Cotton (Microsoft Corporation)
- James Craig (Apple Inc.)
- Joanmarie Diggs (Igalia)
- Jean-Pierre Evain (European Broadcasting Union)
- Steve Faulkner (The Paciello Group)
- John Foliot (Invited Expert)
- Kelly Ford (Microsoft Corporation)
- Christopher Gallelo (Microsoft Corporation)
- Bryan Garaventa (SSB BART Group)
- Scott González (jQuery Foundation)
- Billy Gregory (The Paciello Group)
- Karl Groves (The Paciello Group)
- Jon Gunderson (Invited Expert, University of Illinois)
- Birkir Gunnarsson (Deque Systems, Inc.)
- Sean Hayes (Microsoft Corporation)
- Ian Hickson (Google, Inc.)
- Markus Gylling (DAISY Consortium)
- Mona Heath (Invited Expert, University of Illinois)
- Kenny Johar (Microsoft Corporation)
- Susann Keohane (IBM Corporation)
- Matthew King (IBM Corporation)
- Jason Kiss (Department of Internal Affairs, New Zealand Government)
- Masatomo Kobayashi (IBM Corporation)
- Philippe Le Hégaret (W3C)
- Bob Lund (Cable Television Laboratories Inc)
- David MacDonald (Invited Expert)
- Jatinder Mann (Microsoft Corporation)
- Dominic Mazzoni (Google, Inc.)
- Shane McCarron (Invited Expert, Aptest)
- Charles McCathieNevile (Yandex)
- Mary Jo Mueller (IBM Corporation)
- Jay Munro (Microsoft Corporation)
- James Nurthen (Oracle Corporation)
- Edward O'Connor (Apple, Inc.)
- Joseph Karr O'Connor (Invited Expert)
- Frank Oliver (Microsoft Corporation)
- Silvia Pfeiffer (National ICT Australia (NICTA) Ltd)
- Ian Pouncey (British Broadcasting Corporation)
- Adrian Roselli (Invited Expert)
- Sam Ruby (IBM Corporation)
- Mark Sadecki (W3C)
- Janina Sajka (Invited Expert, The Linux Foundation)
- Joseph Scheuhammer (Invited Expert, Inclusive Design Research Centre, OCAD University)
- Stefan Schnabel (SAP AG)
- Richard Schwerdtfeger (IBM Corporation)
- Lisa Seeman (Invited Expert)
- Cynthia Shelly (Microsoft Corporation)
- David Singer (Apple, Inc.)
- Michael Smith (W3C)
- Jeanne Spellman (W3C)
- Maciej Stachowiak (Apple, Inc.)
- Alexander Surkov (Mozilla Foundation)
- Suzanne Taylor (Pearson plc)
- Matthew Turvey (Invited Expert)
- Léonie Watson (The Paciello Group)
- Mark Watson (Netflix Inc.)
- Wu Wei (W3C / RITT)
- Marco Zehe (Mozilla Foundation)
- Gottfried Zimmermann (Invited Expert, Access Technologies Group)
A.2 Other previously active PFWG participants and contributors
Kazuyuki Ashimura (W3C), Simon Bates, Chris Blouch (AOL), Ben Caldwell (Trace), Charles Chen (Google, Inc.), Christian Cohrs, Dimitar Denev (Frauenhofer Gesellschaft), Donald Evans (AOL), Geoff Freed (Invited Expert, NCAM), Kentarou Fukuda (IBM Corporation), Becky Gibson (IBM), Alfred S. Gilman, Andres Gonzalez (Adobe Systems Inc.), Georgios Grigoriadis (SAP AG), Jeff Grimes (Oracle), Barbara Hartel, John Hrvatin (Microsoft Corporation), Masahiko Kaneko (Microsoft Corporation), Earl Johnson (Sun), Jael Kurz, Diego La Monica (International Webmasters Association / HTML Writers Guild (IWA-HWG)), Gez Lemon (International Webmasters Association / HTML Writers Guild (IWA-HWG)), Aaron Leventhal (IBM Corporation), Alex Li (SAP), Thomas Logan (HiSoftware Inc.), William Loughborough (Invited Expert), Linda Mao (Microsoft), Anders Markussen (Opera Software), Matthew May (Adobe Systems Inc.), Joshue O Connor (Invited Expert), Artur Ortega (Yahoo!, Inc.), Lisa Pappas (Society for Technical Communication (STC)), Dave Pawson (RNIB), David Poehlman, Simon Pieters (Opera Software), Sarah Pulis (Media Access Australia), T.V. Raman (Google, Inc.), Jan Richards (IDRC), Gregory Rosmaita (Invited Expert), Tony Ross (Microsoft Corporation), Martin Schaus (SAP AG), Marc Silbey (Microsoft Corporation), Henri Sivonen (Mozilla), Andi Snow-Weaver (IBM Corporation), Henny Swan (Opera Software), Vitaly Sourikov, Mike Squillace (IBM), Gregg Vanderheiden (Invited Expert, Trace), Ryan Williams (Oracle), Tom Wlodkowski.
A.3 Enabling funders
This publication has been funded in part with Federal funds from the U.S. Department of Education, National Institute on Disability, Independent Living, and Rehabilitation Research (NIDILRR) under contract number ED-OSE-10-C-0067. The content of this publication does not necessarily reflect the views or policies of the U.S. Department of Education, nor does mention of trade names, commercial products, or organizations imply endorsement by the U.S. Government.