This Wiki page is edited by participants of the HTML Accessibility Task Force. It does not necessarily represent consensus and it may have incorrect information or information that is not supported by other Task Force participants, WAI, or W3C. It may also have some very useful information.

Media Accessibility User Requirements

This document has become a deliverable of the Protocols and Formats Working Group. The public Working Draft of that deliverable is located in the Media Accessibility User Requirements Working Draft. Any changes made to this wiki version will not be reflected in the PFWG version unless explicitly arranged with the editors.

This document aggregates the requirements of an accessibility user that the W3C HTML5 Accessibility Task Force has collected with respect to audio and video on the Web.

It first gives background on the needs of users with sensory impairments, intended particularly as an introduction for people who have never had to consider such needs in relation to audio and video.

Then it explains what alternative content technologies have been developed to help such users gain access to the content of audio and video.

A third section explains how these content technologies fit in the larger picture of an accessibility system, both technically within a Web user agent and from a production process point of view.

This document is most explicitly not a collection of baseline user agent or authoring tool requirements. It is important to recognize that not all user agents (nor all authoring tools) will support all the features discussed in this document. Rather, this document attempts to supply a comprehensive collection of user requirements needed to support media accessibility in the context of HTML 5. As such, it should be expected that this document will continue to develop for some time.

Please also note this document is not an inventory of technology currently provided by, or missing from HTML 5 specification drafts. Technology listed here is here because it's important for accommodating the alternative access needs of users with disabilities to web-based media. This document is our inventory of Media Accessibility User Requirements.

Media Accessibility Checklist

The following User Requirements have also been distilled into a Media Accessibility Checklist, which can be found at:

Accessible Media Requirements by Type of Disability

Comprehension of media may be affected by loss of visual function, loss of audio function, or both. Cognitive disabilities may affect access to and/or comprehension of media. Physical disabilities such as dexterity impairment, loss of limbs, or loss of use of limbs may affect access to media. Once richer forms of media, such as virtual reality, become more commonplace, tactile issues may come into play. Control of the media player can be an important issue, e.g., for people with mobility problems; however, this is typically not addressed by the media formats themselves, but is a requirement of the technology used to build the player.

Editorial note: This section of "Media Accessibility User Requirements" may be edited to further align with the "Accessibility Barriers" section of How People with Disabilities Use the Web once that document is complete. It is provided as-is at this time in order to present general background for sections 2 and 3 of this document.


People who are blind cannot access information if it is presented only in the visual mode; they require information in an alternative representation, which typically means the audio mode, although information can also be presented as text. It is important to remember that not only the main video is inaccessible, but also any other visible ancillary information such as stock tickers, status indicators or other on-screen graphics, as well as any visual controls needed to operate the content. Since people who are blind use a screen reader and/or refreshable braille display, these assistive technologies (ATs) need to work hand-in-hand with the access mechanism provided for the media content.

Low vision

People with low vision can use some visual information, although they will have issues similar to those of people who are blind. Depending on their visual ability they might have specific issues such as difficulty discriminating foreground information from background information, or discriminating colors. Glare caused by excessive scattering in the eye can be a significant problem, especially for very bright content or surroundings. They may be unable to react quickly to transient information, and may have a narrow angle of view and so may not detect key information presented temporarily where they are not looking, or in text that is moving or scrolling. A person using a low-vision AT, such as a screen magnifier, will only be viewing a portion of the screen, and so must manage tracking media content via their AT. They may have difficulty reading when text is too small, has poor background contrast, or when outline or other fancy font types or effects are used. They may be using an AT that adjusts all the colors of the screen, such as inverting the colors, so the media content must be viewable through the AT.

Atypical color perception

A significant percentage of the population has atypical color perception, and may not be able to discriminate between different colors, or may miss key information when coded with color only.


People who are deaf generally cannot use audio. Thus, an alternative representation is required, typically through synchronized captions and/or sign translation.

Hard of hearing

People who are hard of hearing may be able to use some audio material, but might not be able to discriminate certain types of sound, and may miss any information presented as audio only if it contains frequencies they can't hear, or is masked by background noise or distortion. They may miss audio which is too quiet, or of poor quality. Speech may be problematic if it is too fast and cannot be played back more slowly. Information presented using multichannel audio (e.g., stereo) may not be perceived by people who are deaf in one ear.


Individuals who are deaf-blind have a combination of conditions that may result in one of the following: blindness and deafness; blindness and difficulty in hearing; low vision and deafness; or low vision and difficulty in hearing. Depending on their combination of conditions, individuals who are deaf-blind may need captions that can be enlarged, changed to high-contrast colors, or otherwise styled; or they may need captions and/or described video that can be presented with AT (e.g., a refreshable braille display). They may need synchronized captions and/or described video, or they may need a non-time-based transcript which they can read at their own pace.

Dexterity/mobility impairment

People with physical disabilities such as dexterity impairment, loss of limbs, or loss of use of limbs may use the keyboard alone rather than the combination of a pointing device plus keyboard to interact with content and controls, or may use a switch with an on-screen keyboard, or other assistive-technology access. The player itself must be usable via the keyboard and pointing devices. The user must have full access to all player controls, including methods for selecting alternative content.

Cognitive and neurological disabilities

Cognitive and neurological disabilities include a wide range of conditions that may include intellectual disabilities (called learning disabilities in some regions), autism-spectrum disorders, memory impairments, mental-health disabilities, attention-deficit disorders, audio- and/or visual-perceptive disorders, dyslexia and dyscalculia (called learning disabilities in other regions), or seizure disorders. Necessary accessibility supports vary widely for these different conditions. Individuals with some conditions may process information aurally better than by reading text; therefore, information that is presented as text embedded in a video should also be available as audio descriptions. Individuals with other conditions may need to reduce distractions or flashing in presentations of video. Some conditions such as autism-spectrum disorders may have multi-system effects and individuals may need a combination of different accommodations; see below for example.


Individuals with an autism-spectrum disorder are commonly impacted in the areas of communication, social interaction, and repetitive behaviors. They can have difficulty interpreting and expressing social communication, as well as difficulty shifting between context and activities. Therefore, a supplemental content track could be used to focus the individual’s attention on the key points of the media. For example, supplemental text could point out the key educational messages or plainly state the meaning of social interactions. Verbal communications could be broken down into the key messages; tone of voice could be interpreted; phrases of speech and communication styles such as sarcasm could be explained.

Individuals on the autism spectrum can be quite visual and learn effectively from social stories. A social story is a simple description of a social situation, such as an upcoming event, a social interaction, or a change in routine. A social story is commonly a series of pictures, supported by simple text to describe the actions, behavior, and outcomes. This technique could be carried over to media by providing a social story as alternative content. The media of the social story could be a combination of pictures and synchronized text or audio.

Overall, the media experience for people on the autism spectrum should be customizable and well designed so as to not be overwhelming. Care must be taken to present a media experience that focuses on the purpose of the content and provides alternative content in a clear, concise manner.

Alternative Content Technologies

A number of alternative content types have been developed to help users with sensory disabilities gain access to audio-visual content. This section lists them, explains generally what they are, and provides a number of requirements on each that need to be satisfied with technology developed in HTML5 around the media elements.

Described video

Described video contains descriptive narration of key visual elements designed to make visual media accessible to people who are blind or visually impaired. The descriptions include actions, costumes, gestures, scene changes or any other important visual information that someone who cannot see the screen might ordinarily miss. Descriptions are traditionally audio recordings timed and recorded to fit into natural pauses in the program, although they may also briefly obscure the main audio track. (See the section on extended descriptions for an alternative approach.) The descriptions are usually read by a narrator with a voice that cannot be easily confused with other voices in the primary audio track. They are authored to convey objective information (e.g., a yellow flower) rather than subjective judgments (e.g., a beautiful flower).

As with captions, descriptions can be open or closed.

  • Open descriptions are merged with the program-audio track and cannot be turned off by the viewer.
  • Closed descriptions can be turned on and off by the viewer. They can be recorded as a separate track containing descriptions only, timed to play at specific spots in the timeline and played in parallel with the program-audio track.
  • Some descriptions can be delivered as a separate audio channel mixed in at the player.
  • Other options include a computer-generated ‘text to speech’ track, also known as text video descriptions. This is described in the next subsection.

Described video provides benefits that reach beyond blind or visually impaired viewers; e.g., students grappling with difficult materials or concepts. Descriptions can be used to give supplemental information about what is on screen—the structure of lengthy mathematical equations or the intricacies of a painting, for example.

Described video is available on some television programs and in many movie theaters in the U.S. and other countries. Regulations in the U.S. and Europe are increasingly focusing on description, especially for television, reflecting its priority with citizens who have visual impairments. The technology needed to deliver and render basic video descriptions is in fact relatively straightforward, being an extension of common audio-processing solutions. Playback products must support multi-audio channels required for description, and any product dealing with broadcast TV content must provide adequate support for descriptions. Descriptions can also provide text that can be indexed and searched.


Systems supporting described video that are not open descriptions must:

  • (DV-1) Provide an indication that descriptions are available, and are active/non-active.
  • (DV-2) Render descriptions in a time-synchronized manner, using the media resource as the timebase master.
  • (DV-3) Support multiple description tracks (e.g., discrete tracks containing different levels of detail).
  • (DV-4) Support recordings of real human speech as a track of the media resource, or as an external file.
  • (DV-5) Allow the author to independently adjust the volumes of the audio description and original soundtracks.
  • (DV-6) Allow the user to independently adjust the volumes of the audio description and original soundtracks, with the user's settings overriding the author's.
  • (DV-7) Permit smooth changes in volume rather than stepped changes. The degree and speed of volume change should be under provider control.
  • (DV-8) Allow the author to provide fade and pan controls that are accurately synchronised with the original soundtrack.
  • (DV-9) Allow the author to use a codec which is optimised for voice only, rather than requiring the same codec as the original soundtrack.
  • (DV-10) Allow the user to select from among different languages of descriptions, if available, even if they are different from the language of the main soundtrack.
  • (DV-11) Support the simultaneous playback of both the described and non-described audio tracks so that each may be directed to a separate output (e.g., a speaker and headphones).
  • (DV-12) Provide a means to prevent descriptions from carrying over from one program or channel when the user switches to a different program or channel.
  • (DV-13) Allow the user to relocate the description track within the audio field, with the user setting overriding the author setting. The setting should be re-adjustable as the media plays.
  • (DV-14) Support metadata, such as copyright information, usage rights, language, etc.
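
As an illustration of (DV-5), (DV-6) and (DV-7), the following is a minimal sketch, not a description of any existing player. It assumes volumes on a 0.0 to 1.0 scale, and the names TrackVolume and ramp are hypothetical. It shows a user setting taking precedence over an author setting, and a volume change expressed as a series of small steps rather than a single jump.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrackVolume:
    """Volume for one audio track, on an assumed 0.0-1.0 scale."""
    author: float                 # level suggested by the author (DV-5)
    user: Optional[float] = None  # level chosen by the user, if any (DV-6)

    def effective(self) -> float:
        # The user's setting, when present, overrides the author's.
        return self.author if self.user is None else self.user


def ramp(start: float, end: float, steps: int) -> list[float]:
    """Return steps + 1 evenly spaced gain values from start to end (DV-7)."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return [start + (end - start) * i / steps for i in range(steps + 1)]


# Hypothetical settings: the user has lowered the original soundtrack only.
description = TrackVolume(author=0.8)
soundtrack = TrackVolume(author=0.5, user=0.2)

# Gain values a player might apply in turn to fade the soundtrack down.
fade = ramp(soundtrack.author, soundtrack.effective(), steps=3)
```

A linear ramp is only one possible curve; the requirement leaves the degree and speed of the change to the provider.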

Text video description

Described video that uses text for the description source rather than a recorded voice creates specific requirements.

Text video descriptions (TVDs) are delivered to the client as text and rendered locally by assistive technology such as a screen reader or a braille device. This can have advantages for screen-reader users who want full control of the preferred voice and speaking rate, or other options to control the speech synthesis.

Text video descriptions are provided as text files containing start times for each description cue. Since the duration that a screen reader takes to read out a description cannot be determined during authoring of the cues, it is difficult to ensure they don't obscure the main audio or other description cues. There are at least three likely reasons for this:

  • An author of text video descriptions may not have a screen reader, and so cannot check whether the description fits within the time frame. Even an author who has a screen reader cannot know the user's settings: a user's screen reader may be set to a different reading speed and may take longer to read the same sentence.
  • Some screen-reader users (e.g., those who are elderly or have learning disabilities) may slow down the speech rate.
  • A visually complicated scene (e.g., figures on a blackboard in an online physics class) may require more description time than is available in the program-audio track.


Systems supporting text video descriptions must:

  • (TVD-1) Support presentation of text video descriptions through a screen reader or braille device, with playback speed control and voice control and synchronisation points with the video.
  • (TVD-2) TVDs need to be provided in a format that contains the following information:
    • (A) start time, text per description cue (the duration is determined dynamically, though an end time could provide a cut point)
    • (B) possibly a speech-synthesis markup to improve quality of the description (existing speech synthesis markups include SSML and Speech CSS)
    • (C) accompanying metadata providing labeling for speakers, language, etc.
  • (TVD-3) Where possible, provide a text or separate audio track privately to those that need it in a mixed-viewing situation, e.g., through headphones.
  • (TVD-4) Where possible, provide options for authors and users to deal with the overflow case: continue reading, stop reading, and pause the video. (One solution from a user's point of view may be to pause the video and finish reading the TVD, for example.) User preference should override authored option.
  • (TVD-5) Support the control over speech-synthesis playback speed, volume and voice, and provide synchronisation points with the video.
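
The overflow case in (TVD-4) can be sketched as follows. This is an illustration under stated assumptions, not a prescribed algorithm: speech duration is estimated crudely from a word count and a words-per-minute rate, and the names Overflow, estimated_speech_seconds and plan_cue are hypothetical.

```python
from enum import Enum


class Overflow(Enum):
    """The three overflow options named in (TVD-4)."""
    CONTINUE = "continue reading"
    STOP = "stop reading"
    PAUSE = "pause the video"


def estimated_speech_seconds(text: str, words_per_minute: float) -> float:
    """Rough estimate: word count divided by the user's speaking rate."""
    return len(text.split()) * 60.0 / words_per_minute


def plan_cue(text, start, next_start, words_per_minute,
             author_option, user_option=None):
    """Return (overflows, action) for one description cue.

    start and next_start are media times in seconds; the time available
    is assumed to be the gap between this cue and the next one.
    """
    available = next_start - start
    needed = estimated_speech_seconds(text, words_per_minute)
    if needed <= available:
        return False, None
    # User preference overrides the authored option.
    action = user_option if user_option is not None else author_option
    return True, action
```

For instance, a ten-word cue takes an estimated 5 seconds at 120 words per minute, so it overflows a 3-second gap; at 300 words per minute the same cue takes an estimated 2 seconds and fits.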

Extended video descriptions

Video descriptions are usually provided as recorded speech, timed to play in the natural pauses in dialog or narration. In some types of material, however, there is not enough time to present sufficient descriptions. To meet such cases, the concept of extended description was developed. Extended descriptions work by pausing the video and program audio at key moments, playing a longer description than would normally be permitted, and then resuming playback when the description is finished playing. This will naturally extend the timeline of the entire presentation. This procedure has not been possible in broadcast television; however, hard-disk recording and on-demand Internet systems can make this a practical possibility.

Extended video description (EVD) has been reported to have benefits for people with cognitive disabilities; for example, it may benefit people with Asperger's syndrome and other autism-spectrum disorders, in that it can make connections between cause and effect, point out what is important to look at, or explain moods that might otherwise be missed.


Systems supporting extended video descriptions must:

  • (EVD-1) Support detailed user control as specified in (TVD-4) for extended video descriptions.
  • (EVD-2) Support automatically pausing the video and main audio tracks in order to play a lengthy description.
  • (EVD-3) Support resuming playback of video and main audio tracks when the description is finished.

Note that this is an advanced feature and would only be expected of advanced systems.
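
To illustrate how pausing for extended descriptions lengthens the presentation, here is a minimal sketch. It assumes each description is given as a pair of the media time at which playback pauses and the length of the description in seconds; the function name extended_timeline is hypothetical.

```python
def extended_timeline(duration, descriptions):
    """Compute when each extended description plays and the new total length.

    duration: length of the original media in seconds.
    descriptions: list of (media_time, description_length) pairs, where
    media_time is the point at which video and main audio are paused.
    Returns (schedule, total), where schedule lists
    (presentation_time, description_length) pairs on the extended timeline.
    """
    schedule = []
    added = 0.0  # seconds of pause accumulated so far
    for media_time, length in sorted(descriptions):
        if not 0 <= media_time <= duration:
            raise ValueError("description lies outside the media timeline")
        # Earlier pauses push later descriptions further along.
        schedule.append((media_time + added, length))
        added += length
    return schedule, duration + added
```

With a 60-second clip and descriptions of 5 and 8 seconds at media times 10 and 30, the second description starts at 35 seconds of presentation time and the whole presentation lasts 73 seconds.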

Clean audio

A relatively recent development in television accessibility is the concept of clean audio, which takes advantage of the increased adoption of multichannel audio. This is primarily aimed at audiences who are hard of hearing, and consists of isolating the audio channel containing the spoken dialog and important non-speech information that can then be amplified or otherwise modified, while other channels containing music or ambient sounds are attenuated.

Using the isolated audio track may make it possible to apply more sophisticated audio processing such as pre-emphasis filters, pitch-shifting, and so on to tailor the audio to the user's needs, since hearing loss is typically frequency-dependent, and the user may have usable hearing in some bands yet none at all in others.


Systems supporting clean audio and multiple audio tracks must:

  • (CA-1) Support clean audio as a separate, alternative audio track from other audio-based alternative media resources.
  • (CA-2) Support the synchronisation of multitrack audio either within the same file or from separate files - preferably both.
  • (CA-3) Support separate volume control of the different audio tracks.
  • (CA-4) Support pre-emphasis filters, pitch-shifting, and other audio-processing algorithms.

Content navigation by content structure

Most people are familiar with fast forward and rewind in media content. However, because they progress through content based only on time, fast forward and rewind are ineffective, particularly when the content is being used for purposes other than entertainment. People with disabilities are also particularly disadvantaged if forced to rely solely on time-based fast forward and rewind to study content.

Fortunately, most content is structured, and appropriate markup can expose this structure to forward and rewind controls:

  • Books generally have chapters and perhaps subsections within those chapters. They also have structures such as page numbers, side-bars, tables, footnotes, tables of contents, glossaries, etc.
  • Short music selections tend to have verses and repeating choruses.
  • Larger classical-music works have movements which are further dividable by component parts such as exposition, development and recapitulation, or theme and variations.
  • Operas, theatrical plays, and movies have acts and scenes within those acts.
  • Television programs generally have clear divisions; e.g., newscasts have individual stories usually wrapped within larger structures called news, weather, or sports.
  • A lecturer may first lay out a topic, then consider a series of approaches or illustrative examples, and finally draw a conclusion.

This is, of course, a DOM view of content. However, effective DOM-based navigation will require an additional control not typically available on current media players. This real-time control, which we are calling a "granularity-level control," will allow the user to adjust the level of granularity applied to "next" and "previous" controls. This is necessary because next and previous are too cumbersome if accessing every DOM element, but unsatisfactorily broad and coarse if set to only the top hierarchical DOM level. Allowing the user to adjust the DOM level that next and previous go to has proven very effective--hence the real-time granularity level control.

Two examples of granularity levels

1. In a news broadcast, the most global level (analogous to <h1>) might be the category called "news, weather, and sports." The second level (analogous to <h2>) would identify each individual news (or sports) story. With the granularity control set to level 1, "next" and "previous" would cycle among news, weather, and sports. Set at level 2, it would cycle among individual news (or sports) stories.

2. In a bilingual audiobook-plus-e-text production of Dante Alighieri's "La Divina Commedia," the user would choose whether to listen to the original medieval Italian or its modern-language translation--possibly toggling between them. Meanwhile, both the original and translated texts might appear on screen, with both the original and translated text highlighted, line by line, in sync with the audio narration.

  • The most global (<h1>) level would be each individual book-- "Inferno," "Purgatorio," and "Paradiso."
  • The second (<h2>) level would be each individual canto.
  • The third (<h3>) level would be each individual verso.
  • The fourth (<h4>) level would be each individual line of poetry.

With granularity set at level 1, "next" and "previous" would cycle among the three books of "La Divina Commedia." Set at level 2, they would cycle among its cantos, at level 3 among its versos, and at level 4 among the individual lines of poetry text.
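
The behaviour of a granularity-level control can be sketched as follows. This is an illustration only: it assumes the structure is available as a flat list of markers, each with a start time in seconds, a level (1 being the most global) and a title, and it assumes that "next" and "previous" stop only at markers of exactly the selected level. The names Marker, next_marker and previous_marker, and the times in the example data, are hypothetical.

```python
from typing import NamedTuple, Optional


class Marker(NamedTuple):
    time: float   # start time in seconds
    level: int    # 1 is the most global level
    title: str


def next_marker(markers, now, level) -> Optional[Marker]:
    """First marker of the chosen level that starts after the current time."""
    candidates = [m for m in markers if m.level == level and m.time > now]
    return min(candidates, key=lambda m: m.time, default=None)


def previous_marker(markers, now, level) -> Optional[Marker]:
    """Last marker of the chosen level that starts before the current time."""
    candidates = [m for m in markers if m.level == level and m.time < now]
    return max(candidates, key=lambda m: m.time, default=None)


# Hypothetical news broadcast, following the first example above.
MARKERS = [
    Marker(0, 1, "News"), Marker(0, 2, "Story A"), Marker(120, 2, "Story B"),
    Marker(300, 1, "Weather"), Marker(300, 2, "Forecast"),
    Marker(420, 1, "Sports"), Marker(420, 2, "Match report"),
]
```

Ten seconds into this broadcast, "next" at level 1 would move to "Weather", while "next" at level 2 would move to "Story B".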

Navigating ancillary content

There is a kind of structure, particularly in longer media resources, which requires special navigational consideration. While present in the media resource, it does not fit in the natural beginning-to-end progression of the resource. Its consumption tends to interrupt this natural beginning-to-end progression. A familiar example is a footnote or sidebar in a book. One must pause reading the text narrative to read a footnote or sidebar. Yet these structures are important and might require their own alternative media renditions. We have chosen to call such structures "ancillary content structures."

Commercials, news briefs, weather updates, etc., are familiar examples from television programming. While they are so prevalent that most of us may be inured to them, they do function to interrupt the primary television program. Users will want the ability to navigate past these ancillary structures--or perhaps directly to them.

E-text-plus-audio productions of titles such as "La Divina Commedia," described above, may well include reproductions of famous frescoes or paintings interspersed throughout the text, though these are not properly part of the text/content. Such illustrations must be programmatically discoverable by users. They also need to be described. However, the user needs the option of choosing when to pause for that interrupting description.

One current HTML 5 media-based example of ancillary content is the Mozilla Popcorn JavaScript library and API, which can be further explored with the following three resources:

Additional note

Media in HTML5 will be used heavily and broadly. These accessibility user requirements will often find broad applicability.

Just as the structures introduced particularly by nonfiction titles make books more usable, media is more usable when its inherent structure is exposed by markup. Markup-based access to structure is critical for persons with disabilities who cannot infer structure from purely presentational cues.

Structural navigation has proven highly effective in various programs of electronic book publication for persons with print disabilities. Nowadays, these programs are based on the ANSI/NISO Z39.86 specifications. Z39.86 structural navigation is also supported by e-publishing industry specifications.

The user can navigate along the timebase using a continuous scale, and by relative time units within rendered audio and animations (including video and animated images) that last three or more seconds at their default playback rate. (UAAG 2.0 4.9.6?)

The user can navigate by semantic structure within the time-based media, such as by chapters or scenes, if present in the media (UAAG 2.0 4.9.7).


Systems supporting content navigation must:

  • (CN-1) Provide a means to structure media resources so that users can navigate them by semantic content structure, e.g. through adding a track to the video that contains navigation markers (in table-of-content style). This means must allow authors to identify ancillary content structures, which may be a hierarchical structure. Support keeping all media representations synchronised when users navigate.
  • (CN-2) The navigation track should provide for hierarchical structures with titles for the sections.
  • (CN-3) Support both global navigation by the larger structural elements of a media work, and also the most localized atomic structures of that work, even though authors may not have marked-up all levels of navigational granularity.
  • (CN-4) Support third-party provided structural navigation markup.
  • (CN-5) Keep all content representations in sync, so that moving to any particular structural element in media content also moves to the corresponding point in all provided alternate media representations (captions, described video, transcripts, etc) associated with that work.
  • (CN-6) Support direct access to any structural element, possibly through URIs.
  • (CN-7) Support pausing primary content traversal to provide access to such ancillary content in line.
  • (CN-8) Support skipping of ancillary content in order to not interrupt content flow.
  • (CN-9) Support access to each ancillary content item, including with "next" and "previous" controls, apart from accessing the primary content of the title.
  • (CN-10) Support bilingual texts in which both the original and translated texts appear on screen, each highlighted line by line in sync with the audio narration.
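
Requirements (CN-8) and (CN-9) can be illustrated with a short sketch. It assumes that the author has flagged each segment of a media resource as either primary or ancillary; the names Segment, playback_order and ancillary_items are hypothetical and do not refer to any existing API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start: float            # seconds
    end: float              # seconds
    title: str
    ancillary: bool = False  # True for footnotes, commercials, and the like


def playback_order(segments, skip_ancillary):
    """Segments in timeline order, optionally leaving out ancillary ones (CN-8)."""
    ordered = sorted(segments, key=lambda s: s.start)
    if skip_ancillary:
        return [s for s in ordered if not s.ancillary]
    return ordered


def ancillary_items(segments):
    """Only the ancillary segments, for separate next/previous access (CN-9)."""
    return [s for s in sorted(segments, key=lambda s: s.start) if s.ancillary]
```

A player built along these lines could offer one pair of "next" and "previous" controls over the result of playback_order and a second pair over the result of ancillary_items.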


For people who are deaf or hard-of-hearing, captioning is a prime alternative representation of audio. Captions are in the same language as the main audio track and, in contrast to foreign-language subtitles, render a transcription of dialog or narration as well as important non-speech information, such as sound effects, music and laughter. Historically, captions have been either closed or open. Closed captions have been transmitted as data along with the video but were not visible until the user elected to turn them on, usually by invoking an on-screen control or menu selection. Open captions have always been visible; they had been merged with the video track and could not be turned off.

Ideally, captions should be a verbatim representation of the audio; however, captions are sometimes edited for various reasons-- for example, for reading speed or for language level. In general, consumers of captions have expressed that the text should represent exactly what is in the audio track. If edited captions are provided, then they should be clearly marked as such, and the full verbatim version should also be available as an option.

The timing of caption text can coincide with the mouth movement of the speaker (where visible), but this is not strictly necessary. For timing purposes, captions may sometimes precede or extend slightly after the audio they represent. Captioning should also use adequate means to distinguish between speakers as turn-taking occurs during conversation; this has in the past been done by positioning the text near the speaker, by associating different colors to different speakers, or by putting the name and a colon in front of the text line of a speaker.

Captions are useful to a wide array of users in addition to their originally intended audiences. Gyms, bars and restaurants regularly employ captions as a way for patrons to watch television while in those establishments. People learning to read or learning the language of the country where they live as a second language also benefit from captions: research has shown that captions help reinforce vocabulary and language. Captions can also provide a powerful search capability, allowing users and search engines to search the caption text to locate a specific video or an exact point in a video.


Formats for captions, subtitles or foreign-language subtitles must:

  • (CC-1) Render text in a time-synchronized manner, using the media resource as the timebase master.

NOTE: Most of the time, the main audio track would be the best candidate for the timebase. Where a video without audio, but with a text track, is available, the video track becomes the timebase master. Also, there may be situations where an explicit timing track is available.

  • (CC-2) Allow the author to specify erasures, i.e., times when no text is displayed on the screen (no text cues are active).

NOTE: This should be possible both within media resources and caption formats.

  • (CC-3) Allow the author to assign timestamps so that one caption/subtitle follows another, with no perceivable gap in between.

NOTE: This means that caption cues should be able to either let the start time of the subsequent cue be determined by the duration of the cue or have the end time be implied by the start of the next cue. For overlapping captions, explicit start and end times are then required.
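
One of the options in the note above, letting the end time of a cue be implied by the start of the next cue, can be sketched as follows. The sketch assumes cues are simple dictionaries with a "start" time, a "text", and an optional "end" time in seconds; this layout and the function name resolve_end_times are hypothetical and do not correspond to any particular caption format.

```python
def resolve_end_times(cues, media_duration):
    """Fill in missing end times so that consecutive cues leave no gap.

    A cue without an explicit "end" runs until the next cue starts, or
    until the end of the media if it is the last cue. Cues that carry an
    explicit "end" are left unchanged, which also allows overlaps and
    deliberate erasures (times when no cue is active).
    """
    ordered = sorted(cues, key=lambda c: c["start"])
    resolved = []
    for i, cue in enumerate(ordered):
        end = cue.get("end")
        if end is None:
            if i + 1 < len(ordered):
                end = ordered[i + 1]["start"]
            else:
                end = media_duration
        resolved.append({**cue, "end": end})
    return resolved
```

With cues starting at 0.0, 2.5 and 5.0 seconds, where only the second has an explicit end at 4.0, the first cue ends at 2.5 with no gap, and nothing is displayed between 4.0 and 5.0, which is an erasure in the sense of (CC-2).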

  • (CC-4) Be available in a text encoding.

NOTE: This means that the character encoding must be well defined, either by making the character encoding explicit or by enforcing a single default one such as UTF-8.

  • (CC-5) Support positioning in all parts of the screen - either inside the media viewport or in a determined space next to the media viewport. This is particularly important when multiple captions are on screen at the same time and relate to different speakers, or when in-picture text is avoided.

NOTE: The minimum requirement is a bounding box (with an optional background) into which text is flowed, and that probably needs to be pixel aligned. The absolute position of text within the bounding box is less critical, although it is important to be able to avoid bad word-breaks and have adequate white space around letters and so on. There is more on this in a separate requirement.

The caption format could provide a min-width/min-height for its bounding box, which is typically positioned relative to the bottom of the video viewport but can be placed elsewhere by the Web page. The Web page should also be able to make that box larger and scale the text proportionally. Positions inside the box should probably be expressed as regions, such as top, right, bottom, left, and center.

  • (CC-6) Support the display of multiple regions of text simultaneously.

NOTE: This typically relates to multiple text cues that are defined on overlapping times. If the cues' rendering targets are different spatial regions, they can be displayed simultaneously.

  • (CC-7) Display multiple rows of text when rendered as text in a right-to-left or left-to-right language.

NOTE: Internationalization is important not just for subtitles, as captions can be used in all languages.

  • (CC-8) Allow the author to specify line breaks.
  • (CC-9) Permit a range of font faces and sizes.
  • (CC-10) Render a background in a range of colors, supporting a full range of opacities.
  • (CC-11) Render text in a range of colors.

NOTE: The user should have final control over rendering styles like color and fonts; e.g., through user preferences.

  • (CC-12) Enable rendering of text with a thicker outline or a drop shadow to allow for better contrast with the background.
  • (CC-13) Where a background is used, it is preferable to keep the caption background visible even in times where no text is displayed, such that it minimises distraction. However, where captions are infrequent the background should be allowed to disappear to enable the user to see as much of the underlying video as possible.

NOTE: It may be technically possible to have cues without text.

  • (CC-14) Allow the use of mixed display styles (e.g., mixing paint-on captions with pop-on captions) within a single caption cue or in the caption stream as a whole. Pop-on captions are usually one or two lines of captions that appear on screen and remain visible for one to several seconds before they disappear. Paint-on captions are individual characters that are "painted on" from left to right, not popped onto the screen all at once, and usually are verbatim. Another caption style often used in live captioning is roll-up: here, cue text follows double chevrons ("greater than" symbols), which are used to indicate a change of speaker. Each sentence "rolls up" to about three lines. The top line of the three disappears as a new bottom line is added, allowing the continuous rolling up of new lines of captions.

NOTE: Similarly, in karaoke, individual characters are often "painted on".

  • (CC-15) Support positioning such that the lowest line of captions appears at least 1/12 of the total screen height above the bottom of the screen, when rendered as text in a right-to-left or left-to-right language.
  • (CC-16) Use conventions that include inserting left-to-right and right-to-left segments within a vertical run (e.g. Tate-chu-yoko in Japanese), when rendered as text in a top-to-bottom oriented language.
  • (CC-17) Represent content of different natural languages. In some cases a few foreign words form part of the original soundtrack and thus need to be in the same caption resource. Also allow for separate caption files for different languages and on-the-fly switching between them. This is also a requirement for subtitles.

NOTE: Caption/subtitle files that are alternatives in different languages are probably best provided in different caption resources and are user selectable. Realistically, having no more than 2 languages present at the same time on screen is probably the limit.

  • (CC-18) Represent content of at least those specific natural languages that may be represented with [Unicode 3.2], including common typographical conventions of that language (e.g., through the use of furigana and other forms of ruby text).
  • (CC-19) Present the full range of typographical glyphs, layout and punctuation marks normally associated with the natural language's print-writing system.
  • (CC-20) Permit in-line mark-up for foreign words or phrases.

NOTE: Italics markup may be sufficient for a human user, but it is important to be able to mark up languages so that the text can be rendered correctly, since the same Unicode can be shared between languages and rendered differently in different contexts. This is mainly an I18n issue. It is also important for audio rendering, to get pronunciation correct.

  • (CC-21) Permit the distinction between different speakers.
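
The timing rules in CC-2 and CC-3 can be illustrated with a minimal sketch. The `Cue` class and the `resolve_end_times` and `active_cues` functions are hypothetical names chosen for illustration; they are not part of any caption format or HTML API. The sketch assumes times in seconds on the media timeline and non-overlapping cues wherever an end time is left implicit.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cue:
    start: float                 # seconds on the media timeline
    text: str
    end: Optional[float] = None  # None means "implied by the next cue"


def resolve_end_times(cues: List[Cue], media_duration: float) -> List[Cue]:
    """Fill in missing end times from the start of the following cue (CC-3)."""
    ordered = sorted(cues, key=lambda c: c.start)
    resolved = []
    for i, cue in enumerate(ordered):
        end = cue.end
        if end is None:
            # The last cue runs to the end of the media; others end where the next begins.
            end = ordered[i + 1].start if i + 1 < len(ordered) else media_duration
        resolved.append(Cue(cue.start, cue.text, end))
    return resolved


def active_cues(cues: List[Cue], t: float) -> List[Cue]:
    """Return the cues active at time t; an empty list is an erasure (CC-2)."""
    return [c for c in cues if c.end is not None and c.start <= t < c.end]


cues = resolve_end_times(
    [Cue(0.0, "Hello."), Cue(2.5, "How are you?"), Cue(4.0, "Fine.", end=5.0)],
    media_duration=10.0,
)
```

In this example the first two cues follow each other with no gap, while the explicit end time on the third cue leaves the screen empty from 5.0 seconds onward. Overlapping cues would need explicit start and end times on every cue, as the NOTE on CC-3 states.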

Further, systems that support captions must:

  • (CC-22) Support captions that are provided inside media resources as tracks, or in external files.

NOTE: It is desirable to expose the same API to both.

  • (CC-23) Ascertain that captions are displayed in sync with the media resource.
  • (CC-24) Support user activation/deactivation of caption tracks.

NOTE: This requires a menu of some sort that displays the available tracks for activation/deactivation.

  • (CC-25) Support edited and verbatim captions, if available.

NOTE: Edited and verbatim captions can be provided in two different caption resources. There is a need to expose to the user how they differ, similar to how there can be caption tracks in different languages.

  • (CC-26) Support multiple tracks of foreign-language subtitles in different languages.

NOTE: These different-language "tracks" can be provided in different resources.

  • (CC-27) Support live-captioning functionality.
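
As a rough illustration of CC-22 and CC-24 to CC-26, the following sketch selects one caption or subtitle track from a mixed set of in-band and external tracks. `TextTrack` here is a hypothetical record defined only for this example (it is not the HTML interface of the same name), and the field values are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TextTrack:
    kind: str      # "captions" or "subtitles"
    language: str  # language tag such as "en" or "fr"
    style: str     # "verbatim" or "edited"
    source: str    # "in-band" or "external"; treated identically (CC-22)


def choose_track(
    tracks: List[TextTrack],
    preferred_languages: List[str],
    preferred_style: str = "verbatim",
) -> Optional[TextTrack]:
    """Pick a track by the user's language order, preferring the given style."""
    for language in preferred_languages:
        candidates = [t for t in tracks if t.language == language]
        if not candidates:
            continue
        for track in candidates:
            if track.style == preferred_style:
                return track
        return candidates[0]  # the language matches but the style does not
    return None               # nothing suitable; captions stay off


tracks = [
    TextTrack("captions", "en", "edited", "external"),
    TextTrack("captions", "en", "verbatim", "in-band"),
    TextTrack("subtitles", "fr", "verbatim", "external"),
]
```

A track menu would list all entries of `tracks` for manual activation and deactivation; `choose_track` only models the default selection.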

Enhanced captions/subtitles

Enhanced captions are timed text cues that have been enriched with further information - examples are glossary definitions for acronyms and other initialisms, foreign terms (for example, Latin), jargon or descriptions for other difficult language. They may be age-graded, so that multiple caption tracks are supplied, or the glossary function may be added dynamically through machine lookup.

Glossary information can be added in the normal time allotted for the cue (e.g. as a callout or other overlay), or it might take the form of a hyperlink that, when activated, pauses the main content and allows access to more complete explanatory material.

Such extensions can provide important additional information to the content that will enable or improve the understanding of the main content to accessibility users. Enhanced text cues will be particularly useful for those with restricted reading skills, to subtitle users, and to caption users. Users may often come across keywords in text cues that lend themselves to further in-depth information or hyperlinks, such as an e-mail contact or phone number for a person, a strange term that needs a Wikipedia link for definition, or an idiom that needs comments to explain it to a foreign-language speaker.


Systems that support enhanced captions must:

  • (ECC-1) Support metadata markup for (sections of) timed text cues.

NOTE: Such "metadata" markup can be realised through a @title attribute on a <span> of the text, or a hyperlink to another location where a term is explained, an <abbr> element, an <acronym> element, a <dfn> element, or through RDFa or microdata.

  • (ECC-2) Support hyperlinks and other activation mechanisms for supplementary data for (sections of) caption text.

NOTE: This can be realised through inclusion of <a> elements or buttons into timed text cues, where additional overlays could be created or a different page be loaded. One needs to deal here with the need to pause the media timeline for reading of the additional information.

  • (ECC-3) Support text cues that may be longer than the time available until the next text cue and thus overlap with it. In this case, a feature should be provided to decide whether the overlap is acceptable, whether the cue should be cut short, or whether the media resource should be paused while the caption is displayed. Timing would be provided by the author, with the user able to override it.

NOTE: This feature is analogous to extended video descriptions - where timing for a text cue is longer than the available time for the cue, it may be necessary to halt the media to allow for more time to read back on the text and its additional material. In this case, the pause is dependent on the user's reading speed, so this may imply user control or timeouts.

  • (ECC-4) It needs to be possible to define both timed text cues that are allowed to overlap with each other in time and be present on screen at the same time (e.g., those that come from the speech of different speakers), and cues that are not allowed to overlap and thus cause media playback to pause so that users can catch up with their reading.

NOTE: This could be realised through a hint on the text cue or even for a whole track.

  • (ECC-5) Allow users to define the reading speed and thus define how long each text cue requires, and whether media playback needs to pause sometimes to let them catch up on their reading.

NOTE: This can be a setting in the UA, which will define user-interface behavior.
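
The relationship between reading speed and playback pauses in ECC-3 and ECC-5 can be sketched as follows. The function names and the simple word-count model are assumptions made for illustration; a real user agent might weigh characters, line length, or language differently.

```python
def reading_time(text: str, words_per_minute: float) -> float:
    """Estimate the seconds a user needs to read a cue at their chosen speed."""
    if words_per_minute <= 0:
        raise ValueError("words_per_minute must be positive")
    return len(text.split()) * 60.0 / words_per_minute


def extra_pause(text: str, cue_duration: float, words_per_minute: float) -> float:
    """Seconds the media would need to pause so the cue can be read in full."""
    return max(0.0, reading_time(text, words_per_minute) - cue_duration)
```

For a six-word cue shown for 2 seconds, a user reading at 120 words per minute would need a 1-second pause, while a user reading at 180 words per minute would need none.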

Sign translation

Sign language shares the same concept as captioning: it presents both speech and non-speech information in an alternative format. Note that due to the wide regional variation in signing systems (e.g., American Sign Language vs British Sign Language), sign translation may not be appropriate for content with a global audience unless localized variants can be made available.

Signing can be open (mixed with the video and offered as an entirely alternate stream) or closed (using some form of picture-in-picture or alpha-blending technology). It is possible to use quite low bit rates for much of the signing track, but it is important that facial, arm, hand and other body gestures be delivered at sufficient resolution to support legibility. Animated avatars may not currently be sufficient as a substitute for human signers, although research continues in this area and it may become practical at some point in the future.

Acknowledging that not all devices will be capable of handling multiple video streams, this is a SHOULD requirement for browsers where hardware is capable of support. Strong authoring guidance for content creators will mitigate situations where user-agents are unable to support multiple video streams (WCAG) - for example, on mobile devices that cannot support multiple streams, authors should be encouraged to offer two versions of the media stream, including one with signed captions burned into the media.

Selecting from multiple tracks for different sign languages should be achieved in the same fashion that multiple caption/subtitle files are handled.


Systems supporting sign language must:

  • (SL-1) Support sign-language video either as a track as part of a media resource or as an external file.
  • (SL-2) Support the synchronized playback of the sign-language video with the media resource.
  • (SL-3) Support the display of sign-language video either as picture-in-picture or alpha-blended overlay, as parallel video, or as the main video with the original video as picture-in-picture or alpha-blended overlay. Parallel video here means two discrete videos playing in sync with each other. It is preferable to have one discrete <video> element contain all pieces for sync purposes rather than specifying multiple <video> elements intended to work in sync.
  • (SL-4) Support multiple sign-language tracks in several sign languages.
  • (SL-5) Support the interactive activation/deactivation of a sign-language track by the user.
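
For the picture-in-picture case in SL-3, a minimal geometry sketch is given below. The function name, the default width fraction, the margin, and the corner names are all assumptions for illustration; the requirements do not prescribe any particular size or position.

```python
from typing import Tuple


def pip_rect(
    viewport_w: int,
    viewport_h: int,
    aspect_ratio: float,
    width_fraction: float = 0.25,
    margin: int = 10,
    corner: str = "bottom-right",
) -> Tuple[int, int, int, int]:
    """Return (x, y, width, height) of a sign-language overlay in pixels."""
    if corner not in ("top-left", "top-right", "bottom-left", "bottom-right"):
        raise ValueError("unknown corner: " + corner)
    w = round(viewport_w * width_fraction)
    h = round(w / aspect_ratio)  # preserve the signer video's aspect ratio
    x = margin if corner.endswith("left") else viewport_w - w - margin
    y = margin if corner.startswith("top") else viewport_h - h - margin
    return (x, y, w, h)
```

A player that also shows captions in the lower third would probably choose a top corner or shift the overlay upward, so that the signer and the captions do not cover each other.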


While synchronized captions are generally preferable for people with hearing impairments, for some users they are not viable – those who are deaf-blind, for example, or those with cognitive or reading impairments that make it impossible to follow synchronized captions. And even with ordinary captions, it is possible to miss some information as the captions and the video require two separate loci of attention. The full transcript supports different user needs and is not a replacement for captioning. A transcript can be presented simultaneously with the media material, which can assist slower readers or those who need more time to reference context, but it should also be made available independently of the media.

A full text transcript should include information that would be in both the caption and video description, so that it is a complete representation of the material, as well as containing any interactive options.


Systems supporting transcripts must:

  • (T-1) Support the provisioning of a full text transcript for the media asset in a separate but linked resource, where the linkage is programmatically accessible to AT.
  • (T-2) Support the provisioning of both scrolling and static display of a full text transcript with the media resource, e.g. in an area next to the video or underneath the video, which is also AT accessible.
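
A scrolling transcript display as described in T-2 needs to know which transcript segment corresponds to the current playback position. The sketch below shows one possible lookup using a binary search over segment start times; the function name and the assumption that each segment carries a start time in seconds are illustrative only.

```python
from bisect import bisect_right
from typing import List, Optional


def current_segment_index(start_times: List[float], t: float) -> Optional[int]:
    """Index of the segment to highlight at media time t.

    start_times must be sorted in ascending order. Returns None when t is
    earlier than the first segment.
    """
    i = bisect_right(start_times, t) - 1
    return i if i >= 0 else None


segment_starts = [0.0, 4.2, 9.5]
```

A static transcript needs no such lookup, which is one reason it remains usable independently of the media.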

System Requirements

Access to interactive controls / menus

Media elements offer a rich set of interaction possibilities to users. These interaction possibilities must be available to all users, including those that cannot use a pointer device for interaction. Further, these interaction possibilities must be available to all users for all means in which the controls are exposed - no matter whether they are exposed by the user agent, or are scripted. Further, the interaction possibilities need to be rich enough to allow all users fine grained control over media playback.

It is imperative that controls be device independent, so that control may be achieved by keyboard, pointing device, speech, etc.


Systems supporting keyboard accessibility must:

  • (KA-1) Support operation of all functionality via the keyboard on systems where a keyboard is (or can be) present, and where a unique focus object is employed. This does not forbid and should not discourage providing mouse input or other input methods in addition to keyboard operation. (UAAG 2.0 4.1.1)

NOTE: This means that all interaction possibilities with media elements need to be keyboard accessible; e.g., through being able to tab onto the play, pause, mute buttons, and to move the playback position from the keyboard.

  • (KA-2) Support a rich set of native controls for media operation, including but not limited to play, pause, stop, jump to beginning, jump to end, scale player size (up to full screen), adjust volume, mute, captions on/off, descriptions on/off, selection of audio language, selection of caption language, selection of audio description language, location of captions, size of captions, video contrast/brightness, playback rate, content navigation on same level (next/prev) and between levels (up/down) etc. This is also a particularly important requirement on mobile devices or devices without a keyboard.

NOTE: This means that the @controls content attribute needs to provide an extended set of control functionality including functionality for accessibility users.

  • (KA-3) All functionality available to native controls must also be available to scripted controls. The author would be able to choose any/all of the controls, skin them and position them.

NOTE: This means that new IDL attributes need to be added to the media elements for the extra controls that are accessibility related.

  • (KA-4) It must always be possible to enable native controls regardless of the author preference to guarantee that such functionality is available and essentially override author settings through user control. This is also a particularly important requirement on mobile devices or devices without a keyboard.

NOTE: This could be enabled through a context menu, which is keyboard accessible and its keyboard access cannot be turned off.

  • (KA-5) The scripted and native controls must go through the same platform-level accessibility framework (where it exists), so that a user presented with the scripted version is not shut out from some expected behaviour.

NOTE: This is below the level of HTML and means that the accessibility platform needs to be extended to allow access to these controls.

  • (KA-6) Autoplay on media elements is a particularly difficult issue to manage for vision-impaired users, since the mouse allows other users to stop an auto-playing element on a page with a single interaction. Therefore, autoplay state needs to be exposed to the platform-level accessibility framework. The vision-impaired user must be able to stop autoplay either generally on all media elements through a setting, or for particular pages through a single keyboard user interaction.

NOTE: This could be enabled through encouraging publishers to use @autoplay, encouraging UAs to implement accessibility settings that allow users to turn off all autoplay, and encouraging AT to implement a shortcut key to stop all autoplay on a Web page.
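
The keyboard operation in KA-1 and the single-interaction autoplay stop in KA-6 can be sketched with a small model. The `Player` class, the key names, and the bindings are hypothetical; actual bindings are a matter for user agents and assistive technology.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Player:
    playing: bool = False
    muted: bool = False
    position: float = 0.0  # seconds


# Hypothetical key names and bindings, for illustration only.
KEY_BINDINGS = {
    "space": "toggle_play",
    "m": "toggle_mute",
    "left": "seek_back",
    "right": "seek_forward",
}


def handle_key(player: Player, key: str, seek_step: float = 5.0) -> bool:
    """Apply the action bound to key; return False if the key is unbound."""
    action = KEY_BINDINGS.get(key)
    if action == "toggle_play":
        player.playing = not player.playing
    elif action == "toggle_mute":
        player.muted = not player.muted
    elif action == "seek_back":
        player.position = max(0.0, player.position - seek_step)
    elif action == "seek_forward":
        player.position += seek_step
    else:
        return False
    return True


def stop_all_autoplay(players: List[Player]) -> int:
    """Pause every playing element on a page; return how many were stopped."""
    stopped = 0
    for p in players:
        if p.playing:
            p.playing = False
            stopped += 1
    return stopped
```

The same `handle_key` dispatch could serve native and scripted controls alike, which is the intent behind KA-3 and KA-5.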

Granularity level control for structural navigation

As explained in "Content Navigation" above, a real-time control mechanism must be provided for adjusting the granularity of the structural navigation points reached by next and previous. Users must be able to set the range/scope of next and previous in real time.


  • (CNS-1) All identified structures, including ancillary content as defined in "Content Navigation" above, must be accessible with the use of "next" and "previous," as refined by the granularity control.
  • (CNS-2) Users must be able to discover, skip, play-in-line, or directly access ancillary content structures.
  • (CNS-3) Users need to be able to access the granularity control using any input mode, e.g. keyboard, speech, pointer, etc.
  • (CNS-4) Producers and authors may optionally provide additional access options to identified structures, such as direct access to any node in a table of contents.
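
One way to picture the granularity control in CNS-1 is as a filter on a flat list of navigation points, each tagged with a structural level. The tuple layout, the convention that level 1 is the coarsest, and the function names below are assumptions for this sketch.

```python
from typing import List, Optional, Tuple

# (start time in seconds, structural level, label); level 1 is the coarsest.
NavPoint = Tuple[float, int, str]


def next_point(points: List[NavPoint], t: float, granularity: int) -> Optional[NavPoint]:
    """First point after time t whose level is within the chosen granularity."""
    for point in points:  # points are assumed sorted by start time
        if point[0] > t and point[1] <= granularity:
            return point
    return None


def previous_point(points: List[NavPoint], t: float, granularity: int) -> Optional[NavPoint]:
    """Last point before time t whose level is within the chosen granularity."""
    for point in reversed(points):
        if point[0] < t and point[1] <= granularity:
            return point
    return None


points = [
    (0.0, 1, "Chapter 1"),
    (30.0, 2, "Section 1.1"),
    (45.0, 3, "Footnote"),
    (60.0, 2, "Section 1.2"),
    (120.0, 1, "Chapter 2"),
]
```

With granularity 1, "next" from 10 seconds jumps to Chapter 2; with granularity 2 it stops at Section 1.1, and with granularity 3 ancillary content such as the footnote also becomes reachable.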

Time-scale modification

While not all devices may support the capability, a standard control API must support the ability to speed up or slow down content presentation without altering audio pitch.

NOTE: While perhaps unfamiliar to some, this feature has been present on many devices, especially audiobook players, for some 20 years now.


The user can adjust the playback rate of prerecorded time-based media content, such that all of the following are true (UAAG 2.0 4.9.5):

  • (TSM-1) The user can adjust the playback rate of the time-based media tracks to between 50% and 250% of real time.
  • (TSM-2) Speech whose playback rate has been adjusted by the user maintains pitch in order to limit degradation of the speech quality.
  • (TSM-3) All provided alternative media tracks remain synchronized across this required range of playback rates.
  • (TSM-4) The user agent provides a function that resets the playback rate to normal (100%).
  • (TSM-5) The user can stop, pause, and resume rendered audio and animation content (including video and animated images) that last three or more seconds at their default playback rate. (UAAG 2.0 4.9.6)
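
The rate limits in TSM-1 and the reset in TSM-4 can be expressed in a few lines. The constant and function names are hypothetical, and the sketch deliberately leaves out pitch preservation (TSM-2), which requires audio signal processing rather than timeline arithmetic.

```python
MIN_RATE = 0.5     # 50% of real time (TSM-1)
MAX_RATE = 2.5     # 250% of real time (TSM-1)
NORMAL_RATE = 1.0  # value restored by the reset function (TSM-4)


def clamp_rate(requested: float) -> float:
    """Limit a requested playback rate to the range required by TSM-1."""
    return min(MAX_RATE, max(MIN_RATE, requested))


def wall_clock_duration(media_seconds: float, rate: float) -> float:
    """Real time needed to play a span of the media timeline at the given rate."""
    return media_seconds / clamp_rate(rate)
```

If all alternative tracks are scheduled against the media timeline rather than the wall clock, they stay synchronized at any rate (TSM-3), because the rate only changes how quickly that shared timeline advances.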

Production practice and resulting requirements

One of the biggest problems to date has been the lack of a universal system for media access. In response to user requirements various countries and groups have defined systems to provide accessibility, especially captioning for television. However these systems are typically not compatible. In some cases the formats can be inter-converted, but some formats – for example DVD sub-pictures – are image based and are difficult to convert to text.

Caption formats are often geared towards delivery of the media, for example as part of a television broadcast. They are not well suited to the production phases of media creation. Media creators have developed their own internal formats which are more amenable to the editing phase, but to date there has been no common format that allows interchange of this data.

Any media based solution should attempt to reduce as far as possible layers of translation between production and delivery.

In general captioners use a proprietary workstation to prepare caption files; these can often export to various standard broadcast ingest formats, but in general files are not inter-convertible. Most video editing suites are not set up to preserve captioning, and so this typically has to be added after the final edit is decided on; furthermore, since this work is often outsourced, the copyright holder may not hold the final editable version of the captions. Thus when programming is later re-purposed, e.g. a shorter edit is made, or a ‘director's cut’ produced, the captioning may have to be redone in its entirety. Similarly, and particularly for news footage, parts of the media may go to web before the final TV edit is made, and thus the captions that are produced for the final TV edit are not available for the web version.

It is important when purchasing or commissioning media that captioning and described video are taken into account and given equal priority in terms of ownership, rights of use, etc., as the video and audio itself.

This is primarily an authoring requirement. It is understood that a common time-stamp format must be declared in HTML5, so that authoring tools can conform to a required output.


Systems supporting accessibility needs for media must:

  • (PP-1) Support existing production practice for alternative content resources, in particular allow for the association of separate alternative content resources to media resources. Browsers cannot support all forms of time-stamp formats out there, just as they cannot support all forms of image formats (etc.). This necessitates a clear and unambiguous declared format, so that existing authoring tools can be configured to export finished files in the required format.
  • (PP-2) Support the association of authoring and rights metadata with alternative content resources, including copyright and usage information.
  • (PP-3) Support the simple replacement of alternative content resources even after publishing. This is again dependent on authoring practice - if the content creator delivers a final media file that contains related accessibility content inside the media wrapper (for example an MP4 file), then it will require an appropriate third-party authoring tool to make changes to that file - it cannot be demanded of the browser to do so.
  • (PP-4) Typically, alternative content resources are created by different entities to the ones that create the media content. They may even be in different countries and not be allowed to re-publish the other one's content. It is important to be able to host these resources separately, associate them together through the Web page author, and eventually play them back synchronously to the user.

Discovery and activation/deactivation of available alternative content by the user

As described above, individuals need a variety of media (alternative content) in order to perceive and understand the content. The author or some Web mechanism provides the alternative content. This alternative content may be part of the original content, embedded within the media container as 'fallback content', or linked from the original content. The user is faced with discovering the availability of alternative content.

Alternative content must be both discoverable by the user, and accessible in device agnostic ways. The development of APIs and user-agent controls should adhere to the following UAAG guidance:


The user agent can facilitate the discovery of alternative content by following these criteria:

  • (DAC-1) The user has the ability to have indicators rendered along with rendered elements that have alternative content (e.g., visual icons rendered in proximity of content which has short text alternatives, long descriptions, or captions). In cases where the alternative content has different dimensions than the original content, the user has the option to specify how the layout/reflow of the document should be handled. (UAAG 2.0 3.1.1).
  • (DAC-2) The user has a global option to specify which types of alternative content to render by default and, in cases where the alternative content has different dimensions than the original content, how the layout/reflow of the document should be handled. (UAAG 2.0 3.1.2).
  • (DAC-3) The user can browse the alternatives and switch between them.
  • (DAC-4) Synchronized alternatives for time-based media (e.g., captions, descriptions, sign language) can be rendered at the same time as their associated audio tracks and visual tracks (UAAG 2.0 3.1.3).
  • (DAC-5) Non-synchronized alternatives (e.g., short text alternatives, long descriptions) can be rendered as replacements for the original rendered content (UAAG 2.0 3.1.3).
  • (DAC-6) Provide the user with the global option to configure a cascade of types of alternatives to render by default, in case a preferred alternative content type is unavailable (UAAG 2.0 3.1.4).
  • (DAC-7) During time-based media playback, the user can determine which tracks are available and select or deselect tracks. These selections may override global default settings for captions, descriptions, etc. (UAAG 2.0 4.9.8)
  • (DAC-8) Provide the user with the option to load time-based media content such that the first frame is displayed (if video), but the content is not played until explicit user request. (UAAG 2.0 4.9.2)
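
The cascade of alternatives in DAC-6 amounts to walking an ordered preference list until an available type is found. The function name and the type labels below are assumptions for this sketch.

```python
from typing import Iterable, List, Optional


def pick_alternative(available: Iterable[str], cascade: List[str]) -> Optional[str]:
    """Return the first type in the user's cascade that the content offers."""
    offered = set(available)
    for kind in cascade:
        if kind in offered:
            return kind
    return None  # no preferred alternative exists for this content


user_cascade = ["sign-language", "captions", "transcript"]
```

A per-playback selection by the user (DAC-7) would simply take precedence over the value this function returns.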

Requirements on making properties available to the accessibility interface

Often forgotten in media systems, especially with the newer forms of packaging such as DVD menus and on-screen program guides, is the fact that the user needs to actually get to the content, control its playback, and turn on any required accessibility options. For user agents supporting accessibility APIs implemented for a platform, any media controls need to be connected to that API.

On self-contained products that do not support assistive technology, any menus in the content need to provide information in alternative formats (e.g., talking menus). Products with a separate remote control, or that are self-contained boxes, should ensure the physical design does not block access, and should make accessibility controls, such as the closed-caption toggle, as prominent as the volume or channel controls.


  • (API-1) The existence of alternative-content tracks for a media resource must be exposed to the user agent.
  • (API-2) Since authors will need access to the alternative content tracks, the structure needs to be exposed to authors as well, which requires a dynamic interface.
  • (API-3) Accessibility APIs need to gain access to alternative content tracks no matter whether those content tracks come from within a resource or are combined through markup on the page.

Requirements on the use of the viewport

The video viewport plays a particularly important role with respect to alternative-content technologies. Mostly it provides a bounding box for many of the visually represented alternative-content technologies (e.g., captions, hierarchical navigation points, sign language), although some alternative content does not rely on a viewport (e.g. full transcripts, descriptive video).

One key principle to remember when designing player ‘skins’ is that the lower-third of the video may be needed for caption text. Caption consumers rely on being able to make fast eye movements between the captions and the video content. If the captions are in a non-standard place, this may cause viewers to miss information. The use of this area for things such as transport controls, while appealing aesthetically, may lead to accessibility conflicts.


  • (VP-1) It must be possible to deal with three different cases for the relation between the viewport size, the position of media and of alternative content:
    • (a) the alternative content's extent is specified in relation to the media viewport (e.g., picture-in-picture video, lower-third captions)
    • (b) the alternative content has its own independent extent, but is positioned in relation to the media viewport (e.g., captions above the audio, sign-language video above the audio, navigation points below the controls)
    • (c) the alternative content has its own independent extent and doesn't need to be rendered in any relation to the media viewport (e.g., text transcripts)

If alternative content has a different height or width than the media content, then the user agent will reflow the (HTML) viewport. (UAAG 2.0 3.1.4).

NOTE: This may create a need to provide an author hint to the Web page when embedding alternate content in order to instruct the Web page how to render the content: to scale with the media resource, scale independently, or provide a position hint in relation to the media. On small devices where the video takes up the full viewport, only limited rendering choices may be possible, such that the UA may need to override author preferences.

  • (VP-2) The user can change the following characteristics of visually rendered text content, overriding those specified by the author or user-agent defaults (UAAG 2.0 3.6.1). (Note: this should include captions and any text rendered in relation to media elements, so as to be able to magnify and simplify rendered text):
    • (a) text scale (i.e., the general size of text) ,
    • (b) font family, and
    • (c) text color (i.e., foreground and background).

NOTE: This should be achievable through UA configuration or even through something like a greasemonkey script or user CSS which can override styles dynamically in the browser.

  • (VP-3) Provide the user with the ability to adjust the size of the time-based media up to the full height or width of the containing viewport, with the ability to preserve aspect ratio and to adjust the size of the playback viewport to avoid cropping, within the scaling limitations imposed by the media itself. (UAAG 2.0 4.9.9)

NOTE: This can be achieved by simply zooming into the Web page, which will automatically rescale the layout and reflow the content.

  • (VP-4) Provide the user with the ability to control the contrast and brightness of the content within the playback viewport. (UAAG 2.0 4.9.11)

NOTE: This is a user-agent device requirement and should already be addressed in the UAAG. In live content, it may even be possible to adjust camera settings to achieve this requirement. It is also a "SHOULD" level requirement, since it does not account for limitations of various devices.

  • (VP-5) Captions and subtitles traditionally occupy the lower third of the video, where controls are also usually rendered. The user agent must avoid overlapping of overlay content and controls on media resources. This must also happen if, for example, the controls are only visible on demand.

NOTE: If there are several types of overlapping overlays, the controls should stay on the bottom edge of the viewport and the others should be moved above this area, all stacked above each other.
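
The stacking behaviour suggested in the NOTE on VP-5 can be sketched as a simple layout calculation. The function name and the pixel-based coordinate system (origin at the top of the viewport) are assumptions for illustration.

```python
from typing import List, Tuple


def stack_overlays(
    viewport_height: int, controls_height: int, overlay_heights: List[int]
) -> Tuple[int, List[int]]:
    """Return (controls_top, overlay_tops) with overlays stacked above the controls.

    Coordinates are pixels measured from the top of the viewport. The first
    overlay in overlay_heights sits directly above the controls.
    """
    controls_top = viewport_height - controls_height
    tops = []
    bottom = controls_top
    for height in overlay_heights:
        top = bottom - height
        tops.append(top)
        bottom = top  # the next overlay is placed above this one
    return controls_top, tops
```

When the controls are hidden, calling the function with `controls_height=0` lets the overlays move back down to the bottom edge.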

Requirements on the parallel use of alternate content on potentially multiple devices in parallel

Multiple user devices must be directly addressable. It must be assumed that many users will have multiple video displays and/or multiple audio-output devices attached to an individual computer, or addressable via LAN. It must be possible to configure certain types of media for presentation on specific devices, and these configuration settings must be readily overridable on a case-by-case basis by users.

(A request to the UAAG on clarifications to a number of these points was made, and a detailed response was provided. The response requires review and integration into this document, but can be found today at:


Systems supporting multiple devices for accessibility must:

  • (MD-1) Support a platform-accessibility architecture relevant to the operating environment. (UAAG 2.0 2.1.1)
  • (MD-2) Ensure accessibility of all user-interface components including the user interface, rendered content, and alternative content; make available the name, role, state, value, and description via a platform-accessibility architecture. (UAAG 2.0 2.1.2)
  • (MD-3) If a feature is not supported by the accessibility architecture(s), provide an equivalent feature that does support the accessibility architecture(s). Document the equivalent feature in the conformance claim. (UAAG 2.0 2.1.3)
  • (MD-4) If the user agent implements one or more DOMs, they must be made programmatically available to assistive technologies. (UAAG 2.0 2.1.4) This assumes the video element will write to the DOM.
  • (MD-5) If the user can modify the state or value of a piece of content through the user interface (e.g., by checking a box or editing a text area), the same degree of write access is available programmatically (UAAG 2.0 2.1.5).
  • (MD-6) If any of the following properties are supported by the accessibility-platform architecture, make the properties available to the accessibility-platform architecture (UAAG 2.0 2.1.6):
    • (a) the bounding dimensions and coordinates of rendered graphical objects;
    • (b) font family;
    • (c) font size;
    • (d) text foreground color;
    • (e) text background color;
    • (f) change state/value notifications.
  • (MD-7) Ensure that programmatic exchanges between APIs proceed at a rate such that users do not perceive a delay. (UAAG 2.0 2.1.7).