This Wiki page is edited by participants of the HTML Accessibility Task Force. It does not necessarily represent consensus and it may have incorrect information or information that is not supported by other Task Force participants, WAI, or W3C. It may also have some very useful information.

Media TextAssociations Requirements

Requirements for File Formats for External Text Associations for Media Resources

This page summarises the requirements that the Media Accessibility TF collected for synchronised text alternatives for media resources, in particular for caption formats.

The reason for this collection is that many assume that a simple text format that consists of a sequence of [start, end, text] tuples is sufficient, while others strongly object.

Legal requirements for (media) accessibility

US: Section 508. Section 508 contains a section on Video or Multimedia Products (1194.24), which asks for user-selectable alternative text and audio description support. A further section on Web-based intranet and internet information and applications (1194.22) asks for equivalent synchronized alternatives for multimedia presentations and relates this to WCAG 1.0 Checkpoint 1.4.

UK: Disability Discrimination Act (DDA). "It's widely believed that if, or perhaps more appropriately when, a case makes it to court that the W3C accessibility guidelines will be used to assess a website's accessibility and ultimately decide the outcome of the case." It seems that WCAG 1.0 priority 1 is required and priority 2 is encouraged (it is compulsory only for government departments).

EU: Council of the European Union resolution. "European institutes and member state governments are asked to fulfill priority 1 as well as priority 2 of the W3C/WCAG guidelines."

[add details for your favorite country here...]


WCAG 1.0 priority 2 thus seems to be the highest requirement found so far.

WCAG requirements

WCAG 1.0:

"Guideline 1. Provide equivalent alternatives to auditory and visual content."

Checkpoint 1.1: asks for text equivalents (priority 1). Text equivalents for multimedia are listed as:

  • text transcript (non-synchronised transcription of audio track),
  • caption (synchronised),
  • collated text transcript (captions with scene information)

Checkpoint 1.3: asks for auditory descriptions (priority 1). "Until user agents can automatically read aloud the text equivalent of a visual track, provide an auditory description of the important information of the visual track of a multimedia presentation."

Checkpoint 1.4: asks for synchronised captions and auditory descriptions (priority 1). Basically repeats Checkpoints 1.1 and 1.3.

WCAG 2.0:

Principle 1: Perceivable - Information and user interface components must be presentable to users in ways they can perceive. Guideline 1.2 Time-based Media: Provide alternatives for time-based media.

Level A requirements:

  • 1.2.1 pre-recorded audio-only: non-synchronised text equivalent
  • 1.2.1 pre-recorded video-only: non-synchronised text equivalent or audio description
  • 1.2.2 pre-recorded audio-visual: captions
  • 1.2.3 pre-recorded audio-visual: audio description or media alternative

Level AA requirements:

  • 1.2.4 live audio-visual: live captions
  • 1.2.5 pre-recorded audio-visual: audio description

Level AAA requirements:

  • 1.2.6 pre-recorded audio-visual: sign language
  • 1.2.7 pre-recorded audio-visual: extended audio description (incl. pausing the original video)
  • 1.2.8 pre-recorded audio-visual & video: media alternative required (text equivalent)
  • 1.2.9 live audio-only: media alternative required (live captions or text equivalent)

Satisfying WCAG1.0, Checkpoint 1.1

Provide a non-synchronised text transcript:

  • could be done by putting text next to the video or audio element
  • could be done by putting a link to a text transcript file next to the video or audio element

These are best linked to the video or audio element using aria-describedby.
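
As a rough sketch of the two options above, the following Python helper builds markup in which a video element points at its transcript through aria-describedby. The function name, the element id, and the file names are hypothetical and chosen only for illustration.

```python
from html import escape


def video_with_transcript(src, transcript_id, transcript_html=None, transcript_url=None):
    """Return markup for a video element linked to a transcript via aria-describedby.

    Hypothetical helper: either inline transcript markup or a link to a
    transcript file is placed next to the video, inside the referenced element.
    """
    if transcript_html is None and transcript_url is None:
        raise ValueError("provide transcript_html or transcript_url")
    if transcript_html is not None:
        body = transcript_html  # caller-supplied markup, inserted as-is
    else:
        body = '<a href="{}">Transcript</a>'.format(escape(transcript_url, quote=True))
    return (
        '<video src="{}" controls aria-describedby="{}"></video>\n'
        '<div id="{}">{}</div>'
    ).format(
        escape(src, quote=True),
        escape(transcript_id, quote=True),
        escape(transcript_id, quote=True),
        body,
    )
```

Calling video_with_transcript("talk.webm", "talk-transcript", transcript_url="talk.txt") yields a video element followed by a div that holds the link to the transcript file.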

Provide captions: "A caption is a text transcript for the audio track of a video presentation that is synchronized with the video and audio tracks. Captions are generally rendered visually by being superimposed over the video, which benefits people who are deaf and hard-of-hearing, and anyone who cannot hear the audio (e.g., when in a crowded room)."

A simple [start,end,text] collection will suffice to satisfy this requirement.
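
A minimal sketch of such a collection, assuming times are given in seconds and that a cue is visible from its start time up to, but not including, its end time. The names Cue and active_cues are illustrative and do not come from any specification.

```python
from typing import List, NamedTuple


class Cue(NamedTuple):
    start: float  # seconds from the beginning of the media resource (assumed unit)
    end: float    # seconds; the cue is shown while start <= t < end
    text: str


def active_cues(cues: List[Cue], t: float) -> List[Cue]:
    """Return the cues that should be displayed at playback time t."""
    return [cue for cue in cues if cue.start <= t < cue.end]


captions = [
    Cue(0.0, 2.5, "Hello, and welcome."),
    Cue(2.5, 6.0, "Today we look at captions."),
]
```

A player would call active_cues on every time update and render the returned text over the video.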

Satisfying WCAG1.0, Checkpoint 1.3

Provide auditory descriptions: "Auditory descriptions of the visual track provide narration of the key visual elements without interfering with the audio or dialogue of a movie. Key visual elements include actions, settings, body language, graphics, and displayed text. Auditory descriptions are used primarily by people who are blind to follow the action and other non-auditory information in video material."

"The description is either a prerecorded human voice or a synthesized voice (recorded or generated on the fly). The auditory description is synchronized with the audio track of the presentation, usually during natural pauses in the audio track. Auditory descriptions include information about actions, body language, graphics, and scene changes."

A simple [start,end,text] collection will suffice to synthesize a voice for audio descriptions.
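
The following sketch illustrates how such a collection could drive a synthesized voice, assuming times in seconds and a player that reports its playback position at regular intervals. The speak callback stands in for whatever text-to-speech facility is available; it and the function names are assumptions, not part of any existing API.

```python
from typing import Callable, List, Tuple

# (start, end, text), with times in seconds (assumed unit)
DescriptionCue = Tuple[float, float, str]


def due_descriptions(cues: List[DescriptionCue], prev_t: float, now_t: float) -> List[DescriptionCue]:
    """Return the cues whose start time falls in the interval [prev_t, now_t)."""
    return [cue for cue in cues if prev_t <= cue[0] < now_t]


def on_time_update(cues: List[DescriptionCue], prev_t: float, now_t: float,
                   speak: Callable[[str], None]) -> None:
    """Hand the text of every newly due cue to a speech synthesizer.

    `speak` is a placeholder for the actual text-to-speech call.
    """
    for _start, _end, text in due_descriptions(cues, prev_t, now_t):
        speak(text)
```

The end time is not used for scheduling here; a fuller implementation could use it to check that the spoken text fits into the pause in the audio track.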

Satisfying WCAG1.0, Checkpoint 1.4

Provide captions (see Checkpoint 1.1). Provide auditory descriptions (see Checkpoint 1.3).

WCAG 1.0 does not state any requirements for more than [start,end,text].


It seems that none of the WCAG recommendations and requirements prescribes more than just captions, where captions are referred to as a
"text transcript" that is synchronised to the audio. No formatting requirements are stated.

Capabilities of related Technology


The US Federal Communications Commission (FCC) DTV Decoder Standards, adopted in July of 2001, lay out the features that DTV CC decoders in the US must support. The full doc itself is at The decoder requirements were created after the FCC received numerous comments from the deaf and hard-of-hearing community on the importance of having both authorial and user access to styling features. The DTV Decoder Standards ensure that CEA-708 captions containing specific styling features (e.g., foreground/background color, translucency, font face and size, etc.) will appear on digital televisions as the author intended.

Summary of Requirements:

Decoder Operation

The Order adopts the requirement of Section 9 of EIA-708, with the following modifications:

  • Decoders must support the standard, large, and small caption sizes and must allow the caption provider to choose a size and allow the viewer to choose an alternative size.
  • Decoders must support the eight fonts listed in EIA-708 (1). Caption providers may specify 1 of these 8 font styles to be used to write caption text. Decoders must include the ability for consumers to choose among the eight fonts. The decoder must display the font chosen by the caption provider unless the viewer chooses a different font.
  • Decoders must implement the same 8 character background colors as those that Section 9 requires be implemented for character foreground (white, black, red, green, blue, yellow, magenta and cyan).
  • Decoders must implement options for altering the appearance of caption character edges.
  • Decoders must display the color chosen by the caption provider, and must allow viewers to override the foreground and/or background color chosen by the caption provider and select alternate colors.
  • Decoders must be capable of decoding and processing data for the six standard services, but information from only one service need be displayed at a given time.
  • Decoders must include an option that permits a viewer to choose a setting that will display captions as intended by the caption provider (a default). Decoders must also include an option that allows a viewer's chosen settings to remain until the viewer chooses to alter these settings, including during periods when the television is turned off.
  • Cable providers and other multichannel video programming distributors must transmit captions in a format that will be understandable to this decoder circuitry in digital cable television sets when transmitting programming to digital television devices.
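
The precedence between caption provider and viewer described in the list above can be sketched as follows. This is an illustration only: the CaptionStyle fields, the DECODER_DEFAULT values, and the function name are assumptions made for the example and are not taken from EIA-708 or the FCC Order.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class CaptionStyle:
    size: Optional[str] = None        # "small", "standard" or "large"
    font: Optional[str] = None        # one of the eight font styles
    foreground: Optional[str] = None  # one of the eight colors
    background: Optional[str] = None  # one of the eight colors


# Assumed fallback values, used when neither provider nor viewer chose anything.
DECODER_DEFAULT = CaptionStyle("standard", "default", "white", "black")


def effective_style(provider: CaptionStyle, viewer: CaptionStyle,
                    use_provider_defaults: bool = False) -> CaptionStyle:
    """Viewer choices override provider choices, which override decoder defaults.

    With use_provider_defaults=True the viewer overrides are ignored, so the
    captions are displayed as intended by the caption provider.
    """
    resolved = {}
    for f in fields(CaptionStyle):
        choice = None if use_provider_defaults else getattr(viewer, f.name)
        if choice is None:
            choice = getattr(provider, f.name)
        if choice is None:
            choice = getattr(DECODER_DEFAULT, f.name)
        resolved[f.name] = choice
    return CaptionStyle(**resolved)
```

Persisting the viewer's CaptionStyle across power cycles, as the requirements demand, is left out of the sketch.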

Covered Devices

  • All digital television receivers with picture screens in the 4:3 aspect ratio measuring at least 13 inches diagonally, digital television receivers with picture screens in the 16:9 aspect ratio measuring 7.8 inches or larger vertically (this size corresponds to the vertical height of an analog receiver with a 13 inch diagonal), and all DTV tuners, shipped in interstate commerce or manufactured in the United States must comply with the minimum decoder requirements we are adopting here.
  • The rules apply to DTV tuners whether or not they are marketed with display screens.
  • Converter boxes used to display digital programming on analog receivers must deliver the encoded "analog" caption information to the attached analog receiver.
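
The 7.8 inch figure follows from simple geometry: a 4:3 screen forms a 3-4-5 triangle with its diagonal, so a 13 inch diagonal corresponds to a height of 13 × 3/5 = 7.8 inches. The small sketch below reproduces this arithmetic; the function names are illustrative.

```python
import math


def vertical_height(diagonal: float, aspect_w: int, aspect_h: int) -> float:
    """Height of a screen with the given diagonal and aspect ratio (same unit as diagonal)."""
    return diagonal * aspect_h / math.hypot(aspect_w, aspect_h)


def diagonal_for_height(height: float, aspect_w: int, aspect_h: int) -> float:
    """Diagonal of a screen with the given height and aspect ratio (same unit as height)."""
    return height * math.hypot(aspect_w, aspect_h) / aspect_h
```

By the same formula, a 16:9 screen that is 7.8 inches tall has a diagonal of roughly 15.9 inches.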

Compliance Dates

  • Manufacturers must begin to include DTV closed caption functionality in DTV devices in accordance with the rules adopted in the Order by July 1, 2002.
  • As provided for in the Commission's rules establishing requirements for the closed captioning of video programming adopted in a 1997 Order, programming prepared or formatted for display on digital television receivers before the date that digital television decoders are required to be included in digital television devices is considered "pre-rule" programming. As stated above, this order establishes that date as July 1, 2002. Therefore, programming prepared or formatted for display on digital television after that date will be considered new programming. The existing rules require an increasing amount of captioned new programming over an eight-year transition period with 100% of all new nonexempt programming required to be captioned by January 1, 2006.

(1) The eight font styles are defined as follows:

  • default (undefined),
  • monospaced with serifs (similar to Courier),
  • proportionally spaced with serifs (similar to Times New Roman),
  • monospaced without serifs (similar to Helvetica Monospaced),
  • proportionally spaced without serifs (similar to Arial and Swiss),
  • casual font type (similar to Dom and Impress),
  • cursive font type (similar to Coronet and Marigold), and
  • small capitals (similar to Engravers Gothic).

In parentheses following each font style is a reference to one or more fonts which are similar to the style.

Broadcast or Cable TV

In the US, providers of online media originating on broadcast or cable TV are complying with the existing requirements, which were based on significant public input, and have initiated a technical working group within SMPTE to assure that they can “author once, use often” in terms of the captions they are paying for. These captions are presently automatically translated to CEA-708 caption files with full stylistic mark-up. Caption providers would like to see this mark-up preserved when the captions are transformed to other delivery formats, rather than discarded because the target format doesn’t support these features.


In terms of compatibility with newer standards, note that the Advanced Television Systems Committee (ATSC) Mobile DTV Standard (A/153) includes support for the transmission of CEA-708 captions, which contain a range of styling features. SMPTE-TT will also contain a wide range of caption-style features.


An analysis of existing government regulations for other audio/video publishing devices suggests that requirements for
formatting, styling, fonts, and positioning of captions can be expected for any medium, including the Web. Further, as such features
are available on other devices, it makes sense to preserve them when content crosses media devices. There is clearly a need for such features.

User requirements: captions

Accessibility users are special users. Each group has its own way of perceiving the world.

For example, many deaf people use the sign language of their local community, and it is often their first language. The language that their schooling material comes in, e.g. English, is often a second language in which they are less fluent. Their reading speed for captions provided in, e.g., English may therefore be much slower than that of a native English speaker, so it may be necessary to offer alternative captions adapted to different reading abilities. Also, captions that are formatted, e.g. with text colors that relate to speakers, with text positioned under the actual speaker, with sound effects marked in italics, or with important content marked bold, help users perceive the captions faster and are therefore extremely useful.
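
To illustrate what such formatting adds beyond [start,end,text], here is a sketch of a cue record that carries a speaker and simple styling, together with a helper that assigns one color per speaker. The field names, the palette, and the helper are hypothetical and do not reflect any particular caption format.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Dict, List, Optional

PALETTE = ["white", "yellow", "cyan", "green"]  # assumed palette, for illustration


@dataclass
class StyledCue:
    start: float                   # seconds (assumed unit)
    end: float
    text: str
    speaker: Optional[str] = None  # None for sound effects and the like
    color: Optional[str] = None
    italic: bool = False           # e.g. to mark sound effects


def color_by_speaker(cues: List[StyledCue]) -> Dict[str, str]:
    """Give every speaker a color and apply it to that speaker's cues."""
    colors = cycle(PALETTE)
    assigned: Dict[str, str] = {}
    for cue in cues:
        if cue.speaker is None:
            continue
        if cue.speaker not in assigned:
            assigned[cue.speaker] = next(colors)
        cue.color = assigned[cue.speaker]
    return assigned
```

Positioning a cue under the actual speaker would need further fields, for example screen coordinates, which are omitted here.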

Another thing to keep in mind is that the deaf/hard-of-hearing community has asked specifically for styling features in captions. Organizations such as Telecommunications for the Deaf, Self Help for Hard of Hearing People, National Association for the Deaf and the AG Bell Association for the Deaf all wrote to the FCC in support of things like foreground/background color, text size and location. The fact that you do not see these features in wide use today is not a reason to preclude them from future use. It would be wise to consult with expert users in the deaf and hard-of-hearing community before discarding the capacity for a degree of stylistic mark-up.

The SMPTE working group has established a liaison relationship with the Coalition of Organizations for Accessible Technology, a group of interested parties (numbering more than 300 US national and regional organizations serving deaf or blind people). The members of this group would certainly have an interest in this discussion.


Relevant user groups have a strong requirement for styling features. There are organisations that work to ensure such features are
provided and that work with standards bodies on specifying them.