This Wiki page is edited by participants of the HTML Accessibility Task Force. It does not necessarily represent consensus and it may have incorrect information or information that is not supported by other Task Force participants, WAI, or W3C. It may also have some very useful information.

Media TextAssociations

From HTML accessibility task force Wiki

New Declarative Syntax for Associating Synchronized Text to Media Elements


This is a proposal to extend the HTML5 declarative markup for media elements with markup to reference external, time-synchronous associated text resources. The aim is in particular to provide a standard way to reference external captions, subtitles, and possibly textual audio descriptions, as well as other time-aligned text such as lyrics, karaoke, or ticker text.

Similar markup could potentially be used in future for external synchronised audio and video resources too, but it is too early to experiment with these.


On 16th/17th Feb 2010 the media subgroup of the W3C HTML5 Accessibility Task Force had a phone discussion about this proposal and agreed that it was in principle ready for experimentation and trial implementations.

Related Bugs


WCAG 2.0 recommends a large number of alternative representations for audio-visual content for accessibility purposes. Amongst them is synchronized text, which is text that transcribes/describes what is being said or is happening in the audio-visual resource. Examples are captions (as alternative for the audio track) and textual audio descriptions (as alternative for the video track, read out by a screen reader or transferred to a braille device).

There are also other forms of associated synchronized text that add onto a media element, in particular subtitles for internationalisation, song lyrics, and even karaoke.

Right now there is no standard means of associating them with a media element and displaying them synchronously with the media data.

Related Proposals

This proposal tries to bring all these proposals together.


The Markup

The track element:

   interface HTMLTrackElement : HTMLElement {
          attribute DOMString src;
          attribute DOMString name;
          attribute DOMString role;
          attribute DOMString type;
          attribute DOMString media;
          attribute DOMString language;
          attribute boolean   enabled;
   };

The track element allows authors to specify multiple alternative associated text resources for media elements. These are treated like virtual tracks that can be added to the active media resource.

The external text resource is expected to consist of a sequence of time intervals with associated text and potentially layout, styling, and animation information for the text in a format that the UA understands.

The text track is synchronised to the parent audio or video element's active resource's timeline, which is the only relevant timeline. In particular, if a text track is longer than the resource's timeline, anything beyond the end of the resource's timeline is ignored.

The text in a text interval is displayed while the active resource's currentTime is at or after the start time of the interval but before its end time (a semi-open interval: [start,end) ).
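This display condition can be sketched in script; the `interval` object with `start`/`end` fields and the `intervalActive` helper are illustrative stand-ins for a parsed text interval, not part of the proposal:

```javascript
// Sketch (not part of the proposal): decide whether a text interval
// is active at the media element's currentTime. The interval is
// semi-open, [start, end): active at the start time, no longer
// active once the end time is reached.
function intervalActive(interval, currentTime) {
  return currentTime >= interval.start && currentTime < interval.end;
}
```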

If a frame of the video is displayed which has an associated active text interval, the text of that text interval needs to be visible and accessible.

The @src attribute gives the address of the text resource to associate. The attribute, if present, must contain a valid URL.

The @name attribute allows the author to provide a short, descriptive name for the track, which can be used as an identifier and to represent the track in a menu.

The @role attribute is optional and provides a description of the content that a track offers to the media resource. The following roles are pre-defined for now:

  • "caption",
  • "subtitle",
  • "textaudiodesc",
  • "karaoke",
  • "chapters",
  • "tickertext",
  • "lyrics".

The @role attribute has several purposes:

  • it puts tracks together in the same css styling class,
  • it provides a semantic hint for applications/UAs as to what is in the track. So, if e.g. a browser preference says "always automatically activate all subtitles in Swedish" then you can find the subtitle track in Swedish.
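As a sketch of the second purpose, a preference like "always automatically activate all subtitles in Swedish" amounts to a lookup over the track list; the plain objects and the `findTrack` helper below are illustrative assumptions, not a proposed API:

```javascript
// Sketch (illustrative, not a proposed API): locate a track by
// @role and @language, as a UA preference such as
// "auto-enable Swedish subtitles" would need to.
function findTrack(tracks, role, language) {
  return tracks.find(t => t.role === role && t.language === language) || null;
}
```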

The @type attribute describes the text resource in terms of its MIME type, optionally with a charset parameter. The baseline media types to support are "text/srt" and "application/ttaf+xml" (see File Formats below).
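Splitting such a @type value into its MIME type and optional charset parameter could look as follows; `parseTypeAttr` is a hypothetical helper, and a real UA would use a full MIME parser, but the quoting style follows the examples later on this page:

```javascript
// Sketch (hypothetical helper): split a @type value such as
// "text/srt; charset='EUC-JP'" into its MIME type and charset.
function parseTypeAttr(value) {
  const [mime, ...params] = value.split(";").map(s => s.trim());
  let charset = null;
  for (const p of params) {
    const m = p.match(/^charset='?([^']+)'?$/i);
    if (m) charset = m[1];
  }
  return { mime, charset };
}
```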

The @media attribute provides a valid media query. A media query that evaluates to "false" means the track cannot be enabled because it is not appropriate for the user's environment.

The @language attribute, if present, gives the language of the linked resource. The value must be a valid RFC 3066 language code. [RFC3066]

The @enabled attribute expresses the page author's intention to have this track activated. The UA may overrule this intent where it has extra knowledge about user preferences, e.g. for selecting a more appropriate alternative out of a trackgroup.

If provided, the @enabled attribute specifies that the track is enabled and should be displayed in an appropriate manner. If the attribute is not provided, the track is not displayed.

The trackgroup element:

The <trackgroup> element is an optional element used to group several <track> elements together. Tracks in a trackgroup can only be enabled mutually exclusively, similar to how a radiogroup works. It is possible that none of the tracks in a trackgroup are enabled.

Since tracks are often grouped based on one of their attributes, in particular @role or @language, the trackgroup element inherits most of the attributes of the track element. This also avoids replicating the same attribute value across all the tracks in a trackgroup.

Elements inside the <trackgroup> element list alternative resources which are selected by the UA based on an algorithm.

   interface HTMLTrackGroupElement : HTMLElement {
          attribute DOMString role;
          attribute DOMString type;
          attribute DOMString name;
          attribute DOMString language;
   };


The content of a fetched text resource is parsed into text pieces that are supposed to be displayed from a certain start time to a certain end time of the media element's timeline. The track element provides a <div>-like area into which the text pieces are rendered.

For video / audio this <div>-like area is typically by default the extent (width / height) of the video or audio element. Audio with visible text must display controls and have a minimum height of 100px to provide space into which the text can be rendered.

If the text is rendered on top of the video, rendering engines must ensure that the rendered text does not collide with other elements rendered on top of the video, e.g. controls. Also, the text must be available to assistive technology when visible.

The text inside the <div>-like area is given a default styling, which is overridden by styling provided by the text format, if available, which in turn can be overridden by specific styling from the Web page author.

Depending on the role, the default styling of the <div>-like area will be different:

caption, subtitle, lyrics, karaoke:
 color: white;
 background-color: #333333;
 text-align: center;
 bottom: 0;
textual audio descriptions:
 visibility: hidden; (unless this makes screen readers not read them out)
 aria-live: assertive;
 position: absolute;
 z-index: -100; (or more - shouldn't be visible)

Recommended user interface

  • UAs are recommended to add an icon to the controls bar of the video or audio element to indicate the existence of associated text. UAs may display the available text associations through a menu in which the resources are listed. Where a trackgroup is used, a sub-menu should be created with a radio button style selection.
  • UAs are also recommended to scale the display area and font size with the video, in particular when the video goes full-screen.

Resource selection algorithm

Only fetch a <track> element's resource:

  • if the resource is indicated to be in a format (@type attribute) that the UA knows how to parse, and
  • if the media query resolves for the given device, and
  • if the track is enabled.
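The three conditions above can be sketched as a single predicate; the `supportedTypes` set and the pre-evaluated `mediaQueryMatches` flag are assumptions standing in for the UA's real format registry and media-query evaluation:

```javascript
// Sketch (with the assumptions noted above): should the UA fetch
// this track's resource?
const supportedTypes = new Set(["text/srt", "application/ttaf+xml"]);

function shouldFetch(track, mediaQueryMatches) {
  // Ignore any charset parameter when checking the format.
  const baseType = (track.type || "").split(";")[0].trim();
  return supportedTypes.has(baseType) && mediaQueryMatches && Boolean(track.enabled);
}
```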

For each trackgroup:

- the UA can decide to set the enabled attribute based on the @language and/or @media attributes of its track elements.
- If there exist tracks with the enabled attribute, select the first such track.
- Otherwise, do not use any of the tracks.
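The selection step can be sketched as follows, with the trackgroup modelled as a plain array of track objects (an illustrative assumption):

```javascript
// Sketch: pick the active track of a trackgroup — the first track
// with the enabled attribute wins; if none is enabled, no track
// from the group is used.
function selectTrack(group) {
  return group.find(t => t.enabled) || null;
}
```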

A track is enabled in one of the following cases:

  • if there is an @enabled attribute on the track.

-> in a trackgroup only the first such track is enabled

  • if the track matches UA preferences (e.g. auto-enable all tracks with role "caption" and language "en").

-> only the first such track in a trackgroup is enabled

  • if it has been enabled through a JavaScript call such as: video.tracks[1].enabled = true;

-> if such a track is part of a trackgroup, this will disable any other enabled track in that trackgroup
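The radio-button behaviour of this last case could be sketched like this; `enableInGroup` is a hypothetical helper over a plain array standing in for a trackgroup's children:

```javascript
// Sketch (hypothetical helper): enabling one track in a trackgroup
// disables all of its siblings, radio-button style.
function enableInGroup(group, index) {
  group.forEach((track, i) => { track.enabled = (i === index); });
}
```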

File Formats

As with the video element, introducing externally associated resources requires a choice to be made about the file formats that should be supported by default in a User Agent.

Requirements for such file formats have been collected in

A brief discussion at the TPAC in November 2009 seemed to indicate that the W3C Timed Text Format TTML should be the first choice. As an alternative, simple format, the SubRip srt format in its simplest form should also be supported by browsers. Since srt can be regarded as a simple subpart of TTML, implementing support for srt should be straightforward.
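For illustration, parsing one cue of srt in its simplest form (a numeric counter, a `start --> end` timing line with `HH:MM:SS,mmm` timestamps, then the text) might look like this; the helper names are assumptions, not a proposed API:

```javascript
// Sketch (illustrative, not a proposed API): parse a single srt cue.
function srtTimeToSeconds(ts) {
  // "HH:MM:SS,mmm" -> seconds as a number.
  const [h, m, rest] = ts.split(":");
  const [s, ms] = rest.split(",");
  return (+h) * 3600 + (+m) * 60 + (+s) + (+ms) / 1000;
}

function parseSrtCue(block) {
  const lines = block.trim().split("\n");
  const [start, end] = lines[1].split(" --> ");
  return {
    index: parseInt(lines[0], 10),
    start: srtTimeToSeconds(start.trim()),
    end: srtTimeToSeconds(end.trim()),
    text: lines.slice(2).join("\n"),
  };
}
```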

There is a path to implementation of TTML starting with the simplest profile, the transformation profile at .

For other types of associated text, file format discussions are necessary. For most, a transcoding to one of the two default formats should be possible without loss of information.


<video src="video.ogv">
  <track src="video_cc.dfxp" type="application/ttaf+xml" language="en" role="caption"></track>
</video>

<video src="video.ogv">
  <track src="" type="text/srt" language="en" role="textaudiodesc"></track>
</video>

<video src="video.ogv">
  <track src="video_cc.dfxp" type="application/ttaf+xml" language="en" role="caption"></track>
  <track src="" type="text/srt" language="en" role="textaudiodesc"></track>
  <trackgroup role="subtitle">
    <track src="" type="text/srt; charset='Windows-1252'" language="en"></track>
    <track src="" type="text/srt; charset='ISO-8859-1'" language="de"></track>
    <track src="" type="text/srt; charset='EUC-JP'" language="ja"></track>
  </trackgroup>
</video>

<video controls>
  <source src='video.ogv' type='video/ogg'>
  <source src='video.mp4' type='video/mp4'>
  <track src="video_cc.dfxp" type="application/ttaf+xml" language="en" role="caption"></track>
  <track src="" type="text/srt" language="en" role="textaudiodesc"></track>
  <track src="" type="text/srt; charset='Windows-1252'" language="en" role="subtitle"></track>
</video>


Positive Impact

  • a standard means of associating external caption, subtitle, and textual audio description resources is introduced, which will avoid the creation of a large number of home-made JavaScript solutions to the same problem
  • if Web tools can rely on associated text resources being referenced in this way, automated processing tools will be enabled, such as e.g. search engine access to captions
  • two formats are proposed to be supported, one of which is trivially simple while the other has all the features required for high-quality captions and subtitles

Negative Impact

  • two baseline formats are never better than one, but since srt is a trivial subset of dfxp, this should not become an issue