
SMIL accessibility recommendations

Notes in draft:
This draft is acknowledged to be incomplete. At a minimum, it needs more information regarding:
  • User control of the play process, such as being able to freeze play to allow extra time for reading
  • Adaptation to accommodate motor and cognitive disabilities


transcript and script
The text of the spoken information in a video. If two or more people are speaking, the transcript or script identifies each speaker, similar to the notation used for a play.
subtitles
Phrase-by-phrase display of the script in conjunction with a movie or other video content.
captions
Captioning is visually similar to subtitles, but it goes beyond the text of the script to provide a richer description, including speech, emotional tone, speaker identification, and sounds. The captioning may include graphic and/or video captions as well as text. Sounds that are represented in captions but not in subtitles include phone rings, knocks, song lyrics, and descriptions of music. Captions are used by deaf and hard of hearing people, by those learning to read, by those learning a new language, and by those in noisy environments or in places where sound would disturb others.
long description
A textual account of the important information in an image, given the context in which the image appears. The same image can have different meanings in different contexts.
alt text
The text value of the "alt" attribute for an image, object, or other non-textual element in an HTML document. For user agents that cannot display images, forms, or applets, this attribute specifies alternate text.
Video description
A verbal account of the visual information presented in a video or animation that coordinates with or supplements the other audio information being presented. For real-time descriptive narration added to live material, the term "audio description" is generally used.


Many people will want the transcript or the script of a multimedia presentation. Transcripts can already be ordered today for many TV and radio broadcasts. The transcript can serve as the basis for captioning and description, although additional information is necessary. As noted above, captions for deaf and hard of hearing people must include information such as "phone rings" and "angry shouting". The traditional transcript concept can be expanded to meet these requirements.

It should be possible to include the transcript of a video in HTML. A caption authoring tool could be used to wrap portions of the transcript in "span" or "div" elements and associate each portion with a specific time in the video or audio presentation through the SMIL file. Additional information can be placed in the transcript file and identified with a "class" attribute of "captioning" to distinguish it from the transcript proper. When the multimedia presentation is shown, those wanting to see captions would see both the timed transcript elements and the elements identified with the class attribute of captioning.
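The arrangement described above can be sketched in Python. This is only an illustration under stated assumptions: the transcript markup, the element ids, and the TIMING table (which stands in for the timing that the SMIL file would carry) are all hypothetical, and the parser assumes a flat transcript with no nested "div" or "span" elements.

```python
from html.parser import HTMLParser

# Hypothetical transcript markup; ids and text are illustrative only.
TRANSCRIPT = """
<div id="t1">ANNA: Did you hear that?</div>
<div id="c1" class="captioning">[phone rings]</div>
<div id="t2">BEN: I'll get it.</div>
"""

# Hypothetical stand-in for the SMIL timing: id -> (begin, end) in seconds.
TIMING = {"t1": (0.0, 2.5), "c1": (2.5, 4.0), "t2": (4.0, 5.5)}


class TranscriptParser(HTMLParser):
    """Collects (id, is_caption_only, text) for each div or span that has an id."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._current = None  # element being read, or None between elements

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("div", "span") and "id" in attrs:
            classes = (attrs.get("class") or "").split()
            self._current = [attrs["id"], "captioning" in classes, ""]

    def handle_data(self, data):
        if self._current is not None:
            self._current[2] += data

    def handle_endtag(self, tag):
        if tag in ("div", "span") and self._current is not None:
            self.items.append(tuple(self._current))
            self._current = None


def timed_cues(html, timing, captions_on):
    """Return sorted (begin, end, text) cues; caption-only elements need captions_on."""
    parser = TranscriptParser()
    parser.feed(html)
    parser.close()
    cues = []
    for elem_id, caption_only, text in parser.items:
        if caption_only and not captions_on:
            continue
        begin, end = timing[elem_id]
        cues.append((begin, end, text.strip()))
    return sorted(cues)
```

Calling timed_cues(TRANSCRIPT, TIMING, captions_on=False) yields only the spoken text, while captions_on=True also yields the "[phone rings]" element at its scheduled time.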

There may be translations of the transcript and the associated captioning information into different languages. The choice among them would be made through the standard mechanisms for language selection.

1. Transcript Review

After experiencing a SMIL presentation, the person wants to go back and review the transcript. The person would "make visible" or "turn on" the transcription option. This would launch an HTML presentation of the transcript. The default presentation would show the text and scroll as the video and audio are played. The implementer may elect to position the currently spoken element in the middle of the screen, so that the user sees the previous and next elements and gets the context of the transcript. The elements marked with the class attribute of captioning would be hidden through the application of a style sheet. Of course, the user could make these elements visible by applying a different style. This is standard HTML browser behavior.
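One way to keep the currently spoken element centered with its neighbors in view is sketched below. The function names and the assumption that each element has a known begin time (sorted in ascending order) are illustrative and not part of any specification.

```python
from bisect import bisect_right


def current_index(begin_times, t):
    """Index of the element being spoken at time t (begin_times sorted ascending)."""
    i = bisect_right(begin_times, t) - 1
    return max(i, 0)  # before the first begin time, show the first element


def context_window(elements, begin_times, t, before=1, after=1):
    """Return (previous elements, current element, next elements) for time t."""
    i = current_index(begin_times, t)
    lo = max(i - before, 0)
    hi = min(i + after + 1, len(elements))
    return elements[lo:i], elements[i], elements[i + 1:hi]
```

A player could call context_window on each clock tick and place the middle value at the center of the screen, with the other two lists above and below it.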

2. Digital Talking Book

This is very straightforward. There is a one-to-one relationship between the HTML and the spoken audio. The person uses the HTML for navigation and can launch the audio portion from any location in the HTML. Elements may be highlighted, and the person may elect to have individual words highlighted if that is available. This is being defined by the DAISY Consortium.

3. Alt Text, Long Descriptions, Descriptive Video

For persons who are blind or print disabled, it must be possible to hear alt text announced and then hear long descriptions, if available, read as well. This is mainly an issue of pausing and resuming: it should be possible to pause automatically when a new image appears. Descriptive video seems to pose no problem; it is an audio component that parallels the video and is turned on at the option of the user. It is worth pointing out that in today's descriptive video, the description must fit into the time frame of the video. With SMIL this could still be done, but it is also possible to pause the video until the descriptive video portion is completed. This requires no modification to SMIL, but it is a benefit of the specification.
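The effect of pausing the video until a description finishes can be worked through with a small sketch. The tuple layout (media time, length of the natural gap in the program audio, length of the description) is an assumption made for illustration.

```python
def schedule_descriptions(descriptions):
    """Compute pauses needed so each description can finish.

    descriptions: list of (media_time, gap_seconds, description_seconds),
    sorted by media_time.
    Returns a list of (media_time, presentation_time, pause_seconds), where
    presentation_time includes all pauses inserted earlier.
    """
    offset = 0.0  # total pause time accumulated so far
    schedule = []
    for media_time, gap, length in descriptions:
        # Pause only for the part of the description that overruns the gap.
        pause = max(0.0, length - gap)
        schedule.append((media_time, media_time + offset, pause))
        offset += pause
    return schedule
```

For example, a 4-second description placed in a 1-second gap requires a 3-second pause, and everything after it starts 3 seconds later in presentation time.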

4. Captioning

It seems that a descriptor attribute of "captioning" is needed. The player would behave differently from standard HTML browsers. Each element addressed through an ID in the HTML transcription / captioning file would be displayed in a captioning channel, if the user has made that channel visible through an option. All HTML features, including variable fonts, alignment, colors, images, and links, would be supported.

When the captioning option is turned on, there should be several options for presenting the captioning. Each element in the transcript and captioning file is associated with a duration in the audio component. There is a possible conflict between the duration of the audio and the time needed to read the captioning. These conflicts lead us to the following options.

4.1 variable reading rate

The captioning player should have a variable reading rate control, which puts the text up at a rate set by the user. This implies that one of the following things happens to the audio and associated video: a) the audio and video fill and freeze on the last frame while the text is being scrolled, or b) the audio and video go into slow motion. NOTE: audio distorts when slowed down beyond a certain point. It is probably advisable to slow down to that point and then fill and freeze on the last frame.
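The combined strategy in the note (slow down to a limit, then freeze) can be expressed as a short calculation. The default limit of 0.5 and the use of words per minute as the reading-rate unit are assumptions for this sketch, not values taken from any player.

```python
def plan_playback(word_count, words_per_minute, audio_seconds, min_rate=0.5):
    """Return (playback_rate, freeze_seconds) for one caption.

    min_rate is an assumed lower bound below which audio becomes too distorted.
    """
    reading_seconds = word_count * 60.0 / words_per_minute
    if reading_seconds <= audio_seconds:
        return 1.0, 0.0  # the caption can be read at normal speed
    rate = audio_seconds / reading_seconds
    if rate >= min_rate:
        return rate, 0.0  # slow motion alone gives enough reading time
    # Slow down as far as allowed, then freeze on the last frame for the rest.
    slowed_seconds = audio_seconds / min_rate
    return min_rate, reading_seconds - slowed_seconds
```

With a reading rate of 120 words per minute, a 20-word caption needs 10 seconds. If the audio lasts 4 seconds, playing at half speed covers 8 seconds and the last frame is frozen for the remaining 2.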

4.2 scrolling and pop-on captions

Two styles of captions must be supported: scrolling and pop-on. Scrolling (or roll-up) captions are like those shown on TV news programs and other real-time events. The caption scrolls in a caption window, and the user may control how much of the caption is seen at one time. Pop-on captions are like those used on many pre-recorded TV programs. Each caption pops onto the screen at a specific time.
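The difference between the two styles can be illustrated with a minimal sketch. The function names and the cue format (begin, end, text) are hypothetical.

```python
from collections import deque


def roll_up(lines, visible_rows):
    """Return the window contents after each new line arrives (roll-up style).

    visible_rows is the user-controlled number of rows shown at one time.
    """
    window = deque(maxlen=visible_rows)  # the oldest row scrolls off automatically
    frames = []
    for line in lines:
        window.append(line)
        frames.append(list(window))
    return frames


def pop_on(cues, t):
    """Return the caption text visible at time t (pop-on style), or None.

    cues: list of (begin, end, text); a cue is visible while begin <= t < end.
    """
    for begin, end, text in cues:
        if begin <= t < end:
            return text
    return None
```

In the roll-up case the window always holds the most recent rows, while in the pop-on case exactly one caption (or none) is on screen at a given time.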

4.3 auto-fit captions

For pop-on captions, the caption window must have an auto-fit property that causes the window to expand when the caption doesn't fit into the allotted space. Such a feature is critical for low vision users who may apply a style sheet that increases font size. A scroll bar is not a workable solution to this problem because, with timed media, users would not have sufficient time to use it. In the event that the expanded caption window overlaps other screen elements, it should be possible to move the window or revert to the transcript option, with the captions in a separate HTML presentation.
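A rough sketch of the auto-fit calculation follows. Real players would measure text with the actual font; here the average character width (0.5 times the font size) and the line height (1.2 times the font size) are assumptions chosen only to make the arithmetic concrete.

```python
import math
import textwrap


def fit_caption_window(text, width_px, height_px, font_px, padding_px=4):
    """Return (wrapped_lines, window_height_px) for a pop-on caption.

    The window keeps its allotted height when the text fits and grows otherwise.
    Character width and line height are rough assumed ratios of the font size.
    """
    usable_width = width_px - 2 * padding_px
    chars_per_line = max(1, int(usable_width // (0.5 * font_px)))
    lines = textwrap.wrap(text, width=chars_per_line) or [""]
    needed = math.ceil(len(lines) * 1.2 * font_px) + 2 * padding_px
    return lines, max(height_px, needed)
```

Doubling the font size roughly halves the characters per line and doubles the line height, so the required height grows quickly; this is why a fixed-height window fails for users who enlarge text.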

4.4 caption window placement features

Options for placement of the caption window must be supported. At a minimum, a layering option and a static window option are needed. Layering of the caption window on top of the video must be possible for cases where captions are added to an existing application. The preferred and recommended style is to reserve a space under any timed visual media for a caption window. The ability to set coordinates for the caption window for each caption would permit movement of the caption window during the SMIL presentation. Variable placement of the caption window (and multiple caption windows) is supported on television today. The caption window should also have a margin or border feature so that text will not touch the sides of the window.
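The two minimum placement options and the margin requirement can be sketched as simple rectangle arithmetic. The Rect type, the mode names "below" and "overlay", and the default margin are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    left: int
    top: int
    width: int
    height: int


def caption_region(video, caption_height, mode="below", margin=8):
    """Return (caption_window, text_box) for a video rectangle.

    mode "below" reserves space under the video (the preferred style);
    mode "overlay" layers the window over the bottom of the video.
    The text box is inset by the margin so text never touches the window edges.
    """
    if mode == "below":
        top = video.top + video.height
    elif mode == "overlay":
        top = video.top + video.height - caption_height
    else:
        raise ValueError(f"unknown placement mode: {mode!r}")
    window = Rect(video.left, top, video.width, caption_height)
    text_box = Rect(window.left + margin, window.top + margin,
                    window.width - 2 * margin, window.height - 2 * margin)
    return window, text_box
```

Per-caption movement of the window would amount to calling such a function (or supplying explicit coordinates) for each caption rather than once per presentation.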


Copyright ©  1997 W3C (MIT, INRIA, Keio), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply. Your interactions with this site are in accordance with our public and Member privacy statements.