Many people will want the transcript or the script of a multimedia presentation. Transcripts of many TV and radio broadcasts can already be ordered today. The transcript can serve as the basis for captioning and description, although additional information is necessary. As noted above, captions for deaf and hard of hearing people must include information such as "phone rings" and "angry shouting". The traditional transcript concept can be expanded to meet these requirements.
It should be possible to include the transcript of a video in HTML. A caption authoring tool could be used to mark up portions of the transcript with "span" or "div" elements and associate them with specific times in the video or audio presentation through the SMIL file. Additional information can be placed in the transcript file and identified with a "class" attribute of "captioning" to distinguish it from the transcript proper. When the multimedia presentation is shown, those wanting to see captions would see both the timed transcript elements and those identified with the class attribute of "captioning".
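A minimal sketch of what this might look like; the ids, file names, and timing values are illustrative assumptions, not part of any specification:

```html
<!-- transcript.html: hypothetical transcript fragment.
     Each span carries an id so a SMIL file can time it;
     class="captioning" marks non-speech information. -->
<p>
  <span id="t1">I'll call the office right now.</span>
  <span id="t2" class="captioning">[phone rings]</span>
  <span id="t3">Hello? Yes, this is she.</span>
</p>
```

The SMIL file would then pair each identified element with the portion of the presentation it transcribes, along these lines:

```xml
<!-- presentation.smil (sketch): text elements shown in parallel
     with the video, each timed to the speech it transcribes. -->
<par>
  <video src="movie.mpg"/>
  <seq>
    <text src="transcript.html#t1" dur="4s"/>
    <text src="transcript.html#t2" dur="2s"/>
    <text src="transcript.html#t3" dur="3s"/>
  </seq>
</par>
```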
There may be different language translations of the transcript and the associated captioning information. The selection of the different languages would take place through the standard mechanisms for selecting different languages.
After experiencing a SMIL presentation, the person may want to go back and review the transcript. The person would "make visible" or "turn on" the transcription option. This would launch an HTML presentation of the transcript. The default presentation would show the text and scroll it as the video and audio are played. The implementer may elect to position the currently spoken element in the middle of the screen so the user would see the previous and next elements, giving the person the context of the transcript. The elements marked with the class attribute of "captioning" would be hidden through the application of a style sheet. Of course, the user could turn them on by applying a style that makes them visible. This is standard HTML browser behavior.
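Hiding and revealing the captioning information is ordinary style-sheet behavior; a sketch (the class name follows the convention described above, the colors are arbitrary):

```css
/* Default transcript view: non-speech captioning information
   is hidden; the speech text remains visible. */
.captioning { display: none; }
```

A user style sheet, or a player option that swaps style sheets, could override this to reveal it:

```css
/* User override: make captioning information visible
   and visually distinct from the spoken text. */
.captioning { display: inline; color: yellow; background: black; }
```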
This is very straightforward. There is a one-to-one relationship between the HTML and the spoken audio; the person uses the HTML for navigation and can launch the audio portion at any location in the HTML. Elements may be highlighted, and the person may elect for highlighting of individual words if that is available. This is being defined by the DAISY Consortium.
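The one-to-one text/audio relationship can be sketched in SMIL by pairing each transcript element with the clip of the audio file that speaks it; the file names and clip times here are illustrative assumptions:

```xml
<!-- Sketch: each <par> ties one transcript phrase to its audio clip.
     Activating the text from the HTML side launches audio there. -->
<seq>
  <par>
    <text  src="transcript.html#p1"/>
    <audio src="book.wav" clip-begin="npt=0s"   clip-end="npt=3.5s"/>
  </par>
  <par>
    <text  src="transcript.html#p2"/>
    <audio src="book.wav" clip-begin="npt=3.5s" clip-end="npt=7.2s"/>
  </par>
</seq>
```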
For persons who are blind or print disabled, it must be possible to hear alt text announced and then hear long descriptions, if available, read as well. This is more an issue of pausing and resuming. It should be possible to pause automatically when a new image appears. Descriptive video poses no problem; it is an audio component that parallels the video and is turned on at the option of the user. It is interesting to point out that in today's descriptive video, the description must fit into the time frame of the video. Using SMIL this could be done, but it is also possible to pause the video until the descriptive video portion is completed. This requires no modification to SMIL; it is a benefit of the specification.
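One way to sketch the pause-for-description behavior with existing SMIL features: the first video segment freezes on its last frame while a longer description plays, then a second segment resumes. The file names, clip times, and segment boundaries are assumptions for illustration:

```xml
<!-- Sketch: extended description that does not fit the video's
     time frame, achieved by freezing rather than changing SMIL. -->
<seq>
  <!-- First segment; fill="freeze" holds its last frame on screen -->
  <video src="movie.mpg" clip-begin="npt=0s" clip-end="npt=30s"
         fill="freeze"/>
  <!-- Description plays while the frame remains frozen -->
  <audio src="description1.wav"/>
  <!-- Video resumes where it left off -->
  <video src="movie.mpg" clip-begin="npt=30s"/>
</seq>
```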
It seems that a descriptor attribute of "captioning" is needed. The player would have different behaviors from standard HTML browsers. Each element addressed through an ID in the HTML transcription/captioning file would be displayed in a captioning channel, if made visible through a user option. All HTML features, including variable fonts, alignment, colors, images, and links, would be supported.
When the captioning option is turned on, there should be several options for presenting the captioning. Each element in the transcript and captioning file is associated with a duration in the audio component. There is a possible conflict between the duration of the audio and the time needed to read the captioning. These conflicts lead us to the following options.
The captioning player should have a variable reading rate control. This puts the text up at a rate set by the user. This implies that one of the following things happens to the audio and associated video: a) the audio and video fill and freeze on the last frame while the text is being scrolled, or b) the audio and video go into slow motion. NOTE: the audio distorts when slowed beyond a certain point. It is probably advisable to slow down to a point and then fill and freeze on the last frame.
Two styles of captions must be supported: scrolling and pop-on. Scrolling (or roll-up) captions are like those shown on TV news programs and other real-time events. The caption scrolls in a caption window. The user may control the amount of the caption seen at one time. Pop-on captions are like those used on many pre-recorded TV programs. Each caption pops onto the screen at a specific time.
Options for placement of the caption window must be supported. At a minimum, layering and static-window options are needed. Layering of the caption window on top of the video must be possible for cases where captions are added to an existing application. The preferred style and recommendation should be reserving a space under any timed visual media for a caption window. The ability to set coordinates for the caption window for each caption would permit movement of the caption window during the SMIL presentation. Variable placement of the caption window (and multiple caption windows) is supported on television today. The caption window should also have a margin or border feature so that text will not touch the sides of the window.
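A layout along these lines could reserve a static caption region below the video using SMIL's layout section; the region names, coordinates, and colors are illustrative assumptions (a layered alternative would instead position the caption region over the video and rely on z-index):

```xml
<!-- Sketch: static-window option, caption region reserved below
     the video rather than layered on top of it. -->
<smil>
  <head>
    <layout>
      <region id="video-area"   top="0"   left="0" width="320" height="240"/>
      <region id="caption-area" top="240" left="0" width="320" height="60"
              background-color="black"/>
    </layout>
  </head>
  <body>
    <par>
      <video src="movie.mpg" region="video-area"/>
      <text  src="captions.html#c1" region="caption-area"/>
    </par>
  </body>
</smil>
```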