From Web Media Text Tracks Community Group
An abstract data model of Captions
This document tries to capture all existing uses of captions and subtitles in order to understand which features we require from a timed text format, on the Web and more generally. The model should cover both existing linear broadcast uses and existing online uses, all on an international basis.
Note: The current descriptions are strongly oriented toward the US CEA-608 and CEA-708 caption models. Internationalisation requirements have been added. The document is not complete at this point in time, so please contribute, in particular if your experience differs from the descriptions provided here. Check your favorite caption/subtitle format and see if its capabilities are covered in this document.
1. Caption Cue
A caption is a piece of content that is related to an interval in a video's timeline and thus has a start and an end time. For purposes of generalisation (i.e. use for subtitles, descriptions, chapters, etc.) we also call such a timed piece of content a "cue". This document focuses only on the requirements of subtitle and caption cues.
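As a rough illustration of the cue concept, the following Python sketch models a cue as a piece of text with a start and an end time. The class name `Cue`, its field names, the use of seconds as the time unit, and the half-open interval in `is_active` are assumptions made for this example, not something the model prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cue:
    """A timed piece of content (hypothetical structure for illustration)."""
    start: float  # start time on the video timeline, in seconds (assumed unit)
    end: float    # end time on the video timeline, in seconds (assumed unit)
    text: str     # cue content; this document only considers text

    def __post_init__(self):
        if self.end < self.start:
            raise ValueError("cue end time must not precede its start time")

    def is_active(self, t: float) -> bool:
        # Assumption: the cue is shown on the half-open interval [start, end).
        return self.start <= t < self.end
```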
2. Cue Content
Caption and subtitle cues consist of text or raster images of text. They also sometimes contain graphical items such as icons or logos. For our purposes here we will assume we only deal with text content in cues.
Cue text is provided as either horizontally or vertically rendered text. The text is internationalized, so it can be displayed right-to-left or left-to-right.
NOTE: It is unclear whether rtl text when rendered horizontally should grow from the bottom as the picture shows. Any i18n experts around?
REPLY from i18n expert: rtl always grows top to bottom, never bottom to top
3. Cue Text Lines and Block
A cue's content can consist of one or more lines of text, typically no more than four lines.
For horizontal text, several lines of text are displayed as a block with lines added below or above the first line.
For vertical text, several lines of text are displayed as a block with lines added either to the left or to the right of the first line.
The images under section 2 show cue text block examples with their lines in yellow. Note that for readability reasons the boxes have been made the same size, when in fact they should just be the bounding box around the text lines.
4. Caption Text Block Display
A caption cue's content may be displayed all in one go (that is called "pop-on captions") or as successive characters, words or lines of content (successive characters or words is called "paint-on captions" and successive lines of content is called "roll-up captions").
"Pop-on" captions are displayed in their determined location and typically do not move again before their end time, at which point they are removed from display.
When characters are displayed successively ("painted on"), characters that are already displayed should not change screen position while further characters are painted on within the same line. This means that the character positions for the whole line have to be calculated before any characters are drawn. The same applies to words.
When successive lines of content are drawn ("roll-up captions"), a newly added line may be added in the position of an already presented line and push that line up (or down). This is typically used for live captioning.
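The paint-on and roll-up behaviours described in this section can be sketched roughly as follows. This is an illustration only: it assumes a fixed-width character grid with centred text for paint-on, and a fixed maximum number of visible rows for roll-up where the oldest row is dropped once the window is full. The function names `paint_on` and `roll_up` and their parameters are hypothetical.

```python
from collections import deque


def paint_on(line, columns=32):
    """Yield successive frames of a line painted on character by character."""
    # The start column is computed from the complete line before drawing,
    # so characters that are already visible never change position.
    start_col = (columns - len(line)) // 2
    for i in range(1, len(line) + 1):
        yield " " * start_col + line[:i]


def roll_up(lines, max_rows=3):
    """Yield the visible rows after each newly added line."""
    window = deque(maxlen=max_rows)
    for line in lines:
        # The new line takes the bottom position and pushes earlier lines up;
        # dropping the oldest row when full is an assumption of this sketch.
        window.append(line)
        yield list(window)
```

A roll-up display with two rows would, under these assumptions, show `["a"]`, then `["a", "b"]`, then `["b", "c"]` as lines arrive.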
5. Caption Rendering
Captions are designed to be rendered on top of a video's visual display, also called the "video viewport". Their positioning is by default given in relation to a video's width and height.
We call the vertical positioning "line position", since it can be given in relation to the number of lines of text from the top of the video viewport or in percent of the video viewport's height.
We call the horizontal positioning "text position". For rtl-rendered text, horizontal positioning is calculated from the right edge of the video.
For vertically rendered text, the vertical and horizontal directions are exchanged. Thus, line positions are calculated from the right edge of the video (when text grows rtl) or the left edge of the video (when text grows ltr). Text positions are calculated from the top (for ltr-rendered text) or bottom (for rtl-rendered text).
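For horizontally rendered text, the mapping from line position and text position to pixel offsets might look like the sketch below. It assumes that text position is given in percent of the viewport width (the text above does not state a unit), that all lines have the same height, and that the function and parameter names are made up for this example.

```python
def line_position_to_y(viewport_height, line_height, *, line=None, percent=None):
    """Return the vertical offset from the top of the video viewport in pixels."""
    if (line is None) == (percent is None):
        raise ValueError("give exactly one of 'line' or 'percent'")
    if line is not None:
        # Position counted in lines of text from the top of the viewport.
        return line * line_height
    # Position given in percent of the viewport height.
    return viewport_height * percent / 100


def text_position_to_x(viewport_width, percent, direction="ltr"):
    """Return the horizontal offset from the left edge of the viewport in pixels."""
    offset = viewport_width * percent / 100
    # For rtl-rendered text the position is measured from the right edge.
    return offset if direction == "ltr" else viewport_width - offset
```

For vertically rendered text the same calculation would apply with the two axes exchanged.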
6. Caption Rendering Box
Captions' rendering areas are given by default as a one line "high" rendering box with the "width" calculated from the length of the text to be rendered in it. If the text is vertical, then the rendering box's "height" is calculated from the length of the text to be rendered in it and it is one line "wide". The height will be increased when the caption text contains multiple lines.
Section 5 has an example of a caption rendering box.
7. Fixed line length of Rendering Box
To allow specifying a more restricted rendering area and fixed alignment positions, the line length of a Rendering Box may be fixed in width (in percent of the video viewport's width) for horizontally rendered text, or fixed height (in percent of the video viewport's height) for vertically rendered text. Text that does not fit within the line length is wrapped.
Wrapping can also be caused by the viewer, e.g. by changing the font size or the video viewport size.
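A minimal sketch of wrapping inside a fixed line length, for horizontal text, is given below. It assumes a fixed-width font so that the line length in percent can be converted to a character count, and it uses Python's standard `textwrap` module for the word wrapping. The function name `wrap_cue` and its parameters are hypothetical.

```python
import textwrap


def wrap_cue(text, viewport_width_px, box_percent, char_width_px):
    """Wrap cue text into a box whose width is a percentage of the viewport."""
    box_px = viewport_width_px * box_percent / 100
    # Assumption: every character is char_width_px wide (fixed-width font).
    max_chars = max(1, int(box_px // char_width_px))
    return textwrap.wrap(text, width=max_chars)


# Same box, but the viewer doubles the font size: more lines are needed.
normal = wrap_cue("The quick brown fox jumps", 640, 25, 10)
enlarged = wrap_cue("The quick brown fox jumps", 640, 25, 20)
```

With these inputs the text fits on two lines at the normal size and needs five lines once the character width is doubled, which is the viewer-caused wrapping described above.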
8. Text alignment
Caption text lines in the multi-line or line-wrapped case may be aligned within the caption rendering box to the "start", "middle" or "end" of the line. "start" means "left" for left-to-right text, "right" for right-to-left text, and "top" for vertically rendered text. "end" has the corresponding opposite meanings. In addition, explicit "left" and "right" alignment is useful when a specific alignment of the text is required regardless of whether it is right-to-left or left-to-right text.
For vertical text, "start" alignment maps to "top" alignment, "middle" is centered between top and bottom, and "end" alignment maps to "bottom".
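The alignment rules above can be summarised in a small lookup, sketched here in Python. The function name `resolve_alignment` and its parameters are assumptions for this example, and rejecting "left" and "right" for vertical text is a choice of this sketch rather than something the text specifies.

```python
def resolve_alignment(align, direction="ltr", vertical=False):
    """Map a logical alignment keyword to a physical edge of the rendering box."""
    if align not in {"start", "middle", "end", "left", "right"}:
        raise ValueError(f"unknown alignment: {align!r}")
    if vertical:
        table = {"start": "top", "middle": "middle", "end": "bottom"}
        if align not in table:
            # Assumption of this sketch: left/right are undefined for vertical text.
            raise ValueError("'left' and 'right' are not handled for vertical text")
        return table[align]
    if align in ("left", "right", "middle"):
        # Explicit alignments apply regardless of text direction.
        return align
    if align == "start":
        return "left" if direction == "ltr" else "right"
    return "right" if direction == "ltr" else "left"
```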
9. Anchoring of the Rendering Box
Any point inside a rendering box may be anchored on the video viewport, such that an added line of text will cause the box to grow equally from that anchor point. For example, in CEA-708, nine anchor points are used: the four corners, the four half-way marks along the sides of the rendering box, and the center of the box. In theory, however, any point can be used.
The growing direction of the Anchored Rendering Box depends on how the anchor point is specified from top and left. For example, a box anchored with a top of 0 and left of 50% offset will grow towards the bottom for horizontal text and equally towards the left and right for vertical text.
The caption content is displayed into this box either as a block of text in one go ("pop-on captions") or line-by-line (for successively added content). This means that when a new line of text is displayed, the previously rendered line(s) of text in the cue may change its place depending on where the rendering box is anchored.
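One way to picture anchoring is to express the anchor point inside the box as fractions of the box's width and height and to keep that point fixed on the viewport while the box grows. The sketch below does this for a box measured in pixels. The function name `box_origin`, the fractional representation of the anchor, and the pixel units are assumptions of this example.

```python
def box_origin(anchor_xy, box_size, box_anchor):
    """Return the top-left corner of a rendering box pinned at an anchor point.

    anchor_xy:  (x, y) of the anchor on the video viewport, in pixels
    box_size:   (width, height) of the rendering box, in pixels
    box_anchor: (ax, ay) anchor inside the box as fractions in [0, 1],
                e.g. (0.5, 1.0) for the middle of the bottom edge
    """
    x, y = anchor_xy
    w, h = box_size
    ax, ay = box_anchor
    return (x - ax * w, y - ay * h)


# A box anchored at the middle of its bottom edge grows upwards:
one_line = box_origin((320, 300), (200, 20), (0.5, 1.0))   # (220.0, 280.0)
two_lines = box_origin((320, 300), (200, 40), (0.5, 1.0))  # (220.0, 260.0)
```

When the second line is added, the top of the box moves from 280 to 260 while the bottom edge stays at 300, so the previously rendered line changes its place as described above.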
10. Temporally overlapping captions
Temporally overlapping caption cues (i.e. cues whose timeline intervals overlap) may be rendered into a new rendering box or added to an existing rendering box. A new rendering box is created when new rendering locations (line/text position) are provided; otherwise the text is added to an existing rendering box. In the latter case the text lines are treated like added lines of content and thus influence the size of the rendering box and will likely change the positioning of the lines of text that had previously been rendered into this box. The latter case is therefore no different from roll-up captions.
[Q: do we want to allow new lines of overlapping captions to be added in a different growing direction to how the text within a caption grows? For example, horizontal captions with multiple lines of text are rendered top-to-bottom. An overlapping caption may be added below or above that text. This could result in very confusing displays if multi-line cues and overlapping cues don't render the same way. Maybe we should instead allow defining only the direction in which new lines are added to the display, see point 4?]
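The box selection rule for overlapping cues could be sketched as follows. The text does not say which existing box receives a cue that has no position of its own, so this example assumes the most recently created box; it also assumes that a first cue without a position opens a default box. The function name `assign_boxes` and the tuple layout are hypothetical.

```python
def assign_boxes(active_cues):
    """Group simultaneously active cues into rendering boxes.

    active_cues: iterable of (text, position) pairs in order of start time,
                 where position is a (line, text) position tuple or None.
    """
    boxes = []
    for text, position in active_cues:
        if position is not None or not boxes:
            # A new rendering location opens a new rendering box.
            boxes.append({"position": position, "lines": [text]})
        else:
            # Otherwise the text is appended like an added line of content
            # (assumption: it goes into the most recently created box).
            boxes[-1]["lines"].append(text)
    return boxes
```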