RollupCaptions

From Web Media Text Tracks Community Group

Rollup Captions

The User Need

Traditional TV captions have three different modes (see: § 15.119 in Code of Federal Regulations for Closed caption decoder requirements for analog TV receivers):

  • pop-on
  • paint-on
  • roll-up

WebVTT as currently specified supports pop-on captions (i.e. a cue's text content is fully rendered in one go and doesn't move until its end) and paint-on captions (i.e. a cue's text content is successively rendered but once rendered doesn't move until its end). However, there is no simple way to do roll-up captions.


What is roll-up used for?

Roll-up captions are typically used for live captioned content. There are a limited number of lines of text displayed (2, 3 or 4) and every time a new line is added, the previous ones are moved up until they disappear from the caption window.

Users are generally accustomed to live captions being of lower quality than pop-on captions. Typos and misspellings of misunderstood words are common. But since it's clear that these are live captions, users know to tolerate this lower quality.

Roll-up captions are also used for news broadcasts and other heavily speech-focused programs. With roll-up, captions are displayed a word at a time and continuously grow, which matches what speakers do. If synchronized exactly, users can immediately read the word that is being spoken without a lag and thus keep much better track of what is being said. This also helps for language learning as well as for deaf people trying to follow a conversation.

In addition, since lines are kept on screen longer than for typical pop-on captions, the reader has more time to capture the conversation, in particular if a real-time captioner has made a mistake and provides a correction in the next line.


Existing content

A lot of content exists with roll-up captions and new content is continually being created - sometimes even for non-live content. A first use case for supporting roll-up captions in WebVTT is therefore to support old content as it is being rendered on the Web with the exact same rendering as previously on TV (which is a requirement that many content owners put on content publishers).


Automated Speech Recognition

A second use case for roll-up captions on the Web is captions that are created using automated speech recognition. It's important that users can easily distinguish whether captions were created with great care and are supposed to be of high quality, or were produced under some restrictions and are thus best effort. Users are more tolerant of mistakes if they are aware that the captions are best effort.

Display example: http://www.youtube.com/watch?v=iY-_tO_L3VQ#t=2842s

User need

There have been arguments that roll-up captions are generally not as high in quality as pop-on captions and that users should therefore be protected from such content by not supporting this feature at all.

It is indeed true that roll-up captions are of lesser quality than pop-on captions, since they are typically created either via live captioning or with ASR. Studies have shown that aside from typos and wrong words, the main objection to scrolling subtitles is the delay with which the live subtitles are presented (see report). Other complaints refer to not being able to see the speakers’ faces to lip read what they say, and excessive editing. Finally, tests also showed that roll-up subtitles require more eye fixations on the text than block subtitles.

Despite this lower quality, studies surprisingly also found that users are split on how they want live subtitles to be displayed: half of them prefer the roll-up display and half prefer pop-on. Therefore, there is a user requirement to continue supporting roll-up caption modes.

Users should at least have the opportunity to state a preference as to how they want their captions displayed. Such a preference setting is currently not possible with WebVTT, which never moves cue text, but instead places new cue text lines either on top of already rendered text lines or in a line below if that line has become empty.


Author need

Publishers have voiced a need to present time-overlapping cues in roll-up mode as well as with the current rendering approach, where cue text is placed above or below existing rendered text wherever there is space. Therefore, we need a way for authors to choose how time-overlapping cues are rendered. This cannot just be a user preference.

Technical Requirements

Traditionally, rollup is a sequence of text lines that are painted to the screen successively. Every new line is added in the position of the current bottom-most line and pushes all the other lines up by one, with the topmost line disappearing once the maximum line count is exceeded. The words within each line may appear successively, too.

Perceptually, rollup captions are individual cues, each on a single line (possibly with paint-on text on the line) and each having a start time and a display duration during which it can be moved up one line at a time from its initial rendering position.
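
As an illustration only, the traditional behaviour described above can be modelled with a bounded queue. The Python sketch below is not part of any proposal; the class name RollupWindow and its methods are hypothetical, and word-by-word paint-on within a line is left out.

```python
from collections import deque


class RollupWindow:
    """Toy model of a roll-up caption window with a fixed line count."""

    def __init__(self, max_lines: int = 3) -> None:
        # A bounded deque drops the oldest (topmost) line automatically
        # once a new line would exceed the maximum line count.
        self.lines = deque(maxlen=max_lines)

    def add_line(self, text: str) -> None:
        """Add a new bottom line, pushing the existing lines up by one."""
        self.lines.append(text)

    def render(self) -> list[str]:
        """Return the visible lines from top to bottom."""
        return list(self.lines)


window = RollupWindow(max_lines=3)
for line in [
    "WHEN I GET A SICK BIRD,",
    "THAT JUST STOPS EVERYTHING",
    "FROM MOVING FROM MY PLACE",
    "TO ANYWHERE ELSE.",
]:
    window.add_line(line)
    print(window.render())
```

Note that this model removes a line only when a newer line pushes it out; it does not model per-line display durations.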

Rollup Examples on YouTube with CSS rendering:

Rollup Examples from TV:

Rollup in use for Karaoke:

Assessment criteria for a specification:

  • Copying of text between cues is not acceptable, since it will confuse any automation system, including screenreaders and search engines.
  • Preferably the text lines will always start at the same position even when moved up.
  • Changes to the existing WebVTT format should be minimal and backwards compatible.
  • Preferably it should be possible to just change the number of lines to display and cues will stay on longer.
  • The transition of text from one line to the next should preferably be style-able with CSS transitions. For example, one could attach a "transition: all 3s ease-in;" property to the cue, such that any CSS changes (including a change in bottom/top/left/right positioning) would be executed in a smooth 3s long transition.

Existing Formats

Using the first sentence of this YouTube video as an example: http://www.youtube.com/watch?v=bgG1CsETodg

1. YouTube proprietary XML format

The YouTube markup contains the following:

<timedtext>
  <window t="0" id="1" op="define" rc="15" cc="32" ap="7" ah="50" av="95"/>
  <window id="2" t="5671" op="define" ap="7" ah="50" av="95" rc="4" cc="32" sd="1" ju="0"/>
  <text w="2" t="5738" append="1"></text>
  <text w="2" t="5938" d="1268">WH</text>
  <text w="2" t="6005" d="1201" append="1">EN</text>
  <text w="2" t="6071" d="1135" append="1">I</text>
  <text w="2" t="6138" d="1068" append="1">G</text>
  <text w="2" t="6205" d="1001" append="1">ET</text>
  <text w="2" t="6272" d="934" append="1">A</text>
  <text w="2" t="6338" d="868" append="1">S</text>
  <text w="2" t="6405" d="801" append="1">IC</text>
  <text w="2" t="6472" d="734" append="1">K</text>
  <text w="2" t="6506" d="700" append="1">BI</text>
  <text w="2" t="6572" d="634" append="1">RD</text>
  <text w="2" t="6639" d="567" append="1">,</text>
  <text w="2" t="7140" append="1"></text>
  <text w="2" t="7206" append="1"></text>
  <text w="2" t="7407" d="1801">TH</text>
  <text w="2" t="7473" d="1735" append="1">AT</text>
  <text w="2" t="7540" d="1668" append="1">J</text>
  <text w="2" t="7607" d="1601" append="1">US</text>
  <text w="2" t="7640" d="1568" append="1">T</text>
  <text w="2" t="7706" d="1502" append="1">ST</text>
  <text w="2" t="7773" d="1435" append="1">OP</text>
  <text w="2" t="7840" d="1368" append="1">S</text>
  <text w="2" t="7906" d="1302" append="1">EV</text>
  <text w="2" t="7973" d="1235" append="1">ER</text>
  <text w="2" t="8040" d="1168" append="1">YT</text>
  <text w="2" t="8107" d="1101" append="1">HI</text>
  <text w="2" t="8173" d="1035" append="1">NG</text>
  <text w="2" t="9141" append="1"></text>
  <text w="2" t="9208" append="1"></text>
  <text w="2" t="9408" d="1002">FR</text>
  <text w="2" t="9475" d="935" append="1">OM</text>
  <text w="2" t="9509" d="901" append="1">M</text>
  <text w="2" t="9575" d="835" append="1">OV</text>
  <text w="2" t="9642" d="768" append="1">IN</text>
  <text w="2" t="9709" d="701" append="1">G</text>
  <text w="2" t="9775" d="635" append="1">FR</text>
  <text w="2" t="9842" d="568" append="1">OM</text>
  <text w="2" t="9909" d="501" append="1">M</text>
  <text w="2" t="9976" d="434" append="1">Y</text>
  <text w="2" t="10042" d="368" append="1">PL</text>
  <text w="2" t="10109" d="301" append="1">AC</text>
  <text w="2" t="10176" d="234" append="1">E</text>
  <text w="2" t="10343" append="1"></text>
  <text w="2" t="10410" append="1"></text>
  <text w="2" t="10610" d="1100">TO</text>
  <text w="2" t="10677" d="1033" append="1">A</text>
  <text w="2" t="10743" d="967" append="1">NY</text>
  <text w="2" t="10810" d="900" append="1">WH</text>
  <text w="2" t="10877" d="833" append="1">ER</text>
  <text w="2" t="10944" d="766" append="1">E</text>
  <text w="2" t="11010" d="700" append="1">EL</text>
  <text w="2" t="11077" d="633" append="1">SE</text>
  <text w="2" t="11144" d="566" append="1">.</text>
  ...
</timedtext>

The attributes on the <text> element provide the following information:

  • w = reference to the id of a <window> element into which the text is rendered
  • t = the time stamp at which to start displaying it
  • d = the duration for which it is displayed
  • append = whether the text is to be appended to the existing text or starts a new line

The <window> element defines an area on the video into which text blocks are rendered.
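
To make the role of the append attribute concrete, here is a rough Python sketch that groups <text> fragments into lines. It is an assumption about how the format could be read, based only on the attribute descriptions above, not a description of YouTube's own player; the function name assemble_lines and the shortened sample are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened sample using a few elements from the markup above.
SAMPLE = """<timedtext>
  <window id="2" t="5671" op="define" ap="7" ah="50" av="95" rc="4" cc="32" sd="1" ju="0"/>
  <text w="2" t="5938" d="1268">WH</text>
  <text w="2" t="6005" d="1201" append="1">EN</text>
  <text w="2" t="7140" append="1"></text>
  <text w="2" t="7407" d="1801">TH</text>
  <text w="2" t="7473" d="1735" append="1">AT</text>
</timedtext>"""


def assemble_lines(xml_source: str) -> list[dict]:
    """Group <text> fragments into lines, using the append attribute."""
    lines: list[dict] = []
    for node in ET.fromstring(xml_source).iter("text"):
        fragment = node.text or ""
        if node.get("append") == "1" and lines:
            # Appended fragments extend the line that is currently open.
            # Empty appended fragments add nothing (an assumption).
            lines[-1]["text"] += fragment
        else:
            # A fragment without append starts a new line in its window.
            lines.append(
                {"window": node.get("w"), "start_ms": int(node.get("t")), "text": fragment}
            )
    return lines


for line in assemble_lines(SAMPLE):
    print(line)
```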


2. SCC format

Scenarist Closed Captions are the common way in which CEA608 captions are encoded into files. There is a binary format and a "disassembly" format that contains human-readable characters. We'll look at the disassembly format here. To make it simpler, we're displaying one line at a time rather than in character blocks.

SCC_disassembly V1.2
CHANNEL 1

00:00:05:94     {RU3}{CR}{1504}WHEN I GET A SICK BIRD,
00:00:07:04     {RU3}{CR}{1504}THAT JUST STOPS EVERYTHING
00:00:09:41     {RU3}{CR}{1504}FROM MOVING FROM MY PLACE_
00:00:10:61     {RU3}{CR}{1504}TO ANYWHERE ELSE.
...

In SCC, each pair of bytes (i.e. characters) occupies one video frame's worth of time, which makes the text rendering incremental within a line.

The different control characters mean the following:

  • {RU3} = roll-up with three lines visible in rows 13 through 15 (note that {RU2} is for rows 14-15, and {RU4} for rows 12-15)
  • {CR} = introduces a newline that scrolls existing text up (deleting the top row once it scrolls above the line limit)
  • {1504} = addresses row 15, column 04, with plain white text
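
As a small illustration, one line of the simplified disassembly shown above could be split into its timecode, control codes and text roughly as follows. This Python sketch only covers the line-at-a-time layout used in this example, not the real SCC format; the function name and the returned field names are hypothetical.

```python
import re

LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}:\d{2})\s+(.*)$")
CODE_RE = re.compile(r"\{([^}]+)\}")


def parse_disassembly_line(line: str) -> dict:
    """Split one disassembly line into timecode, control codes and text."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a caption line: {line!r}")
    timecode, payload = match.groups()
    codes = CODE_RE.findall(payload)
    # {RU2}, {RU3} and {RU4} select roll-up with that many visible rows.
    rollup_rows = next(
        (int(code[2:]) for code in codes if re.fullmatch(r"RU[234]", code)), None
    )
    return {
        "timecode": timecode,
        "codes": codes,
        "rollup_rows": rollup_rows,
        "scrolls": "CR" in codes,  # {CR} scrolls the existing text up
        "text": CODE_RE.sub("", payload),
    }


parsed = parse_disassembly_line(
    "00:00:05:94     {RU3}{CR}{1504}WHEN I GET A SICK BIRD,"
)
print(parsed)
```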


Proposed Specifications

1. One big cue

Proposal:

The caption text for the whole video is regarded as one big cue and the lines are similarly timed with WebVTT cue timestamps.

Example:

  WEBVTT
 
  00:00:05.940 --> xxxx rollup:3
  WHEN I GET A SICK BIRD,
  <00:00:07.040> THAT JUST STOPS EVERYTHING
  <00:00:09.410> FROM MOVING FROM MY PLACE
  <00:00:10.610> TO ANYWHERE ELSE.
  ...

Discussion:

  • (-) It is unclear what end time would be appropriate for the cue, in particular on a live stream.
  • (-) It needs an extra cue setting specifying how many lines of rollup to display.
  • (-) The rollup display would also be different from the normal way of dealing with cue timestamps, which essentially just reveal text/text lines rather than reposition them.
  • (-) The whole cue needs to be stored in memory after parsing the cue.
  • (-) It would need special rules to encapsulate into media files in a time-sliced manner.
  • (+) It's very simple to do and would work well with live data, except for the end time problem.
  • (+) CSS transition could be applied to the full cue.
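
To illustrate what a renderer would have to do with such a cue, here is a minimal Python sketch that splits the payload into timed lines and picks the lines visible at a given time. It assumes that every line after the first starts with a cue timestamp, as in the example, and that rollup:3 means "show the last three started lines"; the function names are hypothetical.

```python
import re

TIMESTAMP_RE = re.compile(r"^<(\d{2}):(\d{2}):(\d{2})\.(\d{3})>\s*")

PAYLOAD = """\
WHEN I GET A SICK BIRD,
<00:00:07.040> THAT JUST STOPS EVERYTHING
<00:00:09.410> FROM MOVING FROM MY PLACE
<00:00:10.610> TO ANYWHERE ELSE.
"""


def to_millis(hours: str, minutes: str, seconds: str, millis: str) -> int:
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)


def timed_lines(cue_start_ms: int, payload: str) -> list[tuple[int, str]]:
    """Return (start time in ms, text) for every line of the big cue."""
    result = []
    for line in payload.splitlines():
        line = line.strip()
        match = TIMESTAMP_RE.match(line)
        if match:
            # A leading cue timestamp gives the time this line appears.
            result.append((to_millis(*match.groups()), line[match.end():]))
        elif line:
            # A line without a timestamp appears at the start of the cue.
            result.append((cue_start_ms, line))
    return result


def visible_at(lines: list[tuple[int, str]], now_ms: int, rollup: int = 3) -> list[str]:
    """Lines on screen at now_ms: the last `rollup` lines already started."""
    started = [text for start, text in lines if start <= now_ms]
    return started[-rollup:]


LINES = timed_lines(5940, PAYLOAD)
print(visible_at(LINES, 10_700))
```

The sketch also shows the memory concern from the discussion: the whole list of lines has to be kept for as long as the cue is active.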

2. No change

Proposal:

We decide not to introduce any new features for rollup, but make do with existing markup. This means that people have to copy a line when it has to move.

Example:

  WEBVTT
 
  00:00:05.940 --> 00:00:07.040
  WHEN I GET A SICK BIRD,

  00:00:07.040 --> 00:00:09.410
  WHEN I GET A SICK BIRD,
  THAT JUST STOPS EVERYTHING

  00:00:09.410 --> 00:00:10.610
  WHEN I GET A SICK BIRD,
  THAT JUST STOPS EVERYTHING
  FROM MOVING FROM MY PLACE

  00:00:10.610 --> 00:00:11.700
  THAT JUST STOPS EVERYTHING
  FROM MOVING FROM MY PLACE
  TO ANYWHERE ELSE.

  ...


Discussion:

  • (-) There is unnecessary duplication of the text requiring extra work by authoring tools, conversion tools, search engines, text analysis tools, the rendering engine, and screen readers and braille readers.
  • (-) All the markup and cue settings on the text has to be repeated in every single cue, too.
  • (-) introduces cue-to-cue dependency that is hard to handle in particular when seeking or streaming.
  • (-) Makes file 3 times as big as necessary (with a 3-line rollup).
  • (-) It's not possible to address with a single CSS statement all the identical lines.
  • (-) Transitions cannot be done in CSS using pseudo-selectors for the cues, since the cue text doesn't change, but a whole new cue is produced.
  • (-) If every line has different cue settings, we have even more duplication.
  • (+) No cue-to-cue dependency.
  • (+) The number of lines of rollup is flexible.
  • (+) It's very simple to do and would work without additional features.
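
The duplication can be made visible with a small conversion sketch. The Python below is only an illustration of what an authoring or conversion tool would have to do under this proposal; the function names are hypothetical, and the final end time is passed in because the source lines carry no end time of their own.

```python
def format_time(ms: int) -> str:
    """Format milliseconds as a WebVTT timestamp (hh:mm:ss.ttt)."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


def copied_cues(lines: list[tuple[int, str]], end_ms: int, rollup: int = 3) -> str:
    """Emit one cue per new line, repeating the lines still on screen."""
    blocks = ["WEBVTT"]
    for index, (start, _text) in enumerate(lines):
        # Each cue lasts until the next line arrives (or the given end).
        stop = lines[index + 1][0] if index + 1 < len(lines) else end_ms
        first = max(0, index - rollup + 1)
        visible = [text for _start, text in lines[first:index + 1]]
        timing = f"{format_time(start)} --> {format_time(stop)}"
        blocks.append(timing + "\n" + "\n".join(visible))
    return "\n\n".join(blocks) + "\n"


LINES = [
    (5940, "WHEN I GET A SICK BIRD,"),
    (7040, "THAT JUST STOPS EVERYTHING"),
    (9410, "FROM MOVING FROM MY PLACE"),
    (10610, "TO ANYWHERE ELSE."),
]
output = copied_cues(LINES, end_ms=11_700)
print(output)
```

In a longer file, every line ends up in as many cues as there are rollup lines, which is where the size estimate in the discussion comes from.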

Improvement:

If we put identifiers on lines, then we can identify repetitions and introduce smooth scrolling in reaction to duplication. However, that's quite a substantial amount of calculation to expect.

Example:

  WEBVTT
 
  00:00:05.940 --> 00:00:07.040
  <cue-id id=1>WHEN I GET A SICK BIRD,

  00:00:07.040 --> 00:00:09.410
  <cue-id id=1>WHEN I GET A SICK BIRD,
  <cue-id id=2>THAT JUST STOPS EVERYTHING

  00:00:09.410 --> 00:00:10.610
  <cue-id id=1>WHEN I GET A SICK BIRD,
  <cue-id id=2>THAT JUST STOPS EVERYTHING
  <cue-id id=3>FROM MOVING FROM MY PLACE

  00:00:10.610 --> 00:00:11.700
  <cue-id id=2>THAT JUST STOPS EVERYTHING
  <cue-id id=3>FROM MOVING FROM MY PLACE
  <cue-id id=4>TO ANYWHERE ELSE.

  ...

CSS could be:

::cue(cue-id#1) {
 top: 85%;
 transition: top .2s linear;
}

::cue(cue-id#1):repeated(1) {
 top: 80%;
 transition: top 0.2s linear;
}

::cue(cue-id#1):repeated(2) {
 top: 75%;
 transition: top 0.2s linear;
}

Discussion:

  • (+) it is possible to identify the same line across multiple cues
  • (-) still has basically all the problems of the previous discussion

3. Cue classes

Analysis:

The problem is that we don't currently represent the concept of cue text rendering boxes that persist over time. These are called "windows" in CEA608/708; they represent a specific anchoring position and rendering direction, and the actual cue text is added to them.

Proposal:

By grouping cues together as a "continuation", we can identify them as belonging together to be rendered into the same rendering box. This can be achieved by introducing a "class" attribute on cues. Since CSS already has classes as a grouping mechanism for different rendering areas on the page, this extends the concept to the time dimension as well.

Example:

  WEBVTT
 
  1.rollupBox
  00:00:05.940 --> 00:00:10.610
  WHEN I GET A SICK BIRD,

  2.rollupBox
  00:00:07.040 --> 00:00:11.700
  THAT JUST STOPS EVERYTHING

  3.rollupBox
  00:00:09.410 --> 00:00:12.910
  FROM MOVING FROM MY PLACE

  4.rollupBox
  00:00:10.610 --> 00:00:14.100
  TO ANYWHERE ELSE.

  ...

CSS could look like:

::cue(.rollupBox) {
  transition-property: position;  
  transition-duration: 1.5s;
  transition-timing-function: linear;
}

The text that is added to a cue of the same class is added below (which is the normal scrolling behaviour of text).

Discussion:

  • (+) text is not repeated
  • (+) markup of text is not repeated
  • (+) text can be incrementally supplied (e.g. in a live scenario)
  • (+) we can address the cues that belong into the same rendering box through one CSS statement: e.g. ::cue(.rollupBox)
  • (+) only a cue with a new class (or no class) creates a new rendering area ("div")
  • (+) the rendering continuation between cues can be upheld even when there are several other cues in the middle that don't belong to the same continuation
  • (+) the rendering continuation between cues can be upheld even if the rendering area's cue settings change (e.g. if the rollup has to move from the bottom to the top of the viewport because there is some burnt-in text visible at the bottom of the screen that would be obstructed by the caption text)
  • (-) the number of rollup lines cannot be changed easily because every single rollup cue has a fixed specification of its end time
  • (-) a cue can only ever belong to a single rollup rendering box, which is different to how classes are normally handled in CSS
  • (+) it's simple
  • (+) in clients that don't support it, it just degrades to a non-rollup display mode.
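
As a rough sketch of how a renderer might use the class, the Python below collects the cues that share a class and are active at a given time, in the order in which they would be stacked in the rendering box. The "id.class" identifier syntax is taken from the example above; the Cue type and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cue:
    identifier: str  # e.g. "1.rollupBox" in the proposed syntax
    start_ms: int
    end_ms: int
    text: str

    @property
    def cue_class(self) -> Optional[str]:
        """Class part after the first '.', or None if there is no class."""
        _cue_id, dot, cls = self.identifier.partition(".")
        return cls if dot else None


def box_contents(cues: list[Cue], cue_class: str, now_ms: int) -> list[str]:
    """Texts of the active cues of one class, oldest (topmost) first."""
    active = [
        cue for cue in cues
        if cue.cue_class == cue_class and cue.start_ms <= now_ms < cue.end_ms
    ]
    return [cue.text for cue in sorted(active, key=lambda cue: cue.start_ms)]


CUES = [
    Cue("1.rollupBox", 5940, 10610, "WHEN I GET A SICK BIRD,"),
    Cue("2.rollupBox", 7040, 11700, "THAT JUST STOPS EVERYTHING"),
    Cue("3.rollupBox", 9410, 12910, "FROM MOVING FROM MY PLACE"),
    Cue("4.rollupBox", 10610, 14100, "TO ANYWHERE ELSE."),
]
print(box_contents(CUES, "rollupBox", 9500))
print(box_contents(CUES, "rollupBox", 10700))
```

Because each cue carries its own end time, the number of visible lines follows from the timing alone, which is the limitation noted in the discussion.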


Extension possibility:

If we wanted to allow both scrolling up and scrolling down (e.g. on a rendering box that is top-aligned on the browser and the text grows down), we could introduce a "move" attribute which includes the scrolling direction of a cue, e.g. move:up or move:down. This then implies where the next cue is allowed to push this cue to.

4. Cue setting: grouping cues

Proposal:

Similar to the "cue classes" proposal, we could introduce another cue setting as a way to group cues together.


Example:

  WEBVTT
 
  00:00:05.940 --> 00:00:10.610 rollup:box1
  WHEN I GET A SICK BIRD,

  00:00:07.040 --> 00:00:11.700 rollup:box1
  THAT JUST STOPS EVERYTHING

  00:00:09.410 --> 00:00:12.910 rollup:box1
  FROM MOVING FROM MY PLACE

  00:00:10.610 --> 00:00:14.100 rollup:box1
  TO ANYWHERE ELSE.

  ...

The text that is added to a cue of the same rollup setting value is added below the existing text in that rollup rendering box. In the ::cue() selector matching, the "Lists of WebVTT Node Objects" can have classes defined by the "rollup" setting (only a single class per cue).

Discussion:

  • (+) similar to "cue classes", text is not repeated
  • (+) similar to "cue classes", markup of text is not repeated
  • (+) text can be incrementally supplied (e.g. in a live scenario)
  • (+) we can address the cues that belong into the same rendering box through one CSS statement: e.g. ::cue(rollup=box1)
  • (+) only a cue with a new rollup attribute value (or no such value) creates a new rendering box
  • (+) the rendering continuation between cues can be upheld even when there are several other cues in the middle that don't belong to the same continuation
  • (+) the rendering continuation between cues can be upheld even if the rendering area's cue settings change (e.g. if the rollup has to move from the bottom to the top of the viewport because there is some burnt-in text visible at the bottom of the screen that would be obstructed by the caption text)
  • (-) the number of rollup lines cannot be changed easily because every single rollup cue has a fixed specification of its end time
  • (+) it's simple
  • (+) in clients that don't support it - it just degrades to a non-rollup display mode.
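
A parser would see the proposed setting as just another name:value pair on the timing line. The following Python sketch shows one way the grouping could be derived; it handles only the simple hh:mm:ss.ttt timestamps used in the example, and the function names are hypothetical.

```python
import re
from collections import defaultdict

TIMING_RE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}\.\d{3})\s+-->\s+(\d{2}:\d{2}:\d{2}\.\d{3})(.*)$"
)


def parse_timing_line(line: str) -> tuple[str, str, dict[str, str]]:
    """Return start, end and the cue settings found on a timing line."""
    match = TIMING_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a timing line: {line!r}")
    start, end, rest = match.groups()
    settings = {}
    for item in rest.split():
        name, sep, value = item.partition(":")
        if sep:
            settings[name] = value
    return start, end, settings


def group_by_rollup(timing_lines: list[str]) -> dict:
    """Map each rollup box name (or None) to the cue timings assigned to it."""
    groups = defaultdict(list)
    for line in timing_lines:
        start, end, settings = parse_timing_line(line)
        groups[settings.get("rollup")].append((start, end))
    return dict(groups)


groups = group_by_rollup([
    "00:00:05.940 --> 00:00:10.610 rollup:box1",
    "00:00:07.040 --> 00:00:11.700 rollup:box1",
    "00:00:30.000 --> 00:00:35.000",
])
print(groups)
```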

5. Cue setting: Explicit Movement Indications

Analysis:

Since it's the arrival of new text that might cause scrolling, we can say what happens to text that's already in the place the new text is marked to go (i.e., move it). This leaves to VTT the semantics of what goes on which line, and to presentation software (possibly assisted by CSS) the decision over how nicely to do it (e.g. by smooth or jump scrolling).

Proposal:

Add to the Line indicator a trailing plus or minus sign to indicate what happens to existing content on the same line: it moves to the adjacent line with higher (plus) or lower (minus) number. If that line also has content, it, in turn, moves, until an empty line is found. When cue durations expire, no automatic movement (e.g. back down) is done; if that is desired, an explicit cue needs to be inserted.

Example:

  WEBVTT
 
  1
  00:00:05.940 --> 00:00:10.610 Line:-1
  WHEN I GET A SICK BIRD,

  2
  00:00:07.040 --> 00:00:11.700 Line:-1-
  THAT JUST STOPS EVERYTHING

  3
  00:00:09.410 --> 00:00:12.910 Line:-1-
  FROM MOVING FROM MY PLACE

  4
  00:00:10.610 --> 00:00:14.100 Line:-1-
  TO ANYWHERE ELSE.

  ...


Discussion:

  • (+) text is not repeated
  • (+) markup of text is not repeated
  • (+) text can be incrementally supplied (e.g. in a live scenario)
  • (-) cues might get scrolled off the screen while still being active
  • (+) the movement will only be applied to cues in the same location, so it is possible to have other cues temporally in the middle that are rendered elsewhere on screen and are not moved
  • (-) we can't address the cues that belong into the same rendering location through one CSS statement
  • (-) it is not possible to move a block of text somewhere else on screen e.g. if it has to move from the bottom to the top of the viewport because there is some burnt-in text visible at the bottom of the screen that would be obstructed by the caption text
  • (+) we can explicitly state what movement the new text creates: towards the higher (+) or lower (-) line number
  • (-) the number of rollup lines cannot be changed easily because every single rollup cue has a fixed specification of its end time
  • (+) it is fairly simple
  • (+) in clients that don't support it - it just degrades to a non-rollup display mode.
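
The cascading movement in this proposal can be simulated with a small Python sketch. It models the screen as a mapping from line number to text and ignores cue end times; the function names are hypothetical, and the "-1-" syntax is the one from the example above.

```python
def parse_line_setting(value: str) -> tuple[int, int]:
    """Split e.g. '-1-' into (line number, movement: +1, -1 or 0)."""
    move = 0
    if len(value) > 1 and value[-1] in "+-":
        move = 1 if value[-1] == "+" else -1
        value = value[:-1]
    return int(value), move


def place(screen: dict[int, str], line: int, text: str, move: int = 0) -> None:
    """Put text on a line, first moving existing content out of the way."""
    if move and line in screen:
        # Find the first empty line in the direction of movement.
        free = line + move
        while free in screen:
            free += move
        # Shift every occupied line between `line` and `free` by one step.
        while free != line:
            screen[free] = screen.pop(free - move)
            free -= move
    screen[line] = text


screen: dict[int, str] = {}
for setting, text in [
    ("-1", "WHEN I GET A SICK BIRD,"),
    ("-1-", "THAT JUST STOPS EVERYTHING"),
    ("-1-", "FROM MOVING FROM MY PLACE"),
]:
    line, move = parse_line_setting(setting)
    place(screen, line, text, move)

# Print from the lowest line number to the highest.
for line in sorted(screen):
    print(line, screen[line])
```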

Extension possibility:

We can probably make this work for ticker-tape-like horizontal actions as well; text that goes on the same line, and is right-aligned with a 'move left' (minus sign) indication or left aligned with a 'move right' (plus sign) indication.

6. Cue settings: introduce transitions

Analysis:

What we really want is for already rendered cue text to transition to some other position without repeating cues.


Proposal:

Introduce cue settings that allow for a transition to be specified and that has a direct mapping into CSS transitions.

Example:

  WEBVTT
 
  1
  00:00:05.940 --> 00:00:10.610 Line:80% Delay:1.1 Duration:1.5 Transition:linear Line:70% Delay:2.4 Duration:1.5 Transition:linear Line:60%
  WHEN I GET A SICK BIRD,

  2
  00:00:07.040 --> 00:00:11.700 Line:80% Delay:2.4 Duration:1.5 Transition:linear Line:70% Delay:1.2 Duration:1.5 Transition:linear Line:60%
  THAT JUST STOPS EVERYTHING

  3
  00:00:09.410 --> 00:00:12.910 Line:80% Delay:1.2 Duration:1.5 Transition:linear Line:70%
  FROM MOVING FROM MY PLACE

  4
  00:00:10.610 --> 00:00:14.100
  TO ANYWHERE ELSE.

  ...


This shows the first cue at line position 80%; it then moves up to 70% as the next cue arrives, then moves up to 60% as the following cue arrives, and so on.

Discussion:

  • (+) text is not repeated
  • (+) markup of text is not repeated
  • (+) text can be incrementally supplied (e.g. in a live scenario)
  • (+) the movement will be applied independently of other cues, e.g. we could have a single sign translation cue follow a sign as it moves across the screen
  • (-) we can't address the cues that belong into the same rendering location through one CSS statement, since there is no concept of a shared rendering location
  • (+) we can explicitly state where things are moved and exactly how
  • (-) the number of rollup lines cannot be changed easily because every single rollup cue has a fixed specification of its rendering position and transitions
  • (-) it is fairly complicated and can result in lengthy cue settings if several movement cycles are required.
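
To give an idea of the mapping into CSS transitions, the Python sketch below reads the proposed settings of one cue and emits a CSS transition declaration for each movement step. Mapping the line position onto the CSS top property is an assumption made here for illustration, as are the function names; the proposal itself does not define the mapping.

```python
def parse_transitions(settings: str) -> tuple[str, list[dict[str, str]]]:
    """Return the initial line position and the list of movement steps."""
    start_line = None
    steps: list[dict[str, str]] = []
    pending: dict[str, str] = {}
    for token in settings.split():
        name, _sep, value = token.partition(":")
        name = name.lower()
        if name != "line":
            # Delay, Duration and Transition describe the upcoming step.
            pending[name] = value
        elif start_line is None:
            start_line = value
        else:
            # A further Line setting closes the step and gives its target.
            steps.append({**pending, "line": value})
            pending = {}
    return start_line, steps


def css_transition(step: dict[str, str]) -> str:
    """CSS shorthand order: property, duration, timing function, delay."""
    return (
        f"transition: top {step['duration']}s {step['transition']} "
        f"{step['delay']}s; top: {step['line']};"
    )


start, steps = parse_transitions(
    "Line:80% Delay:1.2 Duration:1.5 Transition:linear Line:70%"
)
print(start)
for step in steps:
    print(css_transition(step))
```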

7. Rendering preferences

Proposal:

Since the rendering of cues in roll-up vs pop-on style seems to be a user preference, we could just make it a browser preference setting rather than baking it into the markup. It seems more useful to mark up the semantics ("these captions are live [so you may want to switch to roll-up captions]") than the presentation ("these captions are roll-ups"). The UA can then choose, based on this flag, implementation experience and user preferences. This points towards adding a metadata attribute in the header of the caption file that states what type of content should be expected.


Example:

  WEBVTT
  renderingHint=rollup
 
  00:00:05.940 --> 00:00:10.610
  WHEN I GET A SICK BIRD,

  00:00:07.040 --> 00:00:11.700
  THAT JUST STOPS EVERYTHING

  00:00:09.410 --> 00:00:12.910
  FROM MOVING FROM MY PLACE

  00:00:10.610 --> 00:00:14.100
  TO ANYWHERE ELSE.

  ...

With current rendering rules, this example would be rendered as follows:

  Screen1:
  WHEN I GET A SICK BIRD,

  Screen2:
  THAT JUST STOPS EVERYTHING
  WHEN I GET A SICK BIRD,

  Screen3:
  FROM MOVING FROM MY PLACE
  THAT JUST STOPS EVERYTHING
  WHEN I GET A SICK BIRD,

  Screen4:
  FROM MOVING FROM MY PLACE
  THAT JUST STOPS EVERYTHING
  TO ANYWHERE ELSE.

A preference setting that would allow text movement on collisions would be rendered as follows:

  Screen1:
  WHEN I GET A SICK BIRD,

  Screen2:
  WHEN I GET A SICK BIRD,
  THAT JUST STOPS EVERYTHING

  Screen3:
  WHEN I GET A SICK BIRD,
  THAT JUST STOPS EVERYTHING
  FROM MOVING FROM MY PLACE

  Screen4:
  THAT JUST STOPS EVERYTHING
  FROM MOVING FROM MY PLACE
  TO ANYWHERE ELSE.

A preference setting for moving up or down would be possible, too, e.g. with down movement:

  Screen1:
  WHEN I GET A SICK BIRD,

  Screen2:
  THAT JUST STOPS EVERYTHING
  WHEN I GET A SICK BIRD,

  Screen3:
  FROM MOVING FROM MY PLACE
  THAT JUST STOPS EVERYTHING
  WHEN I GET A SICK BIRD,

  Screen4:
  TO ANYWHERE ELSE.
  FROM MOVING FROM MY PLACE
  THAT JUST STOPS EVERYTHING
  WHEN I GET A SICK BIRD,


Discussion:

  • (+) Allows the recreation of the scrolling effect.
  • (+) Allows for up and down scrolling.
  • (+) Only a small change to the WebVTT format is required with a metadata hint.
  • (+) cue text is not repeated
  • (+) markup of text is not repeated
  • (+) the rendering continuation between cues can be upheld even if the rendering area's cue settings change (e.g. if the rollup has to move from the bottom to the top of the viewport because there is some burnt-in text visible at the bottom of the screen that would be obstructed by the caption text)
  • (-) the number of rollup lines cannot be changed easily because every single rollup cue has a fixed specification of its end time
  • (-) the author has very little influence on the presentation of the text, e.g. cannot control the scrolling behavior in CSS
  • (+) it's simple
  • (+) in clients that don't support it - it just degrades to a non-rollup display mode.
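
A UA-side decision along these lines could look roughly like the Python sketch below. It reads name=value metadata from the header, as in the example, and lets an explicit user preference override the hint. The mode names "rollup" and "pop-on" and both function names are hypothetical.

```python
from typing import Optional


def parse_header(vtt_text: str) -> dict[str, str]:
    """Collect name=value metadata lines that follow the WEBVTT line."""
    lines = vtt_text.splitlines()
    if not lines or not lines[0].startswith("WEBVTT"):
        raise ValueError("not a WebVTT file")
    metadata = {}
    for line in lines[1:]:
        if not line.strip():
            break  # a blank line ends the header
        name, sep, value = line.partition("=")
        if sep:
            metadata[name.strip()] = value.strip()
    return metadata


def choose_mode(metadata: dict[str, str], user_preference: Optional[str] = None) -> str:
    """The user's setting wins; otherwise follow the hint in the file."""
    if user_preference is not None:
        return user_preference
    return "rollup" if metadata.get("renderingHint") == "rollup" else "pop-on"


SAMPLE = """WEBVTT
renderingHint=rollup

00:00:05.940 --> 00:00:10.610
WHEN I GET A SICK BIRD,
"""
metadata = parse_header(SAMPLE)
print(choose_mode(metadata))
print(choose_mode(metadata, user_preference="pop-on"))
```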

8. Give Rendering Objects a name

Analysis:

We should not regard rollup captions as a problem to be solved with individual cue tagging, or CSS (which doesn't seem very clean, especially when a caption track might be viewed in a standalone player).


Proposal:

We should instead define abstract 'rollup' objects, which are roughly like meta-cues - they have a lot of the same settings, but instead of containing text, they contain cues. These would map to CSS boxes positioned within the video render area.


Example:

We define 'renderArea' containers with a special cue, something like this:

       DEFINE --> renderArea(lines: 3, scroll: up, line: 20%, position: 10%, size:80%, name:mainRollup)

This recognises the renderArea as a first-class object that can be positioned independently of other cues - and integrates better with the cue positioning model than a CSS driven solution. It allows the cues inside it to be rendered as rollup text.

In the above example, we're specifying a rollup size of three lines, a scrolling direction of "up" (the alternatives being down, left and right), a line position of 20%, a text position of 10%, and a size of 80%. With the exception of the line count, all of that matches what can be defined on regular cues - which is the whole point.

Then there's a name. In certain and very specific circumstances, you may have some cues in the rollup, and some cues outside it, obeying the normal positioning model - or two or more rollups for different content. With that in mind, a name is required to address the renderArea in order to assign cues to it, but could also be used to select it in CSS using it as a class identifier on the containing cues.

  WEBVTT

  DEFINE --> renderArea(lines: 3, scroll: up, line: 20%, position: 10%, size:80%, name:mainRollup)
 
  00:00:05.940 --> 00:00:10.610 renderTo:mainRollup
  WHEN I GET A SICK BIRD,

  00:00:07.040 --> 00:00:11.700 renderTo:mainRollup
  THAT JUST STOPS EVERYTHING

  00:00:09.410 --> 00:00:12.910 renderTo:mainRollup
  FROM MOVING FROM MY PLACE

  00:00:10.610 --> 00:00:14.100 renderTo:mainRollup
  TO ANYWHERE ELSE.

  ...

  00:00:30.000 --> 00:00:35.000 renderTo:none size:100%
  © Copyright message which does not appear inside the rollup

Because the id is pretty long, you wouldn't want to include it against every cue - so you could use the previously proposed DEFAULTS cue to specify it:

       DEFAULTS --> line:40% size:80% renderTo:mainRollup

That way you could specify a single rollup right at the top of the VTT file, set it as a default - and the renderer is then aware of the rollup and can take into account text size and override specific aspects of it for accessibility if required - and of course - can make the content appropriately available to assistive technology, in a way no different to ordinary cues.

This results in a much simpler example:

  WEBVTT

  DEFINE --> renderArea(name:mainArea)
  DEFAULTS --> line:20% size:80% position:10% renderTo:mainArea
 
  00:00:05.940 --> 00:00:10.610
  WHEN I GET A SICK BIRD,

  00:00:07.040 --> 00:00:11.700
  THAT JUST STOPS EVERYTHING

  00:00:09.410 --> 00:00:12.910
  FROM MOVING FROM MY PLACE

  00:00:10.610 --> 00:00:14.100
  TO ANYWHERE ELSE.

  ...

  00:00:30.000 --> 00:00:35.000 renderTo:none size:100%
  © Copyright message which does not appear inside the rollup

The other important benefit is that it can support vertical and vertical-lr rollups. The content authors and web developers don't have to consider problems of internationalisation, cue lengths and font sizes as part of their caption file, as the UA handles all of that.

Discussion:

  • (+) text is not repeated
  • (+) markup of text is not repeated
  • (+) text can be incrementally supplied (e.g. in a live scenario)
  • (+) the movement will be applied independently of other cues, e.g. we could have a single sign translation cue follow a sign as it moves across the screen
  • (+) cues are grouped
  • (+) smooth location changes can be introduced using CSS by addressing the cues as a group, e.g. ::cue(.mainRollup)
  • (+) the number of rollup lines can be changed easily
  • (-) just like the DEFAULTS, this introduces markup that is not part of a cue and is therefore hard to serialize into an in-band encapsulation such as in MPEG-4 or WebM.
  • (+) it is fairly simple.
  • (+) in clients that don't support it - it just degrades to a non-rollup display mode.
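
For illustration, the proposed DEFINE line could be read into a settings dictionary as sketched below in Python. The syntax handled is only what appears in the example above; the function name is hypothetical and no validation of the individual values is attempted.

```python
import re

DEFINE_RE = re.compile(r"^DEFINE\s+-->\s+renderArea\((.*)\)\s*$")


def parse_render_area(line: str) -> dict[str, str]:
    """Return the name:value settings of a renderArea definition."""
    match = DEFINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a renderArea definition: {line!r}")
    settings = {}
    for item in match.group(1).split(","):
        name, sep, value = item.partition(":")
        if sep:
            settings[name.strip()] = value.strip()
    return settings


area = parse_render_area(
    "DEFINE --> renderArea(lines: 3, scroll: up, line: 20%, "
    "position: 10%, size:80%, name:mainRollup)"
)
print(area)
```

A cue with renderTo:mainRollup would then be matched against the name entry of this dictionary.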

9. Re-render all text

Analysis:

What we want is that, whenever text is added to the caption screen, all the text that is already rendered is re-positioned as though all of the text were appearing together at that instant.


Proposal:

Every time there is a cue change, re-layout the cues that are active and need to be rendered. All cues that go onto the same location on the screen are merged into one cue and then rendered together.

Example:

  WEBVTT
 
  00:00:05.940 --> 00:00:10.610
  WHEN I GET A SICK BIRD,

  00:00:07.040 --> 00:00:11.700
  THAT JUST STOPS EVERYTHING

  00:00:09.410 --> 00:00:12.910
  FROM MOVING FROM MY PLACE

  00:00:10.610 --> 00:00:14.100
  TO ANYWHERE ELSE.

  ...

This will be rendered as follows, with re-layouts at the beginning of each cue:

  Screen1:
  WHEN I GET A SICK BIRD,

  Screen2:
  WHEN I GET A SICK BIRD,
  THAT JUST STOPS EVERYTHING

  Screen3:
  WHEN I GET A SICK BIRD,
  THAT JUST STOPS EVERYTHING
  FROM MOVING FROM MY PLACE

  Screen4:
  WHEN I GET A SICK BIRD,
  THAT JUST STOPS EVERYTHING
  FROM MOVING FROM MY PLACE
  TO ANYWHERE ELSE.
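
The re-layout rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a proposed algorithm: the names `relayout`, `screens` and `end_inclusive` are hypothetical, all cues are assumed to target the same screen location, and a re-layout is only computed at each cue start. The `end_inclusive` flag is there because the result at 00:00:10.610 depends on whether a cue still counts as active at the exact instant of its end time.

```python
from typing import List, Tuple

Cue = Tuple[float, float, str]  # (start seconds, end seconds, text)

# The four cues from the example above, with times converted to seconds.
CUES: List[Cue] = [
    (5.94, 10.61, "WHEN I GET A SICK BIRD,"),
    (7.04, 11.70, "THAT JUST STOPS EVERYTHING"),
    (9.41, 12.91, "FROM MOVING FROM MY PLACE"),
    (10.61, 14.10, "TO ANYWHERE ELSE."),
]


def is_active(cue: Cue, t: float, end_inclusive: bool) -> bool:
    """True if the cue should be rendered at time t."""
    start, end, _ = cue
    return start <= t and (t <= end if end_inclusive else t < end)


def relayout(cues: List[Cue], t: float, end_inclusive: bool = False) -> List[str]:
    """Merge all cues active at time t into one block, in start-time order."""
    active = [c for c in cues if is_active(c, t, end_inclusive)]
    return [text for _, _, text in sorted(active, key=lambda c: c[0])]


def screens(cues: List[Cue], end_inclusive: bool = False) -> List[List[str]]:
    """One re-layout per cue start, as the proposal describes."""
    return [relayout(cues, start, end_inclusive) for start, _, _ in cues]
```

With `end_inclusive=True` the sketch reproduces the four screens listed above. With the default half-open interval (start <= t < end), the first cue ends at the instant the fourth begins, so the fourth layout holds only the last three lines.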


Discussion:

  • (+) Allows the recreation of the scrolling effect.
  • (-) Only allows for scrolling in one direction: up, since the bottom line of the cue is "fixed".
  • (+) Only a small change to the WebVTT format is required with a metadata hint.
  • (+) cue text is not repeated
  • (+) markup of text is not repeated
  • (-) the rendering continuation between cues may be interrupted briefly, since all text would be removed and then re-laid out.
  • (+) the number of rollup lines can be changed easily
  • (-) the author has very little influence on the presentation of the text, e.g. cannot control the scrolling behavior in CSS
  • (+) it's simple
  • (+) in clients that don't support it - it just degrades to a non-rollup display mode where the text scrolls down
  • (+) it's possible to extend this idea by defining which boundary of the cue display is fixed, e.g. fixed: bottom/top, so it may repaint by adding the text at the top (when fixed:bottom) or bottom (when fixed:top)

10. Define 'windows' (building on 8 above)

Analysis:

The motivations to do roll-up are (a) people need it and will fudge it by replicating text (2, above) if we don't specify it, so we should specify it; and (b) regulatory compliance, as the FCC requires windows with paint-on, pop-on, or roll-up behavior and a background color. It seems we should also look at meeting the compliance requirement for colored windows as well as scrolling windows. That means we need 3 colors: window, text background, and text.

We therefore need to be able to do an acceptable job of replicating CEA 708 semantics, when appropriate (as the regulations refer to it). CEA 708 has up to 8 numbered windows, which have a background color, layering, and scroll behavior. The background is painted when the window is defined, and erased when the window is deleted (i.e. is independent of any cue text that may be in it).

WebVTT has (in contrast to most other formats, such as TTML or 708) excellent random accessibility -- one can operate on the file knowing only the header and the cues that overlap and extend from the start-time desired. We absolutely want to keep this property. Live conversion of 708 to WebVTT is not a goal. Unlike 708, we therefore do NOT want to be able to define regions at any point in the WebVTT file, but only in the header.

We also want to retain the property that the semantics of what is happening are expressed in the VTT itself, and that any CSS association is purely for styling (i.e. a CSS engine is not required).

Proposal:

This idea is similar to proposal 8, with a few tidbits gleaned from the other proposals. 8 seemed the closest to what we need; other ideas (like 5) were trying to avoid defining a 'window', but it's clear we need to go the other way.

Since the word 'window', used in CEA 708, is rather heavily used in computers, we use the term 'region' here.

There are three parts:

  1. defining the regions,
  2. associating optional styling with the region,
  3. drawing text in them (which may cause scrolling).

This proposal has a syntax that allows definition of the region based on the geometry of the implicit region into which drawing happens today. The definition of the region specifies its:

  • width, in terms of left and right edge percentages of the implicit top-level region that is the viewport;
  • height, either in terms of bottom and top percentages or line numbers; if line numbers are used, then the text height that text without any explicit style would have (i.e. determined solely by the user-agent and/or user preferences) is used to determine line heights;
  • id, used by cues to associate with the region, and also optionally followed by a class-tag (id.class) to allow association with CSS (including specifying a color);
  • painting behavior: as now, the default is pop-on; optionally it can specify scroll up/down/left/right (paint-on is handled already by 'karaoke timing');
  • front-to-back layering indicator, defaulting to 0, with more negative towards the front; for regions with the same ordering, regions declared later are in front of regions declared earlier.
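
The layering rule in the last item can be read as a sort order. The following Python sketch is one possible interpretation, not part of the proposal: `Region` and `paint_order` are hypothetical names, and the function returns region ids back-to-front, so the last id in the result is drawn on top.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    id: str
    layer: int = 0  # the layering indicator defaults to 0


def paint_order(regions: List[Region]) -> List[str]:
    """Return region ids back-to-front (the last id ends up in front).

    More negative layers are nearer the front; for equal layers, regions
    declared later are in front of regions declared earlier.
    """
    indexed = list(enumerate(regions))  # remember declaration order
    indexed.sort(key=lambda pair: (-pair[1].layer, pair[0]))
    return [region.id for _, region in indexed]
```

For instance, declaring regions a (layer 0), b (layer -2) and c (layer 0) in that order paints a first, then c, then b, so b is in front of both and c is in front of a.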

The cues take an extra optional keyword that specifies their region, by id. Cue line and position are relative to their region box; cues that don't reference a specific region are implicitly referencing the anonymous region that is the entire viewport.

Regions are painted in their background color as long as they have any cue that is active (i.e. the current time is within the bounds of the cue's start and end times). So, a trivial way to map from CEA 708 is to have a cue whose start time matches the define window command, whose end time matches the erase window command, and which paints (for example) a single space character. However, for live streams, when the erase time cannot be foreseen at define time, the same effect can be achieved by using multiple cues.

We can allow re-definition of the implicit, anonymous, 'whole viewport' region, as well, as this makes it possible to define scrolling for it.

The timing of the scroll can be controlled by CSS, otherwise defaults similar to those in CEA 708 apply. The CSS attributes that apply are <tbd> (transitions on the ::cue x-position and y-position??). There is a new pseudo-selector for VTT regions (::region?), which allows defining their fill color ('window color'). We can associate styles by the ".name" syntax after the ID, as for text.

When a cue is painted into a region that is marked as scrolling along the line-advance axis (e.g. vertically for horizontal text), then existing text on the indicated line, and succeeding lines in the scroll direction, are moved in the scroll direction, as each line of the cue is added.

(We don't need to specify ticker-tape scrolling, but maybe we can make it work as follows. When a cue is painted into a region that is marked as scrolling along the character-advance axis (e.g. horizontally for horizontal text), then each line of text is added to the end opposite the scroll direction of the existing text on the identified line (i.e. to the right-hand end for leftward scrolling text, and the left-hand end for rightward scrolling text), and then scrolled in the indicated direction until it reaches the position specified by the 'align' attribute.)

Text is not only positioned relative to, but also clipped to, the region, in all cases. Note that text may be 'scrolled out of the region' (and hence clipped and invisible) by these effects, even though it remains 'active' (within the range of its start and end times).
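
The scroll-and-clip behavior for the line-advance axis can be modelled roughly as below. This is a simplified sketch, not the proposed rendering algorithm: `ScrollUpRegion` is a hypothetical name, the region height is counted in lines, new text always enters at the bottom line of a scroll-up region, and cue end times are ignored so that only scrolling and clipping are shown.

```python
from typing import List


class ScrollUpRegion:
    """A region of fixed height (in lines) that scrolls upward as text is added."""

    def __init__(self, height: int) -> None:
        if height < 1:
            raise ValueError("height must be at least one line")
        self.height = height
        self.lines: List[str] = []  # oldest first; the newest line is at the bottom

    def paint(self, cue_text: str) -> None:
        # Each line of the cue is added in turn, pushing earlier lines upward.
        for line in cue_text.split("\n"):
            self.lines.append(line)

    def visible(self) -> List[str]:
        # Only the lines that still fit inside the region are shown.
        return self.lines[-self.height:]

    def clipped(self) -> List[str]:
        # Lines scrolled past the top edge: clipped and invisible, even though
        # the cues they belong to may still be active.
        return self.lines[:-self.height]
```

Painting a two-line cue into a region of height 3 that already shows two lines scrolls the oldest line out of view, while the cue that supplied it may still be within its start and end times.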

Example:

WEBVTT
Region: id=fred.wRed position=10% size=90% line=-1 height=3  scroll=up  layer=-1
Region: id=bill      position=10% size=90% line=-5 height=30%  scroll=up  layer=0


00:00:05.940 --> 00:00:10.610 region:fred
WHEN I GET A SICK BIRD,

00:00:07.040 --> 00:00:11.700 region:fred
THAT JUST STOPS EVERYTHING

00:00:09.410 --> 00:00:12.910 region:bill
FROM MOVING FROM MY PLACE

00:00:10.610 --> 00:00:14.100 region:fred
TO ANYWHERE 
ELSE OR BEYOND
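
To make the header syntax in this example concrete, here is a minimal Python sketch that splits one Region line into its settings. It assumes the syntax is exactly as written in the example (whitespace-separated name=value pairs after a `Region:` prefix, with an optional `.class` suffix on the id); `parse_region` is a hypothetical helper, and values are kept as strings rather than validated.

```python
from typing import Dict


def parse_region(header_line: str) -> Dict[str, str]:
    """Parse one 'Region:' header line of the proposed syntax into a dict."""
    prefix = "Region:"
    if not header_line.startswith(prefix):
        raise ValueError("not a Region header line")
    settings: Dict[str, str] = {}
    # split() with no argument tolerates the runs of spaces used in the example.
    for token in header_line[len(prefix):].split():
        name, sep, value = token.partition("=")
        if not sep:
            raise ValueError(f"malformed setting: {token!r}")
        settings[name] = value
    # The id may carry an optional class suffix (id.class) for CSS association.
    region_id, _, css_class = settings.get("id", "").partition(".")
    settings["id"] = region_id
    settings["class"] = css_class  # empty string when no class is given
    return settings
```

Because the region definitions live only in the header, a parser along these lines can run once before any cue is read, which keeps the random-access property described in the analysis.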

Discussion:

  • (+) text is not repeated
  • (+) markup of text is not repeated
  • (+) text can be incrementally supplied (e.g. in a live scenario)
  • (+) cues are grouped
  • (+) smooth location changes can be introduced using CSS
  • (+) serialization into an in-band encapsulation such as in MPEG-4 or WebM can be done as all the extra material (region definitions) is in the header
  • (+) it is fairly simple, and uses CSS only for optional presentational questions; semantics and information remain in VTT
  • (+) in clients that don't support it - it just degrades to a non-rollup display mode, particularly for the case where only the whole-viewport region is used
  • (+) we achieve FCC compliance, including window coloring
  • (+) existing files remain completely valid

11. Add rollup cue setting (building on 3/4 above)

The simplest thing might be to add a rollup:x cue setting, where x is the number of lines. This is similar to #3/#4 but setting a number of lines instead of a named "box", because we only need 1 box at a time to replicate 608 rollup (only 708 has more than one rollup box and it's never used in practice).

To save bandwidth, rather than putting the position/line settings on every cue, you could specify them for the first cue and ask the decoder to inherit the previous position/line settings until a new setting is provided. (If that won't work, just repeat the line/position settings on every cue.)

Example:

00:00:05.940 --> 00:00:10.610 rollup:3 align:start line:90% position:10%
HERE ARE SOME

00:00:07.040 --> 00:00:11.700 rollup:3
3 LINE ROLL-UP CAPTIONS

00:00:09.410 --> 00:00:12.910 rollup:3
AT THE BOTTOM

00:00:10.610 --> 00:00:14.100 rollup:3
OF THE SCREEN.

00:00:20.000 --> 00:00:24.000 rollup:2 align:start line:10% position:10%
NOW SOME 2 LINE ROLL-UP

00:00:22.000 --> 00:00:27.000 rollup:2
CAPTIONS AT THE TOP

00:00:24.000 --> 00:00:27.000 rollup:2
OF THE SCREEN.
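
A decoder for this scheme might behave roughly as in the following Python sketch. It is an illustration under several assumptions, none of which are fixed by the proposal: the helper names are hypothetical, cues arrive in start-time order, every cue carries a rollup:x setting, align is inherited along with line and position, and a cue is active for start <= t < end.

```python
from typing import Dict, List, Tuple

Cue = Tuple[float, float, str, str]  # (start, end, settings string, text)

INHERITED = ("align", "line", "position")

# The cues from the example above, with times converted to seconds.
CUES: List[Cue] = [
    (5.94, 10.61, "rollup:3 align:start line:90% position:10%", "HERE ARE SOME"),
    (7.04, 11.70, "rollup:3", "3 LINE ROLL-UP CAPTIONS"),
    (9.41, 12.91, "rollup:3", "AT THE BOTTOM"),
    (10.61, 14.10, "rollup:3", "OF THE SCREEN."),
    (20.0, 24.0, "rollup:2 align:start line:10% position:10%", "NOW SOME 2 LINE ROLL-UP"),
    (22.0, 27.0, "rollup:2", "CAPTIONS AT THE TOP"),
    (24.0, 27.0, "rollup:2", "OF THE SCREEN."),
]


def parse_settings(settings: str) -> Dict[str, str]:
    """Split 'name:value name:value' into a dict."""
    return dict(item.split(":", 1) for item in settings.split())


def resolve_settings(cues: List[Cue]) -> List[Dict[str, str]]:
    """Fill in align/line/position from the most recent cue that set them."""
    carried: Dict[str, str] = {}
    resolved = []
    for _, _, settings, _ in cues:
        own = parse_settings(settings)
        for name in INHERITED:
            if name in own:
                carried[name] = own[name]
        resolved.append({**carried, **own})
    return resolved


def visible_lines(cues: List[Cue], t: float) -> List[str]:
    """Lines shown at time t: at most `rollup` active cues, oldest on top."""
    active = [c for c in cues if c[0] <= t < c[1]]
    if not active:
        return []
    limit = int(parse_settings(active[-1][2])["rollup"])
    return [text for _, _, _, text in active[-limit:]]
```

The line limit is taken from the newest active cue, which is one way of letting a changed rollup:x value take effect immediately; a user override of the number of lines could be applied at the same point.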

The pros/cons would appear to be:

  • (+) in clients that don't support it - it just degrades to a non-rollup display mode.
  • (+) it is fairly simple, both to encode and to decode
  • (+) text is not repeated
  • (+) other than rollup:x, cue settings are not repeated (except when it needs to change)
  • (+) text can be incrementally supplied (e.g. in a live scenario)
  • (+) the number of rollup lines can be changed easily or overridden by the user
  • (+) does not introduce any new syntax outside of cues (which could cause problems with older parsers or with encapsulation for MPEG-4 / WebM)
  • (+) additional cue settings could be used to support scrolling in other directions if needed, e.g. "scrolldirection:h" for roll-up of vertical text or a marquee effect
  • (-) only supports one rollup region at a time (but CEA-608/708 only need 1)