Bug 21431 - Specify splicing behavior for text tracks
Status: RESOLVED FIXED
Product: HTML WG
Classification: Unclassified
Component: Media Source Extensions
Version: unspecified
Hardware: PC All
Importance: P2 normal
Target Milestone: ---
Assigned To: Aaron Colwell
QA Contact: HTML WG Bugzilla archive list
Keywords: PRE_LAST_CALL
Reported: 2013-03-29 01:28 UTC by Aaron Colwell
Modified: 2013-06-25 14:40 UTC (History)

Description Aaron Colwell 2013-03-29 01:28:27 UTC
I just realized that the spec has no text about how splicing text tracks should work. I need to review the algorithms a little more, but I believe that they might behave in surprising ways with particularly long cues.

Here are some initial questions that I think need to be answered:
1. If a media segment with cues overlaps existing cues in the source buffer, what should happen?
2. Should existing cues in the SourceBuffer that are overlapped by cues in the beginning of a new media segment get truncated?
3. If an existing cue spans the entire new media segment, does it get split into pieces or just stay visible for the whole period?
4. Do text cues have dependencies like video frames do?
5. I believe multiple cues starting at the same timestamp are allowed in a single text track. If so, should overlaps at that timestamp remove all the cues at the timestamp or should there be special rules that allow targeting individual cues?
Comment 1 Glenn Adams 2013-03-29 02:11:05 UTC
(In reply to comment #0)
> 4. Do text cues have dependencies like video frames do?

you mean dependencies like a B/P frame depending on a prior I frame?

> 5. I believe multiple cues starting at the same timestamp are allowed in a
> single text track.

correct

> If so, should overlaps at that timestamp remove all the
> cues at the timestamp or should there be special rules that allow targeting
> individual cues?

Without proposing any specific behavior, I would suggest an approach where a set of general rules is defined to be used in the absence of text-track-type-specific rules, and where type-specific rules, when they apply, may override any general rules.
Comment 2 Silvia Pfeiffer 2013-03-29 03:39:40 UTC
(In reply to comment #0)
> I just realized that the spec has no text about how splicing text tracks
> should work. I need to review the algorithms a little more, but I believe
> that they might behave in surprising ways with particularly long cues.
> 
> Here are some initial questions that I think need to be answered:
> 1. If a media segment with cues overlaps existing cues in the source buffer
> what should happen?

Assuming we have a video file with a text track (such as WebVTT in WebM) that we split into segments (source buffers), then transmit and cobble together for display in the client, would we repeat a cue that is still active, but not actually part of the current segment in the original file?

I actually think repetition should be possible (as a kind of "backup"), but should not lead to duplicate display. So, we'd need to be able to identify repeated cues and drop them on the floor.

> 2. Should existing cues in the SourceBuffer that are overlapped by cues in
> the beginning of a new media segment get truncated?

If the new segment has a copy of an existing cue, it should not lead to duplicate display, i.e. one of the two needs to be dropped.


> 3. If an existing cue spans the entire new media segment, does it get split
> into pieces or just stay visible for the whole period?

I think either should be possible, but just not lead to duplicate display.

> 4. Do text cues have dependencies like video frames do?

Each cue can stand on its own. The only kind of text track cue that has dependencies is chapters, and we have therefore decided in WebM to move them to the file header and deliver them in one set. Other cues should be completely independent.

> 5. I believe multiple cues starting at the same timestamp are allowed in a
> single text track. If so, should overlaps at that timestamp remove all the
> cues at the timestamp or should there be special rules that allow targeting
> individual cues?

What overlaps are you referring to? Such as: segment overlaps and the cues being delivered twice? IMHO, all we need to make sure is that cues have a unique identifier and when delivered twice don't get added to the list of TextTrackCues for rendering. The server then has to make sure to provide the same identifier for the same cue in any segmentation.
Comment 3 Glenn Adams 2013-03-29 03:56:54 UTC
(In reply to comment #2)
> (In reply to comment #0)
> > 5. I believe multiple cues starting at the same timestamp are allowed in a
> > single text track. If so, should overlaps at that timestamp remove all the
> > cues at the timestamp or should there be special rules that allow targeting
> > individual cues?
> 
> What overlaps are you referring to? Such as: segment overlaps and the cues
> being delivered twice? IMHO, all we need to make sure is that cues have a
> unique identifier and when delivered twice don't get added to the list of
> TextTrackCues for rendering. The server then has to make sure to provide the
> same identifier for the same cue in any segmentation.

Keep in mind that a cue's identifier might be the empty string and if not an empty string, does not need to be unique. So, it is permissible to have two cues in a text track list that have an empty string as identifier and share the same start and end times.
Comment 4 Silvia Pfeiffer 2013-03-29 04:09:04 UTC
(In reply to comment #3)
> (In reply to comment #2)
> > (In reply to comment #0)
> > > 5. I believe multiple cues starting at the same timestamp are allowed in a
> > > single text track. If so, should overlaps at that timestamp remove all the
> > > cues at the timestamp or should there be special rules that allow targeting
> > > individual cues?
> > 
> > What overlaps are you referring to? Such as: segment overlaps and the cues
> > being delivered twice? IMHO, all we need to make sure is that cues have a
> > unique identifier and when delivered twice don't get added to the list of
> > TextTrackCues for rendering. The server then has to make sure to provide the
> > same identifier for the same cue in any segmentation.
> 
> Keep in mind that a cue's identifier might be the empty string and if not an
> empty string, does not need to be unique. So, it is permissible to have two
> cues in a text track list that have an empty string as identifier and share
> the same start and end times.

Right, I was talking about a block identifier inside the video file format that identifies a cue uniquely for that stream and is transmitted to the other end. The identifier in the TextTrackCue object on the HTML level would certainly not be sufficient. Is something like a block identifier used in WebM or MPEG?
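
To illustrate the de-duplication idea discussed in comments 2 to 4, here is a minimal Python sketch. It assumes the container supplies some stream-unique block identifier per cue; whether WebM or MPEG actually provides one is exactly the open question above, so `block_id`, `add_cue_once`, and the `"blk-7"` value are hypothetical names used only for illustration.

```python
def add_cue_once(cue_list, seen_block_ids, block_id, cue):
    """Append cue unless a cue with the same container-level id was already added.

    block_id is a hypothetical stream-unique identifier carried by the
    container, not the (possibly empty, possibly repeated) TextTrackCue id.
    """
    if block_id in seen_block_ids:
        return False  # repeated delivery of a known cue: drop it
    seen_block_ids.add(block_id)
    cue_list.append(cue)
    return True


cues, seen = [], set()
add_cue_once(cues, seen, "blk-7", (0.0, 30.0, "Hello"))  # delivered in segment 1
add_cue_once(cues, seen, "blk-7", (0.0, 30.0, "Hello"))  # repeated in segment 2
print(len(cues))  # the repeated cue was not added a second time
```

Keying on a container-level identifier rather than on the cue's own id, start, and end avoids the problem raised in comment 3, where two distinct cues may legitimately share all three.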
Comment 5 Cyril Concolato 2013-03-29 08:12:42 UTC
(In reply to comment #0)
> I just realized that the spec has no text about how splicing text tracks
> should work. I need to review the algorithms a little more, but I believe
> that they might behave in surprising ways with particularly long cues.
> 
> Here are some initial questions that I think need to be answered:
> 1. If a media segment with cues overlaps existing cues in the source buffer
> what should happen?
This is a tricky question because the overlap may be 'natural' in the original content (at least in WebVTT) but if you're splicing a movie with subtitles with an ad with subtitles, you probably don't want to keep rendering the movie subtitles in the ad. 

> 2. Should existing cues in the SourceBuffer that are overlapped by cues in
> the beginning of a new media segment get truncated?
I would say if the media segment starts with a RAP (random access point), yes. It might be problematic if it doesn't.

> 3. If an existing cue spans the entire new media segment, does it get split
> into pieces or just stay visible for the whole period?
> 4. Do text cues have dependencies like video frames do?
In a sense, yes. You can create WebVTT streams in which not every 'frame' (access unit) is a RAP. Rendering of a WebVTT cue (in particular line positioning) may depend on whether there is already a cue being displayed. So defining a WebVTT RAP depends on whether or not you want exact rendering. But it is possible to rewrite a WebVTT file to force 'true' RAPs. You might want to have a look at:
http://concolato.wp.mines-telecom.fr/2012/09/12/webvtt-streaming/
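
As a rough sketch of what such a rewrite could look like (an assumption about the general approach, not a reproduction of the linked post), overlapping cues can be flattened into consecutive, non-overlapping intervals, each listing every cue active during it. The function name `flatten` and the tuple layout are hypothetical.

```python
def flatten(cues):
    """Turn overlapping cues into non-overlapping intervals.

    cues: list of (start, end, text) in seconds.
    Returns a list of (start, end, [texts active in that interval]).
    """
    # Every cue start or end is a point where the set of active cues can change.
    boundaries = sorted({t for start, end, _ in cues for t in (start, end)})
    flattened = []
    for left, right in zip(boundaries, boundaries[1:]):
        active = [text for start, end, text in cues if start < right and end > left]
        if active:
            flattened.append((left, right, active))
    return flattened


stair_step = [(0.0, 4.0, "A"), (2.0, 6.0, "B"), (4.0, 8.0, "C")]
for interval in flatten(stair_step):
    print(interval)
```

Each resulting interval is self-contained, so a player starting at any interval boundary has everything it needs to render the cues active at that point.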

> 5. I believe multiple cues starting at the same timestamp are allowed in a
> single text track. 
Yes.

> If so, should overlaps at that timestamp remove all the
> cues at the timestamp or should there be special rules that allow targeting
> individual cues?
I don't see why you'd want to target individual cues. I think there could be a flag indicating whether you want to keep existing overlapping cues or whether you want to cut them short.
Comment 6 Cyril Concolato 2013-03-29 08:20:42 UTC
(In reply to comment #2)
> (In reply to comment #0)
> > I just realized that the spec has no text about how splicing text tracks
> > should work. I need to review the algorithms a little more, but I believe
> > that they might behave in surprising ways with particularly long cues.
> > 
> > Here are some initial questions that I think need to be answered:
> > 1. If a media segment with cues overlaps existing cues in the source buffer
> > what should happen?
> 
> Assuming we have a video file with a text track (such as WebVTT in WebM)
> that we split into segments (source buffers), then transmit and cobble
> together for display in the client, would we repeat a cue that is still
> active, but not actually part of the current segment in the original file?
> 
> I actually think repetition should be possible (as a kind of "backup"), but
> should not lead to duplicate display. So, we'd need to be able to identify
> repeated cues and drop them on the floor.
> 
> > 2. Should existing cues in the SourceBuffer that are overlapped by cues in
> > the beginning of a new media segment get truncated?
> 
> If the new segment has a copy of an existing cue, it should not lead to
> duplicate display, i.e. one of the two needs to be dropped.
Let's take an example. You have a WebVTT file that has a single cue lasting from 0 to 30 seconds. You split this file into 3 x 10s segments. I think in that case that the cue in segment 1 should have a duration of 10s not of 30s. In WebVTT over MP4, segment 2 will contain a sample marked as continuation of the previous from 10 to 20. Same for segment 3. So assuming you drop segment 2 and display some other content (possibly with subtitles) for 10s and then push segment 3, there won't be duplicate display. 
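
A minimal Python sketch of the 3 x 10s example above, under stated assumptions: `CuePiece`, `split_cue`, and the `continuation` flag are hypothetical names standing in for whatever the container uses to mark a sample as a continuation of the previous one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CuePiece:
    start: float        # seconds
    end: float          # seconds
    text: str
    continuation: bool  # True if this piece continues a cue from an earlier segment


def split_cue(start, end, text, segment_duration):
    """Cut one cue into pieces that never cross a segment boundary."""
    pieces = []
    piece_start = start
    while piece_start < end:
        # End of the segment that contains piece_start.
        boundary = (int(piece_start // segment_duration) + 1) * segment_duration
        piece_end = min(end, boundary)
        pieces.append(CuePiece(piece_start, piece_end, text, piece_start > start))
        piece_start = piece_end
    return pieces


pieces = split_cue(0.0, 30.0, "Hello", 10.0)
# Drop segment 2 (10s to 20s), as in the example above.
remaining = [p for p in pieces if not (10.0 <= p.start < 20.0)]
for p in remaining:
    print(p)
```

Because no piece extends past its own segment, replacing segment 2 with other content leaves nothing from the original cue active between 10s and 20s.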

> 
> 
> > 3. If an existing cue spans the entire new media segment, does it get split
> > into pieces or just stay visible for the whole period?
> 
> I think either should be possible, but just not lead to duplicate display.
> 
> > 4. Do text cues have dependencies like video frames do?
> 
> Each cue can stand on its own. 
From a coding perspective, yes. From a rendering perspective, not always.

> The only kind of text track cue that has
> dependencies are chapters and we have therefore decided in WebM to move them
> to the file header and deliver in one set. Other cues should be completely
> independent.
> 
> > 5. I believe multiple cues starting at the same timestamp are allowed in a
> > single text track. If so, should overlaps at that timestamp remove all the
> > cues at the timestamp or should there be special rules that allow targeting
> > individual cues?
> 
> What overlaps are you referring to? Such as: segment overlaps and the cues
> being delivered twice? IMHO, all we need to make sure is that cues have a
> unique identifier and when delivered twice don't get added to the list of
> TextTrackCues for rendering. The server then has to make sure to provide the
> same identifier for the same cue in any segmentation.
Comment 7 Silvia Pfeiffer 2013-03-29 22:23:09 UTC
(In reply to comment #6)
>
> Let's take an example. You have a WebVTT file that has a single cue lasting
> from 0 to 30 seconds. You split this file into 3 x 10s segments. I think in
> that case that the cue in segment 1 should have a duration of 10s not of
> 30s. In WebVTT over MP4, segment 2 will contain a sample marked as
> continuation of the previous from 10 to 20. Same for segment 3. So assuming
> you drop segment 2 and display some other content (possibly with subtitles)
> for 10s and then push segment 3, there won't be duplicate display. 

Out of curiosity: What would happen in MP4 if the first segment has the cue lasting 30s and segments 2 and 3 indicate a continuation of segment 1's cue for their timings? Would the cue be rendered twice?
Comment 8 David Singer 2013-04-05 17:32:25 UTC
I think we have to be careful to distinguish between editing operations (I want 30 seconds of this content appended after a 3-second green-screen, and then after all that I need 2 minutes of this other content) and segmentation/delivery operations (I want to deliver this content in roughly 5-second segments).

In the first, it's important that cues get truncated to match the edits.  This is a content-level operation.

In the second, segmentation software typically doesn't look at or below the content level.  If there are 'long duration' video frames (e.g. a slide-show, a slow time-lapse), that's the way it is.

The VTT-in-MP4 work has focused on making it possible to random access, edit, splice, fragment long cues to make segmentation easier, and so on, and know whether some cue is still active.  What Silvia suggests -- a repetition that is labelled as such -- is indeed possible.

The DASH work explicitly considered that segment boundaries may be 'ragged edges' -- some tracks may have content that persists after the boundary. But ideally the degree of raggedness is not 'large' (comparable to the segment length).

Both are possible in MP4: if you're authoring for segmented delivery then I would suggest segmenting cues also so that cues (and repetitions) are shorter than the shortest expected segment length. Otherwise, you have a significant ragged-edge issue.
Comment 9 Silvia Pfeiffer 2013-04-08 00:02:22 UTC
David: do you have a public link to the WebVTT in MPEG spec?
Comment 10 Aaron Colwell 2013-04-18 18:39:20 UTC
So I've taken another look at the existing MSE spec text and I believe that the splicing behavior for text tracks is sane if the cues do not overlap. Basically any existing cues in the SourceBuffer will get removed if new cues overlap them. I think that is a reasonable default behavior.

Based on the discussion w/ Glenn on the last MSE call (http://www.w3.org/2013/04/09-html-media-minutes.html#item03) I think this may be enough to resolve this bug. I think the plan was to only specify the default behavior and then more advanced behavior for specific text track formats could be defined elsewhere later. For example the default rules make stair-step overlapping cues in WebVTT not work properly because each cue overlaps the previous one and therefore causes it to get removed from the SourceBuffer. This situation can be converted to a non-overlapping form as Cyril suggests.
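
The stair-step problem can be illustrated with a small Python sketch. This is a deliberately simplified model of the default rule as described in this comment (remove any buffered cue that a new cue overlaps), not the spec's coded frame processing algorithm; `append_with_default_rule` and the tuple layout are hypothetical.

```python
def append_with_default_rule(buffered, new_cue):
    """Remove every buffered cue whose interval overlaps new_cue, then add new_cue.

    Cues are (start, end, text) tuples with times in seconds.
    """
    new_start, new_end, _ = new_cue
    # Keep only cues that end at or before the new cue starts,
    # or start at or after the new cue ends.
    kept = [c for c in buffered if c[1] <= new_start or c[0] >= new_end]
    kept.append(new_cue)
    return sorted(kept)


stair_step = [(0.0, 4.0, "A"), (2.0, 6.0, "B"), (4.0, 8.0, "C")]
buffered = []
for cue in stair_step:
    buffered = append_with_default_rule(buffered, cue)
print(buffered)  # only the last cue is left
```

Each cue overlaps its predecessor, so every append evicts the previous cue and only "C" remains in the buffer.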

Are there any objections to resolving this bug?
Comment 11 Adrian Bateman [MSFT] 2013-04-23 20:25:03 UTC
Discussed at the F2F meeting. Glenn will provide non-normative text to indicate that the format of the text track might override the default behaviour.
Comment 12 Glenn Adams 2013-05-07 15:30:21 UTC
(In reply to comment #11)
> Discussed at the F2F meeting. Glenn will provide non-normative text to
> indicate that the format of the text track might override the default
> behaviour.

I'm not completely finished reviewing the current spec text, but at first order I think we may need to qualify some of the terms used in 3.5.7 Coded Frame Processing step 1 sub-steps 12-13:

* spliced frame => audio spliced frame
* overlapped frame => overlapped audio frame
* overlapped frame presentation timestamp => overlapped audio frame presentation timestamp

Then, copy sub-steps 12-13 into new sub-steps 14-15, replacing 'audio' with 'text'.

Then, introduce new section 3.5.12 Text Splice Frame Algorithm with content [TBD - I will propose something shortly].

I think we also need to add notes under 3.5.7 steps 1 and 2 regarding presentation and decode timestamps for timed text "frames", since (1) these are often not identified as such in timed text formats, (2) some timed text formats require processing in order to serialize time intervals, and (3) some timed text formats have only an implied (or even no) presentation or decode timestamp. As an example of the latter, consider a "metadata" text track which is used to expose MPEG-2 PSI data (PAT, PMT).
Comment 13 Aaron Colwell 2013-05-23 18:31:07 UTC
Marking all pre-Last Call bugs
Comment 14 Aaron Colwell 2013-06-01 21:38:15 UTC
Changes committed
https://dvcs.w3.org/hg/html-media/rev/1ac9c2205a7b

Reworked coded frame processing algorithm so that audio, video, and text splicing are clearly marked now. Added Text Splice Frame algorithm as requested. It currently just truncates partially overlapped frames, which I think is a reasonable first stab at text track splicing behavior.
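
As a minimal Python sketch of the truncation behavior described here (a simplified illustration, not the normative algorithm text): a buffered cue that is partially overlapped by a newly appended cue is shortened to end where the new cue starts. The name `text_splice` and the tuple layout are hypothetical.

```python
def text_splice(overlapped, new_cue):
    """Shorten a partially overlapped cue so it ends where new_cue starts.

    Cues are (start, end, text) tuples with times in seconds.
    A cue that new_cue does not start inside is returned unchanged.
    """
    o_start, o_end, o_text = overlapped
    n_start, _, _ = new_cue
    if o_start < n_start < o_end:
        return (o_start, n_start, o_text)
    return overlapped


movie_cue = (0.0, 10.0, "movie subtitle")
ad_cue = (6.0, 12.0, "ad subtitle")
print(text_splice(movie_cue, ad_cue))  # movie cue now ends at 6.0
```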

I also added the note indicating that special rules may apply for determining text track presentation and decode timestamps.

Please let me know if you'd like further changes made.
Comment 15 Glenn Adams 2013-06-25 14:26:16 UTC
(In reply to comment #14)
> Changes committed
> https://dvcs.w3.org/hg/html-media/rev/1ac9c2205a7b
> 
> Reworked coded frame processing algorithm so that audio, video, and text
> splicing are clearly marked now. Added Text Splice Frame algorithm as
> requested. It currently just truncates partially overlapped frames which I
> think is a reasonable first stab at text track splicing behavior.
> 
> I also added the note indicating that special rules may apply for
> determining text track presentation and decode timestamps.
> 
> Please let me know if you'd like further changes made.

This looks fine to me. You can move to RESOLVED/FIXED IMO.