Bug 18400 - Define and document timestamp heuristics
Status: RESOLVED FIXED
Product: HTML WG
Classification: Unclassified
Component: Media Source Extensions
Assigned To: Mark Watson
Depends on: 18642
Reported: 2012-07-25 15:32 UTC by Aaron Colwell (c)
Modified: 2013-02-05 22:37 UTC (History)

Description Aaron Colwell (c) 2012-07-25 15:32:21 UTC
There are several situations where heuristics are needed to resolve issues with the timestamps in media segments. The following list indicates issues the Chrome team has encountered so far:

1. How close does the end of one media segment need to be to the beginning of another to be considered a gapless splice? Media segments can't always align exactly, especially in adaptive content, and they may be close without overlapping.

2. How far apart do track ranges need to be for the UA to consider media data to be missing? For example:  audio [5-10) video [5.033-10) and I seek to 5. Technically I don't have video @ t=5, but the UA should likely allow the seek to complete because 5.033 is "close enough".

3. How close do timestamps need to be to 0 to be equivalent to t=0? Content may not always start at exactly 0 so how much room do we want to allow here, if any? This may be related to #2, but I wanted to call it out just in case we wanted to handle the start time slightly differently.

4. How should the UA estimate the duration of a media segment if the last frame in the segment doesn't have duration information? (e.g. WebM clusters aren't currently required to have an explicit cluster duration. It's possible, but not required.)

5. How should SourceBuffer.buffered values be merged into a single HTMLMediaElement.buffered? Simple range intersection? Should heuristic values like estimated duration (#4) or "close enough" values (#2) be applied before computing the intersection?


Text needs to be added to the spec to address these questions.
Comment 1 Aaron Colwell (c) 2012-07-25 15:35:58 UTC
Assigning to Mark Watson since he said he'd be willing to work on this in http://lists.w3.org/Archives/Public/public-html-media/2012Jul/0067.html and we've discussed bits and pieces of this in other threads on the list.
Comment 2 Mark Watson 2012-08-06 19:23:07 UTC
Proposal inline below:

(In reply to comment #0)
> There are several situations where heuristics are needed to resolve issues with
> the timestamps in media segments. The following list indicates issues the
> Chrome team has encountered so far :
> 
> 1. How close does the end of one media segment need to be to the beginning of
> another to be considered a gapless splice? Media segments can't always align
> exactly, especially in adaptive content, and they may be close but don't
> overlap.

More generally, if there is a gap in the media data in a Source Buffer, the media element should play continuously across the gap if the duration of the gap is less than 2 (?) video frame intervals or less than 2 (?) audio frame durations. Otherwise the media element should pause and wait for receipt of data.

> 
> 2. How far apart do track ranges need to be for the UA to consider media data
> to be missing? For example:  audio [5-10) video [5.033-10) and I seek to 5.
> Technically I don't have video @ t=5, but the UA should likely allow the seek
> to complete because 5.033 is "close enough".

This is covered by the rule above for (1).

If there is media within 2 (?) video frame intervals or 2 (?) audio frame durations of the seek position then playback can begin.
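
For illustration, the rule for (1) and its use for seeking could be sketched roughly as below. This is only a sketch under assumptions: the function names, the 30 fps frame interval, and the default of 2 intervals (still marked "(?)" above) are made up for the example and are not spec text. It only looks for media that starts shortly after the seek position.

```python
from typing import List, Tuple

Range = Tuple[float, float]  # half-open [start, end), in seconds


def gap_is_playable(gap_s: float, frame_interval_s: float,
                    max_intervals: float = 2.0) -> bool:
    """True if a gap is short enough to play straight across.

    max_intervals=2.0 mirrors the tentative "2 (?)" in the proposal.
    """
    return gap_s < max_intervals * frame_interval_s


def can_start_at(position_s: float, ranges: List[Range],
                 frame_interval_s: float) -> bool:
    """True if playback may begin at position_s for one track."""
    for start, end in ranges:
        if start <= position_s < end:
            return True  # media is actually present at the position
        if position_s < start and gap_is_playable(start - position_s,
                                                  frame_interval_s):
            return True  # media starts "close enough" after the position
    return False


# The example from item 2: video [5.033-10), seek to 5, assumed 30 fps.
VIDEO_INTERVAL_S = 1 / 30
seek_ok = can_start_at(5.0, [(5.033, 10.0)], VIDEO_INTERVAL_S)
seek_far = can_start_at(4.0, [(5.033, 10.0)], VIDEO_INTERVAL_S)
```

Here seek_ok is True because the 33ms gap is under two 30 fps frame intervals, while seek_far is False because the gap is over a second.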

> 
> 3. How close do timestamps need to be to 0 to be equivalent to t=0? Content may
> not always start at exactly 0 so how much room do we want to allow here, if
> any? This may be related to #2, but I wanted to call it out just in case we
> wanted to handle the start time slightly differently.

I believe the start time should be zero. If the first frame is at time 33ms, then that means you should render 33ms of blank screen, then the first frame. Rules for whether playback can start are as above.

> 
> 4. How should the UA estimate the duration of a media segment if the last frame
> in the segment doesn't have duration information? (ie WebM clusters aren't
> required to have an explicit cluster duration. It's possible, but not required
> currently)

The rules above enable the UA to determine whether there is a real gap between segments. This obviates the need to know segment duration except for determination of the content duration. The content duration should just be set to the timestamp of the last video frame or the end of the last audio frame, whichever is later.

> 
> 5. How should SourceBuffer.buffered values be merged into a single
> HTMLMediaElement.buffered? Simple range intersection? Should heuristic values
> like estimated duration (#4) or "close enough" values (#2) be applied before
> computing the intersection?

The heuristics of (1) should be used to determine SourceBuffer.buffered, i.e. gaps of less than 2 frame intervals do not result in disjoint intervals in the SourceBuffer.buffered array.

Then the intersection of the SourceBuffer.buffered arrays for the active source buffers appears as the HTMLMediaElement.buffered.
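
A rough sketch of that two-step computation, for illustration only. The helper names, the sample ranges, and the 20ms audio / 30 fps video frame sizes are assumptions for the example; ranges are half-open [start, end) pairs in seconds.

```python
from typing import List, Tuple

Range = Tuple[float, float]  # half-open [start, end), in seconds


def coalesce(ranges: List[Range], tolerance_s: float) -> List[Range]:
    """Merge neighbouring ranges whose gap is smaller than tolerance_s."""
    merged: List[Range] = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] < tolerance_s:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def intersect(a: List[Range], b: List[Range]) -> List[Range]:
    """Intersect two sorted, non-overlapping range lists."""
    out: List[Range] = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever range finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


# Audio has a 20ms hole (under 2 assumed 20ms frames); video has a real 2s hole.
audio = coalesce([(0.0, 5.0), (5.02, 10.0)], tolerance_s=2 * 0.02)
video = coalesce([(0.0, 4.0), (6.0, 10.0)], tolerance_s=2 / 30)
element_buffered = intersect(audio, video)
```

With these inputs the audio hole disappears from its buffered ranges, and the element reports only the two spans that both tracks cover.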

> 
> 
> Text needs to be added to the spec to address these questions.

Comments first please and then I'll propose some text.
Comment 3 Mark Watson 2012-08-13 16:13:01 UTC
A further point occurred to me on this one: there may be container formats which can explicitly indicate that there is no media between certain timestamps. For example, a video format which could indicate that a block spanned time x to x+2s even though the last frame in the block is timestamped 1.8s, say.

In particular, subtitle/text formats may have this property.

If the container has this property, then the heuristics proposed above should not be used: there should be no gaps in the timeline at all.

In the ISO File Format case for adaptive streaming, this information is actually available from the Segment Index (but not in the Movie Fragments).

Perhaps we ought to be able to specify the segment duration explicitly in the append call? This would allow the application to communicate what it knows about block sizes to the UA.
Comment 4 Aaron Colwell (c) 2012-08-13 20:38:03 UTC
comments inline..

(In reply to comment #2)
> Proposal inline below:
> 
> (In reply to comment #0)
> > There are several situations where heuristics are needed to resolve issues with
> > the timestamps in media segments. The following list indicates issues the
> > Chrome team has encountered so far :
> > 
> > 1. How close does the end of one media segment need to be to the beginning of
> > another to be considered a gapless splice? Media segments can't always align
> > exactly, especially in adaptive content, and they may be close but don't
> > overlap.
> 
> More generally, if there is a gap in the media data in a Source Buffer, the
> media element should play continuously across the gap if the duration of the
> gap is less than 2 (?) video frame intervals or less than 2 (?) audio frame
> durations. Otherwise the media element should pause and wait for receipt of
> data.

[acolwell] Sounds like a reasonable start. How are the "video frame interval" and "audio frame duration" determined? Media segments could have different frame rates, and codecs like Vorbis have variable audio frame durations (i.e. long & short overlap windows).

> 
> > 
> > 2. How far apart do track ranges need to be for the UA to consider media data
> > to be missing? For example:  audio [5-10) video [5.033-10) and I seek to 5.
> > Technically I don't have video @ t=5, but the UA should likely allow the seek
> > to complete because 5.033 is "close enough".
> 
> This is covered by the rule above for (1).
> 
> If there is media within 2 (?) video frame intervals or 2 (?) audio frame
> durations of the seek position then playback can begin.

[acolwell] I agree.

> 
> > 
> > 3. How close do timestamps need to be to 0 to be equivalent to t=0? Content may
> > not always start at exactly 0 so how much room do we want to allow here, if
> > any? This may be related to #2, but I wanted to call it out just in case we
> > wanted to handle the start time slightly differently.
> 
> I believe the start time should be zero. If the first frame is at time 33ms,
> then that means you should render 33ms of blank screen, then the first frame.
> Rules for whether playback can start are as above.

[acolwell] I agree.

> 
> > 
> > 4. How should the UA estimate the duration of a media segment if the last frame
> > in the segment doesn't have duration information? (ie WebM clusters aren't
> > required to have an explicit cluster duration. It's possible, but not required
> > currently)
> 
> The rules above enable the UA to determine whether there is a real gap between
> segments. This obviates the need to know segment duration except for
> determination of the content duration. The content duration should just be set
> to the timestamp of the last video frame or the end of the last audio frame,
> whichever is later.

[acolwell] This becomes more complicated when overlaps are involved. Without knowing the actual duration of segments it becomes tricky to resolve certain kinds of overlaps. I'll try to provide an example to illustrate the problem.


Initial source buffer state.
+-----------+--+--+----------+
:A          |A |A |A         |  
+-----------+--+--+----------+

A new segment gets appended and we don't know its duration.
+--------+-???
:B       |B     
+--------+-???  

Resolve the overlap and assume the end of the segment goes until the next frame.
+--------+--+--+--+----------+
:B       |B |A |A |A         | 
+--------+--+--+--+----------+ 

Append the segment that is supposed to be right after B.
               +------+------+
               :C     |C     | 
               +------+------+ 

Resolve the overlap.
+--------+--+--+------+------+
:B       |B |A :C     |C     | 
+--------+--+--+------+------+ 

If B & C had been appended to an empty source buffer, you would have gotten this, which is likely what the application intended.
+--------+-----+------+------+
:B       |B    :C     |C     |
+--------+-----+------+------+

This is not a hypothetical example. We actually ran into this problem while trying to overlap Vorbis data.

Note that a "wait until the next segment is appended" rule won't help here because segments are not required to be appended in order and discontinuous appends are not explicitly signalled. 

Assuming a duration of 1-2 frame intervals can also get you into trouble because it may cause a keyframe to get dropped which could result in the loss of a whole GOP.
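
The append sequence in the diagrams can be reproduced with a small simulation. This is a sketch under assumptions, not Chrome's implementation: frames are reduced to (start timestamp, label) pairs, the timestamps are arbitrary units read off the column positions in the diagrams, and the last frame of an appended segment is assumed to run until the next buffered frame, so only buffered frames starting between the segment's first and last timestamps are replaced.

```python
from typing import List, Tuple

Frame = Tuple[float, str]  # (start timestamp, label of the segment it came from)


def append_segment(buffer: List[Frame], segment: List[Frame]) -> List[Frame]:
    """Append a segment whose last frame has no known duration."""
    first_ts = segment[0][0]
    last_ts = segment[-1][0]
    # Assumed overlap rule: drop buffered frames that start inside
    # [first_ts, last_ts]; anything starting later is kept as-is.
    kept = [f for f in buffer if not (first_ts <= f[0] <= last_ts)]
    return sorted(kept + segment)


def labels(buffer: List[Frame]) -> str:
    """Compact view of the buffer, e.g. 'BBACC'."""
    return "".join(label for _, label in buffer)


seg_a = [(0.0, "A"), (12.0, "A"), (15.0, "A"), (18.0, "A")]
seg_b = [(0.0, "B"), (9.0, "B")]
seg_c = [(15.0, "C"), (22.0, "C")]

# B then C appended over existing A data: one stale A frame survives.
over_existing = append_segment(append_segment(seg_a, seg_b), seg_c)

# B then C appended to an empty buffer: what the application intended.
on_empty = append_segment(append_segment([], seg_b), seg_c)
```

labels(over_existing) gives "BBACC" while labels(on_empty) gives "BBCC", which is the mismatch shown in the last two diagrams.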

> 
> > 
> > 5. How should SourceBuffer.buffered values be merged into a single
> > HTMLMediaElement.buffered? Simple range intersection? Should heuristic values
> > like estimated duration (#4) or "close enough" values (#2) be applied before
> > computing the intersection?
> 
> The heuristics of (1) should be used to determine SourceBuffered.buffered. i.e.
> gaps of less than 2 frame intervals do not result in disjoint intervals in the
> SourceBuffered.buffered array.
> 
> Then the intersection of the SourceBuffered.buffered arrays for the active
> source buffers appears as the HTMLMediaElement.buffered.

[acolwell] Ok. Does this also apply after endOfStream() is called? Currently Chrome returns the intersection for all ranges when in "open", but uses the intersection plus the union of the end ranges if they overlap in "ended". The main reason was to handle the case where the streams are slightly different lengths. The union on the last overlapping range at least allows buffered to reflect playing out to the duration if the stream lengths differ by more than 2 intervals.
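
One possible reading of the "open" vs. "ended" behaviour described above, sketched for two tracks. This is an interpretation for illustration, not Chrome's actual code: the function names and the sample ranges are assumptions, and inputs are assumed to be sorted, non-overlapping [start, end) ranges in seconds.

```python
from typing import List, Tuple

Range = Tuple[float, float]  # half-open [start, end), in seconds


def intersect(a: List[Range], b: List[Range]) -> List[Range]:
    """Intersect two sorted, non-overlapping range lists."""
    out: List[Range] = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def element_buffered(audio: List[Range], video: List[Range],
                     ended: bool) -> List[Range]:
    """Intersection while "open"; in "ended", stretch the final range to the
    union of the two end ranges, but only if those end ranges overlap."""
    result = intersect(audio, video)
    if ended and audio and video:
        last_a, last_v = audio[-1], video[-1]
        if max(last_a[0], last_v[0]) < min(last_a[1], last_v[1]):
            # The end ranges overlap, so result[-1] is their intersection.
            start, _ = result[-1]
            result[-1] = (start, max(last_a[1], last_v[1]))
    return result


# Video runs 0.3s longer than audio.
audio_ranges = [(0.0, 10.0)]
video_ranges = [(0.0, 10.3)]
while_open = element_buffered(audio_ranges, video_ranges, ended=False)
after_end = element_buffered(audio_ranges, video_ranges, ended=True)
```

While "open" the element would report buffered up to 10.0; after endOfStream() it would report up to 10.3, so buffered covers playback out to the duration.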
Comment 5 Mark Watson 2012-08-14 21:15:07 UTC
(In reply to comment #4)
> comments inline..
> 
> (In reply to comment #2)
> > Proposal inline below:
> > 
> > (In reply to comment #0)
> > > There are several situations where heuristics are needed to resolve issues with
> > > the timestamps in media segments. The following list indicates issues the
> > > Chrome team has encountered so far :
> > > 
> > > 1. How close does the end of one media segment need to be to the beginning of
> > > another to be considered a gapless splice? Media segments can't always align
> > > exactly, especially in adaptive content, and they may be close but don't
> > > overlap.
> > 
> > More generally, if there is a gap in the media data in a Source Buffer, the
> > media element should play continuously across the gap if the duration of the
> > gap is less than 2 (?) video frame intervals or less than 2 (?) audio frame
> > durations. Otherwise the media element should pause and wait for receipt of
> > data.
> 
> [acolwell] Sounds like a reasonable start. How is the "video frame interval"
> and "audio frame duration" determined? Media segments could have different
> frame rates, and codecs like Vorbis have variable audio frame durations (ie
> long & short overlap windows).

I guess it would be fine to say that this is the immediately previous video frame interval or audio frame duration. It's just a heuristic after all.
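
As a sketch of what "immediately previous" could mean in practice (the function names and the factor of 2 are assumptions for the example):

```python
from typing import List, Optional


def previous_frame_interval(timestamps_s: List[float]) -> Optional[float]:
    """Interval between the last two frames seen, or None if unknown."""
    if len(timestamps_s) < 2:
        return None
    return timestamps_s[-1] - timestamps_s[-2]


def gap_tolerance(timestamps_s: List[float],
                  max_intervals: float = 2.0) -> Optional[float]:
    """Largest gap that would still be played across, based on the most
    recent frame interval rather than a track-wide frame rate."""
    interval = previous_frame_interval(timestamps_s)
    return None if interval is None else max_intervals * interval


# Frame spacing changes from 0.5s to 0.25s; only the latest spacing counts.
tolerance = gap_tolerance([0.0, 0.5, 0.75])
```

Because only the last interval is used, a change of frame rate or a short Vorbis window changes the tolerance immediately.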

> > 
> > > 
> > > 4. How should the UA estimate the duration of a media segment if the last frame
> > > in the segment doesn't have duration information? (ie WebM clusters aren't
> > > required to have an explicit cluster duration. It's possible, but not required
> > > currently)
> > 
> > The rules above enable the UA to determine whether there is a real gap between
> > segments. This obviates the need to know segment duration except for
> > determination of the content duration. The content duration should just be set
> > to the timestamp of the last video frame or the end of the last audio frame,
> > whichever is later.
> 
> [acolwell] This becomes more complicated when overlaps are involved. Without
> knowing the actual duration of segments it becomes tricky to resolve certain
> kinds of overlaps. I'll try to provide an example to illustrate the problem.
> 
> 
> Initial source buffer state.
> +-----------+--+--+----------+
> :A          |A |A |A         |  
> +-----------+--+--+----------+
> 
> A new segment gets appended and we don't know it's duration.
> +--------+-???
> :B       |B     
> +--------+-???  
> 
> Resolve the overlap and assume the end of the segment goes until the next
> frame.
> +--------+--+--+--+----------+
> :B       |B |A |A |A         | 
> +--------+--+--+--+----------+ 
> 
> Append the segment that is supposed to be right after B.
>                +------+------+
>                :C     |C     | 
>                +------+------+ 
> 
> Resolve the overlap.
> +--------+--+--+------+------+
> :B       |B |A :C     |C     | 
> +--------+--+--+------+------+ 
> 
> If B & C had been appended on a clear source buffer you would have gotten this
> which is likely what the application intended.
> +--------+-----+------+------+
> :B       |B    :C     |C     |
> +--------+-----+------+------+
> 
> This is not a hypothetical example. We actually ran into this problem while
> trying to overlap Vorbis data.
> 
> Note that a "wait until the next segment is appended" rule won't help here
> because segments are not required to be appended in order and discontinuous
> appends are not explicitly signalled. 
> 
> Assuming a duration of 1-2 frame intervals can also get you into trouble
> because it may cause a keyframe to get dropped which could result in the loss
> of a whole GOP.

I see your point. In DASH there are detailed rules that streams must conform to in order to avoid this problem. I don't see any other way to avoid it than to have such rules around the content itself.

If the example above was video, and the first A that follows B is an I-Frame, then assuming a later stop time for B would mean that the append of B would stomp this I-Frame and you would not be able to play back. If the first frame of some block of data (A ...) strictly follows the last frame of something else (B ... B), then we can't really do anything other than put all those frames in the buffer, even if we end up with a very short frame interval.

So, yes, you end up with different outcomes depending on what you do. For video, provided all the frames are really from the same source material, it should not be a problem.

> 
> > 
> > > 
> > > 5. How should SourceBuffer.buffered values be merged into a single
> > > HTMLMediaElement.buffered? Simple range intersection? Should heuristic values
> > > like estimated duration (#4) or "close enough" values (#2) be applied before
> > > computing the intersection?
> > 
> > The heuristics of (1) should be used to determine SourceBuffered.buffered. i.e.
> > gaps of less than 2 frame intervals do not result in disjoint intervals in the
> > SourceBuffered.buffered array.
> > 
> > Then the intersection of the SourceBuffered.buffered arrays for the active
> > source buffers appears as the HTMLMediaElement.buffered.
> 
> [acolwell] Ok. Does this also apply after endOfStream() is called? Currently
> Chrome returns the intersection for all ranges when in "open", but uses the
> intersection plus the union of the end ranges if they overlap in "ended". The
> main reason was to handle the case where the streams are slightly different
> lengths. The union on the last overlapping range at least allows buffered to
> reflect playing out to the duration if the streams are farther than 2 intervals
> different.

What you describe sounds right for endOfStream().
Comment 6 Adrian Bateman [MSFT] 2012-10-22 01:42:11 UTC
Next step: reorganize append() description into sub-algorithms in order to introduce this proposal.
Comment 7 Adrian Bateman [MSFT] 2012-10-22 16:00:32 UTC
The append() algorithm will be broken up in bug 18642 and this will allow the proposal here to be included.
Comment 8 Aaron Colwell (c) 2012-12-28 21:47:29 UTC
Since out-of-order appends require an abort() call now, I believe this isn't as big of a problem anymore. With the current spec text, appending data that is before the last data appended w/o an intervening abort() will trigger an error. Appending data far beyond the last data won't generate an error, but will likely lead to a bad user experience. 

Should we still add a heuristic that requires media segments to be within 2 frame intervals of each other to prevent developers from accidentally creating large gaps in the media when they forget to call abort()?
Comment 9 Aaron Colwell (c) 2013-01-31 18:11:43 UTC
Unless anyone objects, I'm just going to add a step to the coded frame processing algorithm (https://dvcs.w3.org/hg/html-media/raw-file/default/media-source/media-source.html#sourcebuffer-coded-frame-processing) that triggers an error if it encounters a segment that starts more than 100ms from the 'last decode timestamp'.

I think this should be sufficient to prevent developers from accidentally creating large hidden holes in the content when they forget to call abort() for an out-of-order append. It also allows a little bit of slop if the segment start time doesn't exactly match the previous segment's end time.
Comment 10 Aaron Colwell (c) 2013-02-05 22:37:31 UTC
Changes committed.
https://dvcs.w3.org/hg/html-media/rev/77975abeec41

Added check to signal an error on appends that are more than 100ms beyond the last frame in the previous append.
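
For readers who want the shape of the check, a minimal sketch follows. It is not the committed spec text: the names are hypothetical, and it models only the forward case (a segment starting too far beyond the last frame); appends that go backwards without abort() are handled by the existing out-of-order rule mentioned in comment 8 and are not modelled here.

```python
from typing import Optional

MAX_APPEND_GAP_S = 0.1  # the 100ms limit from comment 9


class AppendGapError(Exception):
    """Hypothetical error type for an append that leaves a large hole."""


def exceeds_append_gap(segment_start_s: float,
                       last_decode_timestamp_s: Optional[float]) -> bool:
    """True if the segment starts more than 100ms beyond the last frame."""
    if last_decode_timestamp_s is None:
        # Nothing appended yet, or abort() reset the state: no constraint.
        return False
    return segment_start_s - last_decode_timestamp_s > MAX_APPEND_GAP_S


def check_append(segment_start_s: float,
                 last_decode_timestamp_s: Optional[float]) -> None:
    """Raise instead of silently creating a hidden hole in the content."""
    if exceeds_append_gap(segment_start_s, last_decode_timestamp_s):
        raise AppendGapError(
            f"segment starts at {segment_start_s}s, more than "
            f"{MAX_APPEND_GAP_S}s after {last_decode_timestamp_s}s")


check_append(10.05, 10.0)  # 50ms of slop: accepted
check_append(42.0, None)   # first append after abort(): accepted
```

An append starting at 10.5 after a last decode timestamp of 10.0 would raise AppendGapError under this sketch.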