The track buffer ranges variable is currently underspecified, especially for content that contains B-frames. For example, in an I P B B B coded frame sequence where each coded frame has a duration of 1 second, the P-frame follows the three B-frames in presentation order, so if only the I and P coded frames were appended, one could argue that the track buffer ranges should be [0,1) [4,5). Another interpretation is that the track buffered ranges should be [0,5), since the B-frames are technically optional and decoding could proceed just fine through this region of the timeline even if the B-frames were never appended. Steps that explicitly update the track buffer ranges variable should be added to the coded frame processing algorithm and the coded frame removal algorithm to avoid this potential source of interoperability problems.
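[Editor's note] To make the two readings concrete, here is a minimal TypeScript sketch of the first, strict interpretation; all type and function names are illustrative assumptions, not spec terms. The second interpretation would report the single range [0,5) for the same input.

// Strict interpretation: buffered ranges cover only the presentation
// intervals of frames that were actually appended.
interface CodedFrame {
  pts: number;      // presentation timestamp, in seconds
  duration: number; // in seconds
}

function strictRanges(frames: CodedFrame[]): Array<[number, number]> {
  const sorted = [...frames].sort((a, b) => a.pts - b.pts);
  const ranges: Array<[number, number]> = [];
  for (const f of sorted) {
    const last = ranges[ranges.length - 1];
    if (last && f.pts <= last[1]) {
      last[1] = Math.max(last[1], f.pts + f.duration); // extend a contiguous run
    } else {
      ranges.push([f.pts, f.pts + f.duration]);        // gap: start a new range
    }
  }
  return ranges;
}

const appended: CodedFrame[] = [
  { pts: 0, duration: 1 }, // the I-frame
  { pts: 4, duration: 1 }, // the P-frame; the B-frames covering [1,4) were never appended
];
console.log(strictRanges(appended)); // [[0, 1], [4, 5]]  -> "[0,1) [4,5)"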
I think the principle should be that if the playback would stall at some point, then this would be indicated as a gap in the buffer ranges, but if playback would continue then there should be no gap. Whether playback stalls or continues through a group of missing B-frames could be implementation dependent, but the buffered ranges should correctly indicate what the implementation is going to do if no more data is appended.
(In reply to Mark Watson from comment #1)
> I think the principle should be that if the playback would stall at some
> point, then this would be indicated as a gap in the buffer ranges, but if
> playback would continue then there should be no gap.

Agreed.

> Whether playback stalls or continues through a group of missing B-frames
> could be implementation dependent, but the buffered ranges should correctly
> indicate what the implementation is going to do if no more data is appended.

Agreed. If an implementation chooses to stall, though, it seems like the P-frame should not be included in the buffered ranges until the B-frames arrive, since exposing the P-frame portion would imply that we could transition to HAVE_CURRENT_DATA on a seek to the P-frame. It isn't clear to me that systems which need the B-frames to arrive would actually return the P-frame for display.

Allowing the P-frame to be visible in the buffered ranges could be very confusing to application developers, especially if the SourceBuffer has tons of data after the P-frame. They might not understand why the media element doesn't transition to HAVE_FUTURE_DATA or HAVE_ENOUGH_DATA when it appears there is more than enough data to proceed. By keeping the P-frame out of the ranges until the B-frames arrive, I think it would be clearer to app developers that the SourceBuffer is still waiting for more data even though the P-frame was appended. WDYT?
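[Editor's note] A toy TypeScript model may help show the confusion described here: readyState decisions keyed off buffered ranges would report playable data at the P-frame even when decode would stall. Everything in this sketch (function names, the epsilon, the simplified mapping) is an illustrative assumption, not how any particular user agent computes readyState.

// HTMLMediaElement readyState constant values.
const HAVE_METADATA = 1;
const HAVE_CURRENT_DATA = 2;
const HAVE_FUTURE_DATA = 3;

type Range = [number, number]; // [start, end) in seconds

// Simplified model: readyState derived purely from buffered ranges.
function readyStateAt(ranges: Range[], currentTime: number): number {
  const r = ranges.find(([start, end]) => currentTime >= start && currentTime < end);
  if (!r) return HAVE_METADATA;              // nothing buffered at currentTime
  const ahead = r[1] - currentTime;          // buffered data beyond currentTime
  return ahead > 0.1 ? HAVE_FUTURE_DATA : HAVE_CURRENT_DATA;
}

// If the lone P-frame's interval [4,5) is exposed, a seek to t=4 looks
// playable even though decode may stall waiting for the missing B-frames:
console.log(readyStateAt([[0, 1], [4, 5]], 4)); // 3 (HAVE_FUTURE_DATA)
// Keeping [4,5) hidden until the B-frames arrive signals the wait instead:
console.log(readyStateAt([[0, 1]], 4));         // 1 (HAVE_METADATA)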
I'm noting some further things I believe need clarification in track buffer range calculation in the spec. By example:

1) Append a video keyframe buffer A, whose presentation interval is roughly:
   [A..............)
2) Append another video keyframe buffer B, whose presentation interval is contained completely within A's, i.e. (B's PTS) >= ((A's PTS) + (1 microsecond)) and (B's PTS + duration) < (A's PTS + duration):
        [B......)

What should the track's buffered ranges be at this point? Should A's duration be truncated to join the end of A with the beginning of B? Ambiguity exists; the spec isn't clear whether the track buffer contains:
   [A.[B......)....) ---> Render A, then B, then A again? Unlikely this is desired.
Or:
   [A)[B......) ---> Render A, then B, then done (and a possible gap is introduced until the next buffered range).

A further complexity is introduced if the initial append were followed by some dependent (non-key) frames:
   [A..............)[a1.....][a2.....][a3.....]
After appending B, the impact of the spec ambiguity increases. Should the track buffer then contain:
   [A.[B......)....)[a1.....][a2.....][a3.....] ---> Render keyframe A, then keyframe B, then back to keyframe A again, then dependent frames a1, a2, and a3? Unlikely this is desired, especially if both A and B ended at the same time.
Or:
   [A)[B......) ---> Render A, then B, then done (and even more likely there is a gap introduced until the next buffered range).

What should the buffered result be if an app issues Remove() to remove exactly the presentation interval for B? I think the sanest approach might be just:
   [A) ---> just the first tiny bit of A, then a potentially even larger gap.

Finally, the spec is not clear regarding how much of a "gap" can be introduced (in audio or video tracks) before a previously contiguous buffered range is split in two. In general, how close to a range must a coded frame group's presentation interval be to be considered continuous with, or contained within, that range? The coded frame processing algorithm describes this clearly for parsing a stream of new frames, but the spec is unclear regarding exactly how other operations like scattered appends, overlapped appends, and removes result in buffered range(s). At least one implementation (Chromium) tracks a maximum inter-frame distance for each track and uses it in a heuristic to determine range membership/continuity.
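[Editor's note] As a rough illustration of the kind of heuristic described in the last paragraph above, here is a TypeScript sketch; the names, structure, and example numbers are assumptions for exposition, not Chromium's actual code.

// Per-track continuity state: track the largest observed distance between
// the start times of successive frames.
interface TrackState {
  maxInterFrameDistance: number; // in seconds
}

// Called as frames are appended, in decode order, to refine the estimate.
function observeFrame(state: TrackState, prevPts: number, pts: number): void {
  state.maxInterFrameDistance = Math.max(state.maxInterFrameDistance, pts - prevPts);
}

// Two buffered spans are treated as one contiguous range when the gap
// between them does not exceed the observed maximum inter-frame distance.
function isContinuous(state: TrackState, endOfFirst: number, startOfSecond: number): boolean {
  return startOfSecond - endOfFirst <= state.maxInterFrameDistance;
}

// Example with 30fps content (~0.033s between frame starts): a tiny gap left
// by a remove keeps the range whole, while a large gap splits it.
const track: TrackState = { maxInterFrameDistance: 1 / 30 };
console.log(isContinuous(track, 10.0, 10.02)); // true  -> ranges merge
console.log(isContinuous(track, 10.0, 10.5));  // false -> ranges split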
In the context of Chromium and portions of comment #3, I've put together some illustrative tests at https://codereview.chromium.org/1041983002/
(In reply to Matt Wolenetz from comment #3)
> 1) Append a video keyframe buffer A, whose presentation interval is roughly:
>    [A..............)
> 2) Append another video keyframe buffer B, whose presentation interval is
>    contained completely within A's:
>         [B......)
>
> What should the track's buffered ranges be at this point? [...]
>    [A.[B......)....) ---> Render A, then B, then A again? Unlikely this is
> desired.
> Or:
>    [A)[B......) ---> Render A, then B, then done (and a possible gap is
> introduced until the next buffered range).

This second option is what I'd expect to happen, since this is essentially an overlap. It also seems consistent with the audio behavior to me.

> A further complexity is introduced if the initial append were followed by
> some dependent (non-key) frames:
>    [A..............)[a1.....][a2.....][a3.....]
> After appending B, the impact of the spec ambiguity increases. [...]

The second option seems appropriate here as well. The way I see it, you are essentially describing an overlap situation. Since a frame is being inserted between A and a1, the decode dependency chain is broken, so this seems equivalent to other overlap scenarios.

> What should the buffered result be if an app issues Remove() to remove
> exactly the presentation interval for B? I think the sanest approach might
> be just:
>    [A) ---> just the first tiny bit of A, then a potentially even larger gap.

Seems reasonable to me. I think these two issues should be moved into a separate bug, though. This bug was originally intended just to specify how the contents of the track buffer are represented as a TimeRanges object, so that the algorithms that reference the data buffered in track buffers are well defined. What you are describing has more to do with modifying the coded frame processing algorithm to truncate coded frames and remove data from the track buffer.

> Finally, the spec is not clear regarding how much of a "gap" can be
> introduced (in audio or video tracks) before a previously contiguous
> buffered range is split in two. [...]

In general, I believe the "coded frame group" concept captures this, since a group represents a set of adjacent frames that aren't considered to have any gaps. Step 6 of the coded frame processing algorithm somewhat addresses it as well, since it determines what triggers the beginning of a new coded frame group. I agree that Chrome likely has unspecified behavior that essentially merges adjacent coded frame groups that are "close enough" to each other. In general, I think this was intended to be an extension of the "2 frame duration" rule that appears in step 6. I definitely think addressing that issue should be a separate bug.

> At least one implementation (Chromium) tracks a maximum inter-frame distance
> for each track and uses it in a heuristic to determine range
> membership/continuity.

Yeah. This will probably need to be added to the spec in some form, just to ensure interoperability when appends occur in random order. I would encourage you to take a fresh look at this and try not to be biased by Chrome's current implementation. We definitely need some form of heuristic here, but a presentation-global max inter-frame distance may not be the best option, especially when mixing content with different frame rates. Perhaps a max within the current coded frame group might be a better alternative. I don't know.
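[Editor's note] For reference, the step-6 discontinuity check discussed above can be sketched as follows; function and parameter names are mine, and first-frame handling is simplified. A frame starts a new coded frame group when its decode timestamp jumps backwards or lands more than two frame durations past the previous frame.

// Simplified "2 frame duration" rule from step 6 of the coded frame
// processing algorithm.
function startsNewCodedFrameGroup(
  lastDecodeTimestamp: number | null, // per-track; null before any frame is seen
  lastFrameDuration: number,          // duration of the previously processed frame
  decodeTimestamp: number,            // DTS of the incoming frame
): boolean {
  if (lastDecodeTimestamp === null) return false;         // no baseline yet
  if (decodeTimestamp < lastDecodeTimestamp) return true; // timestamps went backwards
  return decodeTimestamp - lastDecodeTimestamp > 2 * lastFrameDuration;
}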
This bug has been migrated to the GitHub issue tracker. Please follow/update progress using the GitHub issue: https://github.com/w3c/media-source/issues/15