This is an archived snapshot of W3C's public Bugzilla bug tracker, decommissioned in April 2019.
Currently the text in the spec says that media segments can be appended in any order, but it doesn't outline any way to signal that a non-contiguous append is going to happen. The initial thinking was that the media segment's starting timestamp would provide enough information to detect this automatically. Our work on updating the Chromium implementation has revealed that this is only true if you always know the exact duration of the media segment and the duration of the last blocks in the segment for each track.

This is not the case for WebM, at least, given our current definitions for that format. Typically no duration information is stored in the cluster, and block durations are determined by computing the difference between adjacent blocks. There is a way in WebM to specify block durations, but there is no requirement for the last block in the cluster to supply this information. Non-contiguous appends can throw off this block duration calculation logic. An append for data earlier in the timeline can be detected, but requires dropping the last blocks in the previous cluster because we won't get the blocks from the next contiguous cluster to compute their duration. Appending data for a point later in the timeline can cause incorrect block duration calculations because the UA has no way to differentiate this new segment from the one that is actually the next contiguous one.

This problem raises several questions:

1. Can this happen in ISO BMFF or other formats we might want to support?
2. How should we signal that a discontinuity is happening?
3. Should the WebM section be updated to mandate duration information on the last frames? This requirement would likely exclude discontiguous appends on a majority of current WebM content.
4. Should we restrict the situations where non-contiguous appends can occur?
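[Editor's illustration] The block-duration problem described above can be sketched as follows. This is a hypothetical helper, not Chromium's actual demuxer code: it derives each block's duration as the difference to the next block's timestamp, which makes the last block's duration depend on data from the next contiguous cluster.

```javascript
// Sketch: deriving WebM block durations from adjacent timestamps when
// the container stores no BlockDuration. Timestamps are in milliseconds.
// The last block's duration needs the first timestamp of the *next
// contiguous* cluster; a non-contiguous append never supplies it.
function computeBlockDurations(timestampsMs, nextClusterFirstTsMs) {
  const durations = [];
  for (let i = 0; i < timestampsMs.length; i++) {
    if (i + 1 < timestampsMs.length) {
      durations.push(timestampsMs[i + 1] - timestampsMs[i]);
    } else if (nextClusterFirstTsMs !== undefined) {
      durations.push(nextClusterFirstTsMs - timestampsMs[i]);
    } else {
      // Unknown duration: the UA must drop this block or guess.
      durations.push(undefined);
    }
  }
  return durations;
}
```

With a contiguous follow-up cluster every duration is computable; without one, the final block's duration is simply unknown, which is why an earlier-in-timeline append forces dropping the previous cluster's last blocks.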
> 1. Can this happen in ISO BMFF or other formats we might want to support?

Not in ISO BMFF; each sample's duration is defined explicitly. Yes for MPEG-TS, though, AFAIK.
(In reply to comment #0)
> This problem raises several questions:
> 1. Can this happen in ISO BMFF or other formats we might want to support?

For ISO BMFF, no, as we have the specification now. The last sample in a standard ISO file can have indeterminate duration, but for movie fragments there is always a sample duration (the default in the Movie Extends applies if nothing else does).

> 2. How should we signal that a discontinuity is happening?

Another way to think of it is that each append places the video frames at their specified positions on the timeline. What is left to work out is whether the gap between the last frame of one segment and the first frame of the next is so large that there must be content missing and playback should stall when you get to that point.

Would it be OK to suggest that implementations apply heuristics when there is some doubt? For example, if the gap is bigger than twice the largest inter-frame gap observed.

> 3. Should the WebM section be updated to mandate duration information on the
> last frames? This requirement would likely exclude discontiguous appends on a
> majority of current WebM content.
> 4. Should we restrict the situations where non-contiguous appends can occur?

What kind of restriction? I can see four applications for a discontiguous append:

(a) The user seeks to some point in the future. The append probably occurs with the video element in a "stalled" state, with playback position equal to the start of the new segment.

(b) Video data for some segment happens to arrive at the client earlier than for the "next" segment, perhaps because the JS is getting fancy with parallel downloads etc. The segment filling the gap will arrive soon (and if it doesn't, playback should stall).

(c) In the live case, playback has fallen too far behind the "live leading edge" (due to playback stalls) and the player decides to skip a segment or two to catch up. It will need to set the playback position to the start of the skip-to segment. This is pretty much like a seek from the Media Element perspective.

(d) Imperfect segmentation on a bitrate switch means there is a missing frame or two between the end of one segment and the beginning of the next. Heuristics as suggested above essentially mean being lenient with imperfect segmentation and playing back with a couple of frame drops vs. stalling altogether.
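[Editor's illustration] The heuristic proposed in this comment (treat a gap as a real discontinuity only if it exceeds twice the largest observed inter-frame gap) could be sketched like this. The function name and the millisecond units are illustrative, not from any implementation:

```javascript
// Sketch of the proposed gap heuristic: a gap between the end of one
// appended segment and the start of the next is treated as a genuine
// discontinuity (i.e. playback should stall there) only if it is more
// than twice the largest inter-frame gap seen so far. All values in ms.
function isDiscontinuity(lastFrameEndMs, nextFrameStartMs, largestInterFrameGapMs) {
  const gap = nextFrameStartMs - lastFrameEndMs;
  return gap > 2 * largestInterFrameGapMs;
}
```

Under this rule, a missing frame or two at a bitrate-switch boundary (case (d) above) stays under the threshold and plays through with a small glitch, while a genuinely absent segment exceeds it and stalls playback.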
(In reply to comment #2)
> (In reply to comment #0)
> > 2. How should we signal that a discontinuity is happening?
>
> Another way to think of it is that each append places the video frames at
> their specified positions on the timeline. What is left to work out is
> whether the gap between the last frame of one segment and the first frame
> of the next is so large that there must be content missing and playback
> should stall when you get to that point.
>
> Would it be ok to suggest that implementations apply heuristics when there
> is some doubt? For example if the gap is bigger than twice the largest
> inter-frame gap observed.

If we are going to use heuristics, I want to make sure we define them well so that everyone implements them the same way. Using the largest inter-frame gap could work. We would have to figure out what to do if the first media segment appended only has a single frame and we don't know the duration. Don't laugh! I've seen files where the first cluster only contains the keyframe.

Perhaps we could define a reasonable initial value for the largest inter-frame gap, say 250ms, and use that until we actually see a multi-frame media segment, then update to the measured value. We may need to specify different starting values for audio and video, since 250ms seems rather high for a default audio frame length. Something closer to 30-60ms would probably be better.

> > 4. Should we restrict the situations where non-contiguous appends can occur?
>
> What kind of restriction? I can see four applications for a discontiguous
> append:
> (a) the user seeks to some point in the future. The append probably occurs
> with the video element in a "stalled" state with playback position equal to
> the start of the new segment.
> (b) video data for some segment happens to arrive at the client earlier
> than for the "next" segment - perhaps because the JS is getting fancy with
> parallel downloads etc. The segment filling the gap will arrive soon (and
> if it doesn't playback should stall)
> (c) In the live case, playback has fallen too far behind the "live leading
> edge" (due to playback stalls) and the player decides to skip a segment or
> two to catch up. It will need to set the playback position to the start of
> the skip-to segment. This is pretty much like a seek from the Media Element
> perspective
> (d) Imperfect segmentation on a bitrate switch means there is a missing
> frame or two between the end of one segment and the beginning of the next.
> Heuristics as suggested above essentially means being lenient with
> imperfect segmentation and playing back with a couple of frame drops vs
> stalling altogether.

Yes. I think your suggestion above is probably the best way to handle all these cases.
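[Editor's illustration] The per-track bootstrapping discussed in this comment could be sketched as below. The class and property names are hypothetical, and the 50ms audio default is one illustrative pick from the suggested 30-60ms range:

```javascript
// Sketch: track the largest observed inter-frame gap per track, seeded
// with the default starting values discussed in the thread (250ms for
// video; 50ms for audio, an assumed value within the suggested 30-60ms
// range). The default is kept until a multi-frame segment is observed.
class GapTracker {
  constructor(kind) {
    this.largestGapMs = kind === 'audio' ? 50 : 250;
    this.measured = false; // false until a real gap has been observed
  }
  observeSegment(timestampsMs) {
    // A single-frame segment (e.g. a cluster holding only the keyframe)
    // yields no inter-frame gap, so the current value is kept.
    if (timestampsMs.length < 2) return;
    for (let i = 1; i < timestampsMs.length; i++) {
      const gap = timestampsMs[i] - timestampsMs[i - 1];
      if (!this.measured || gap > this.largestGapMs) {
        this.largestGapMs = gap; // first measurement replaces the default
        this.measured = true;
      }
    }
  }
}
```

The first measured gap replaces the default outright (rather than taking a max with it), matching the suggestion to "update to the measured value" once a multi-frame segment is seen.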
Based on the discussion, I think leaving the text as is and allowing out-of-order appends without signalling is fine. MarkW is defining the timestamp heuristics needed as part of Bug 18400 (https://www.w3.org/Bugs/Public/show_bug.cgi?id=18400).