A Media Segment may contain an audio-video multiplex.
The transition between two Media Segments should take place on a video frame boundary.
The start times of audio and video essence may be offset within a Media Segment. For instance, the first audio sample may be offset from the first video sample to facilitate seamless splicing.
The MSE specification implies in the diagram of Section 2.8.2 that the transition between two Media Segments occurs at the start of the added Media Segment plus timestampOffset.
Does the transition between two Media Segments occur exactly at the start of the added Media Segment plus timestampOffset?
Is timestampOffset alone sufficient to ensure that the transition between two Media Segments occurs on a video frame boundary when the audio and video essence start times are offset?
Illustration provided at https://docs.google.com/open?id=0Bz7s0dhnv-7HYjhadTktTGhrd2M
- I think you're being too optimistic about current-generation media pipelines ;)
Output framerates are almost always fixed once by the platform, either at app start (on some devices) or long before it, and never (in my experience) renegotiated based on observed video framerate on-the-fly. The most you can hope for, then, is a jitter buffer to smooth frame timing. Frankly, though, even proper double-buffering is something to celebrate on most CE devices.
In other words, I think that the extra control you're requesting wouldn't result in user-visible improvements to quality outside of a lab. That's not to say we shouldn't consider how we can improve things in the future. But we should be careful about how far we're willing to hold back the huge QoE win of adaptive, app-controlled fetching and seamless splicing in order to add features that won't have any user-visible benefit today.
- All of your timestampOffset problems go away if you use unmuxed media.
That shouldn't immediately shut down the discussion, but it is worth noting that the existence of this current limitation in API expressiveness is entirely due to the decision to use muxed content.
(On a personal note: the number of problems that we've seen at scale related specifically to interleaving/multiplexing has made me a passionate advocate for using unmuxed formats, and we don't even do splicing from multiple content sources on the client (yet). This isn't the place to go into details, but not only do you get more control over your content when it's unmultiplexed, you instantly slay an entire family of eldritch corner cases that will otherwise haunt your waking hours.)
Couple more data points in preparation for the discussion on Thursday.
> too optimistic about current-generation media pipelines ;)
I believe the suggested audio-video synchronization accuracy is supported in at least Blu-ray and UltraViolet, e.g. Section 2.4 of the CFF Media Format at http://www.uvvu.com/techspec-archive.php.
> In other words, I think that the extra control you're requesting wouldn't
> result in user-visible improvements to quality outside of a lab.
I believe that accurate synchronization (and hence splicing) is important for video editing applications and branching scenarios, e.g. director's cut vs. theatrical cut versions. For one thing, the creative community at large has in the past expressed strong interest in making sure that content is presented as authored, no more, no less. Furthermore, active content may be present at the splice in these scenarios, so an inaccurate splice may cause artifacts.
Perhaps more fundamentally, unless the API exposes the necessary information, authoring and playback implementations do not even have the option to consistently achieve accurate audio-video sync and to interoperate.
Created attachment 1295 [details]
Audio-video multiplex splice use case
Created attachment 1296 [details]
Proposed splice image
Here is a proposed image for explaining how implementations can handle the splice. Would this diagram along with some descriptive text be sufficient to satisfy your concerns?
Created attachment 1297 [details]
Proposed splice image + tweaks
Nice. Attached are suggested tweaks.
I think it is sufficient to show (a) that the audio and video tracks are treated separately and (b) that the splice occurs exactly at the first video frame boundary in the second stream. I do not think it is necessary to depict audio splicing behavior, which I recommend be collected in an annex (see issue #19673).
I also think it would be good to capture in the prose both (a) and (b), in addition to the diagram.
Splices are resolved on a per-track basis. I did not change this for muxed content because I wanted the splicing behavior to be consistent between muxed and demuxed content. Given the additional text added in this change, I believe content providers can create content in such a way that they get the output they want without requiring special splicing behavior for muxed content.