Meeting minutes
Alicia: Open questions about text tracks in MSE
… I assume you're familiar with MSE, but not necessarily with text track formats
… Out-of-band formats like SRT and WebVTT, or in-band formats carried in the media container, such as MP4, WebM, or Matroska
… I'll introduce WebVTT and the features that make implementation of in-band tracks tricky
… Challenges and open questions on implementation of text tracks in MSE
WebVTT
Alicia: This has been supported in browsers for a long time. It's a reasonable first target. We support it as an out-of-band text track format. We could also support it in-band when you call appendBuffer()
… The syntax is based on SRT. There are cues with start and end timestamps and content, which can include markup for styling
… Cues can have settings, which allow customisation such as the cue's position or alignment
… Once the cues are loaded, there are APIs to retrieve the cues and control them programmatically
… Comment blocks are useful when authoring text tracks
… WebVTT documents can contain stylesheets, which must come before any cues in the file
… Regions allow you to define specific portions of the video where cues will appear
… WebVTT allows cues to overlap in time
… An example: closed captions and the textual representation of sound effects, both of which can happen at the same time
… You can have delayed parts within a cue, using timestamp tags in angle brackets
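A minimal sketch of the features just described, using a hypothetical captions document loaded out of band via a <track> element and inspected through the TextTrack API; the cue text, region settings, and timings are illustrative:

```ts
// Illustrative WebVTT document: header, STYLE and REGION blocks, a NOTE comment,
// two overlapping cues, and a cue with delayed parts (timestamp tags).
const vtt = `WEBVTT

STYLE
::cue { color: yellow }

REGION
id:lowerThird
width:80%
lines:2
regionanchor:50%,100%
viewportanchor:50%,90%

NOTE Authoring comment: the next two cues overlap in time.

00:00:01.000 --> 00:00:05.000 region:lowerThird
- Where are we going?

00:00:03.000 --> 00:00:06.000
[door slams]

00:00:07.000 --> 00:00:10.000
Delayed <00:00:08.000>parts <00:00:09.000>appear progressively.
`;

// Load it out of band via a <track> element, then inspect the parsed cues.
const video = document.querySelector('video')!;
const track = document.createElement('track');
track.kind = 'captions';
track.src = URL.createObjectURL(new Blob([vtt], { type: 'text/vtt' }));
track.default = true;
video.appendChild(track);

track.addEventListener('load', () => {
  const textTrack = track.track;                 // TextTrack API
  textTrack.mode = 'showing';
  for (const cue of Array.from(textTrack.cues ?? [])) {
    console.log(cue.startTime, cue.endTime, (cue as VTTCue).text);
  }
  textTrack.oncuechange = () =>
    console.log('active cues:', textTrack.activeCues?.length);
});
```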
… Now let's look at in-band WebVTT, when you put it into a container format. Two formats: ISO BMFF, which I'll call MP4, and WebM / Matroska. WebM is a subset of Matroska
… I'm not aware of any representation in MP2TS
ISO BMFF
Alicia: For ISO BMFF, we have two specs. WebVTT is in Part 30, along with TTML
… In the init segment, in the moov box, there's a WebVTT sample entry carrying the codec configuration
… This has two boxes inside: one for the file header and stylesheets, then (optionally) the source label box, a URI that uniquely identifies the WebVTT document
… For media segments, the timing of the cues is handled by the container. Cues are handled like regular video frames. The difference is that WebVTT allows overlapping cues, but MP4 isn't normally meant to be used that way, so the cues are split into non-overlapping frames
… Two types of frames: gaps, which represent the absence of a cue, and non-gaps, where you have VTT cue boxes and VTT additional text boxes
… In the VTT cue box, you have an optional source ID box, which together with the source label box allows a cue to be uniquely identified
… If you mix and match different WebVTT documents that have been muxed into MP4 you can still uniquely identify the cues
… A cue time box is used for cues with delayed parts. It carries the original start time of the cue, used as a reference to compute the times of the delayed parts. If there's an edit list, the delayed parts still work
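A rough sketch of walking the boxes inside a single WebVTT sample, using the four-character codes from ISO/IEC 14496-30 ('vtte' for gap samples, 'vttc' for cue samples, with 'vsid', 'ctim' and 'payl' nested inside); the parsing helpers are illustrative, not a full MP4 demuxer:

```ts
// A WebVTT sample is either one 'vtte' box (a gap: nothing on screen) or a
// sequence of 'vttc' boxes (cues), optionally followed by 'vtta' boxes.
interface VttSampleBox { type: string; payload: Uint8Array; }

function parseWebVTTBoxes(data: Uint8Array): VttSampleBox[] {
  const view = new DataView(data.buffer, data.byteOffset, data.byteLength);
  const boxes: VttSampleBox[] = [];
  let offset = 0;
  while (offset + 8 <= data.byteLength) {
    const size = view.getUint32(offset);                     // big-endian box size
    const type = String.fromCharCode(...data.subarray(offset + 4, offset + 8));
    if (size < 8 || offset + size > data.byteLength) break;  // malformed box; stop
    boxes.push({ type, payload: data.subarray(offset + 8, offset + size) });
    offset += size;
  }
  return boxes;
}

// Inside a 'vttc' box the same layout nests again: an optional 'vsid'
// (source ID, pairing with 'vlab' in the sample entry), an optional 'ctim'
// (original cue start time, used to resolve delayed parts), and the 'payl'
// box carrying the cue text.
function cuePayload(vttcBody: Uint8Array): string | null {
  const payl = parseWebVTTBoxes(vttcBody).find(b => b.type === 'payl');
  return payl ? new TextDecoder().decode(payl.payload) : null;
}
```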
WebM
Alicia: There are two kinds of representation. One dates from when WebVTT was less mature, but it has adoption, e.g., in FFmpeg
… As a consequence of being early, it doesn't support the file header, so we can't include stylesheets, and it doesn't support delayed parts either
… The later draft, from 2018, for Matroska, addresses both of those problems. Delayed parts are defined as an offset from the start of the frame
… There are commonalities and differences between the MP4 and WebM representations. Timing is handled by both containers, but in WebM gaps aren't explicitly encoded.
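As a concrete illustration of that timing difference (function and variable names are illustrative): in the 2018 Matroska draft a delayed part is an offset from the start of the frame, while in MP4 the ctim box carries the cue's original start time as the reference:

```ts
// Matroska (2018 draft): delayed part time = frame start + stored offset.
function delayedPartTimeMatroska(frameStartSec: number, offsetSec: number): number {
  return frameStartSec + offsetSec;
}

// MP4 (ISO/IEC 14496-30): the timestamp tag keeps its original absolute value;
// the ctim box gives the cue's original start, so the delay within the cue is
// recoverable even if an edit list has shifted the sample on the timeline.
function delayedPartDelayMp4(originalCueStartSec: number, timestampTagSec: number): number {
  return timestampTagSec - originalCueStartSec;
}
```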
MSE
Alicia: Several questions with MSE and text tracks, and other related topics
… How many coded frames is a WebVTT cue?
… Should it depend on the container format, be an implementation detail, or something else?
… The answer touches on the other questions
… Next question is about gaps and sparse streams
… Is an empty cue box an MSE coded frame? Answer could depend on the previous question
… Other formats work differently. For example, there's 3GPP timed text which is commonly used in MP4, where gaps are encoded as cues with empty text
… If a browser wanted to support 3GPP timed text in MP4 (not unreasonable), could those gaps be cues?
… There's also the question of container formats. MP4 makes it easier to encode gaps than not to. In Matroska that's not a problem, and implementations don't encode them. Is that a problem for MSE? It causes some difficulties
… Gaps are also useful for audio and video. An audio gap is an intentionally silent section, and for video no new frame is played
… There are some use cases for MSE where gaps can be useful. We've talked about those before in MEIG meetings
… One is live playback, where you have audio and video in separate SourceBuffers. For live streams you want to prioritise getting the latest information
… If you can't download the video in time, but you have the audio, you could continue playing the audio alone. That's not covered in the MSE spec
… Another use case is ad insertion where you only have either audio or video, so you transition from audio+video to audio-only or video-only, and back. Gaps could also work in this case
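A sketch of the live case with separate audio and video SourceBuffers; the MIME strings and thresholds are illustrative, and continuing playback across such a gap is precisely what the current MSE spec does not define:

```ts
const mediaSource = new MediaSource();
const video = document.querySelector('video')!;
video.src = URL.createObjectURL(mediaSource);

mediaSource.addEventListener('sourceopen', () => {
  const audioBuf = mediaSource.addSourceBuffer('audio/mp4; codecs="mp4a.40.2"');
  const videoBuf = mediaSource.addSourceBuffer('video/mp4; codecs="avc1.64001f"');
  // ... appendBuffer() audio and video segments as they are downloaded ...

  // Periodically check whether audio is buffered ahead of video at the
  // playback position. Today, playback stalls on the missing video range;
  // explicit gaps would let the audio continue alone.
  setInterval(() => {
    const t = video.currentTime;
    const bufferedAhead = (buf: SourceBuffer) => {
      for (let i = 0; i < buf.buffered.length; i++) {
        if (t >= buf.buffered.start(i) && t < buf.buffered.end(i)) {
          return buf.buffered.end(i) - t;   // seconds buffered past currentTime
        }
      }
      return 0;
    };
    if (bufferedAhead(audioBuf) > 1 && bufferedAhead(videoBuf) === 0) {
      console.log('video missing at the live edge; only audio is available');
    }
  }, 500);
});
```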
… There's also the problem of a SourceBuffer with only a text track. Buffered ranges are computed from audio and video only. The algorithms assume text streams are sparse and have unannounced gaps
… With the current algorithms, the buffered range never grows, so playback cannot start
… In many cases, if you haven't buffered text, you don't want to play. Without explicit gaps you can't express that, or only if there's also audio and video in the stream
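A hedged sketch of the text-only case; the 'wvtt'-in-MP4 codec string and segment URL are hypothetical, and the point is that the element's buffered ranges, being computed from audio and video only, never grow:

```ts
const mediaSource = new MediaSource();
const video = document.querySelector('video')!;
video.src = URL.createObjectURL(mediaSource);

mediaSource.addEventListener('sourceopen', async () => {
  // Hypothetical: a SourceBuffer carrying only WebVTT-in-MP4 ('wvtt').
  const textBuf = mediaSource.addSourceBuffer('application/mp4; codecs="wvtt"');
  const segment = await (await fetch('captions.mp4')).arrayBuffer();  // illustrative URL
  textBuf.appendBuffer(segment);
  textBuf.addEventListener('updateend', () => {
    // The text SourceBuffer reports its own buffered range...
    console.log('text buffered ranges:', textBuf.buffered.length);
    // ...but, per the discussion above, the element's buffered ranges are
    // derived from audio and video only, so they stay empty and playback
    // never becomes possible.
    console.log('element buffered ranges:', video.buffered.length);
  });
});
```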
… Now, consider cues that go across segment boundaries
… If we're splitting an MP4 file for adaptive streaming, using the source label and source ID we can identify copies of the same cue in different fragments
… The MSE spec doesn't specify extending cues at the moment, so it doesn't describe how this should be handled, or whether it's mandatory or a quality-of-implementation issue
… And how should it be presented to the user? Update the cue and emit a cuechange event? The spec should clarify
… In WebM, the MSE bytestream spec doesn't describe it at all
… Are the representations that WebM and Matroska give us good enough?
… We could advocate changes in the IETF
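One possible, non-normative processing model for cues that cross fragment boundaries, keyed on the source label from the sample entry plus the cue source ID (names and fallback key are illustrative); whether extending the cue should fire cuechange is one of the open questions above:

```ts
interface ParsedCue {
  sourceLabel: string;   // from the 'vlab' box in the sample entry
  sourceId?: number;     // from the 'vsid' box inside the 'vttc' box
  startTime: number;
  endTime: number;
  text: string;
}

const knownCues = new Map<string, VTTCue>();

// Called for each cue recovered from an appended media segment.
function addOrExtendCue(textTrack: TextTrack, cue: ParsedCue): void {
  const key = cue.sourceId !== undefined
    ? `${cue.sourceLabel}#${cue.sourceId}`
    : `${cue.startTime}-${cue.text}`;      // fallback when no IDs are present

  const existing = knownCues.get(key);
  if (existing) {
    // Same cue continued in the next fragment: extend it rather than
    // adding a duplicate copy to the track.
    existing.endTime = Math.max(existing.endTime, cue.endTime);
    return;
  }
  const vttCue = new VTTCue(cue.startTime, cue.endTime, cue.text);
  knownCues.set(key, vttCue);
  textTrack.addCue(vttCue);
}
```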
… Next, embedded text tracks; common examples are CEA-608 and CEA-708. Generally the problem is that we don't know in advance that we have them. They appear inside SEI messages
… There's also ID3 timed metadata, which has a similar problem: it arrives in interleaved chunks between fragments
… It's been discussed before, but there's no support in MSE. Interesting to keep in mind
… Those are the questions I've identified so far that would be interesting to discuss as we try to mature the support for timed text tracks in MSE
Francois: It resonates with past experience in the Multi-Device Timing CG, where we worked on synchronising things on a timeline
… We realised there are scenarios where you want play/pause/seek, but no audio or video. The only way to do that now is to create silent audio and attach a text track to it
… You can't just play a text track
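For illustration, that workaround looks roughly like this: generate a short silent audio file and attach the text track to it, so play/pause/seek drives the cue timeline (the track file name and duration are hypothetical; this is an application-level hack, not a spec feature):

```ts
// Build an N-second silent mono 16-bit PCM WAV as a Blob.
function silentWav(seconds: number, sampleRate = 8000): Blob {
  const numSamples = Math.round(seconds * sampleRate);
  const dataSize = numSamples * 2;            // 16-bit mono samples
  const buf = new ArrayBuffer(44 + dataSize); // 44-byte header + zeroed PCM data
  const v = new DataView(buf);
  const writeStr = (off: number, s: string) => {
    for (let i = 0; i < s.length; i++) v.setUint8(off + i, s.charCodeAt(i));
  };
  writeStr(0, 'RIFF'); v.setUint32(4, 36 + dataSize, true); writeStr(8, 'WAVE');
  writeStr(12, 'fmt '); v.setUint32(16, 16, true); v.setUint16(20, 1, true); v.setUint16(22, 1, true);
  v.setUint32(24, sampleRate, true); v.setUint32(28, sampleRate * 2, true);
  v.setUint16(32, 2, true); v.setUint16(34, 16, true);
  writeStr(36, 'data'); v.setUint32(40, dataSize, true);
  return new Blob([buf], { type: 'audio/wav' });
}

// Attach the silent audio and a WebVTT track so the media timeline exists.
const audio = document.createElement('audio');
audio.src = URL.createObjectURL(silentWav(60));
const track = document.createElement('track');
track.kind = 'subtitles';
track.src = 'captions.vtt';   // hypothetical out-of-band track
track.default = true;
audio.appendChild(track);
document.body.appendChild(audio);
```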
Alicia: I'm working on the WebKit implementation. It's not working yet
… (media containers with only text track content)
Francois: So that part is under-specified in MSE. No-one was implementing it at the time
Alicia: I work on the GStreamer port. Apple also has support, in the technical preview
cpn: For the emsg box, we were working with DASH-IF. They had an abstract processing model for these event message tracks.
… There didn't really seem to be a push from the media industry to get that into browsers.
Alicia: Also see timed text in CMAF spec from AOM
cpn: I think the processing was very similar to the processing you describe for WebVTT cues.
… When you create your media segments for downloading, the events are duplicated across segments, and identifiers help to relate them.
… We never went as far as defining the processing in MSE. That was the initial plan though. Just not enough push from other people at the time. That was a few years ago. I don't know if the situation has changed. There may be more interest today.
… I don't know the standardization status in MPEG either.
… I can certainly follow up on that.
… My suggestion is to take this to the Media WG; there you can assume familiarity with WebVTT. Let's figure out if there's interest to do it!