Revisiting in-band text tracks in MediaSource Extensions

This page contains a video recording of the presentation made during Breakouts Day 2025, along with a transcript. Video captions and transcript were automatically generated and may not properly translate the speaker's speech. Please use GitHub to suggest corrections.

Table of contents

  1. Video
  2. Transcript


Video

Transcript

Alicia Boya Garcia: Hello, welcome to this W3C Breakout session. The goal here is to revisit the open questions regarding in-band text tracks in Media Source Extensions. My name is Alicia Boya Garcia. I work for Igalia, and I'm part of the W3C Media and Entertainment Interest Group. First, I would like to remind all the participants of the participation policies of the breakouts, this year in particular the Code of Ethics and Professional Conduct and the anti-trust and competition guidance. The URLs for them are on the screen and in the slides that are already available in the GitHub issue, as well as on the page of the W3C breakout session.

Alicia Boya Garcia: For this presentation I'm going to assume you have some familiarity with MSE, also known as Media Source Extensions, but I'm not going to assume specific knowledge about text track formats. To clarify: when in this presentation I talk about out-of-band text tracks, I am talking about formats that are purely for text tracks. They are often textual, and examples could be the SRT format or the WebVTT format. When I talk about in-band text tracks, I'm talking about text tracks that have been put in a media container file like MP4, WebM or Matroska. Because of this, the same file can also contain video and audio, but it doesn't have to. In fact, we have the example of Matroska, where this is common enough that it warrants its own extension: .mks files mean that this is a container file used specifically to hold text tracks, but no audio and video.

Alicia Boya Garcia: The agenda for the meeting is split into 2 parts. In the informative part I will be giving an introduction to WebVTT, in particular the features that make the implementation of in-band tracks in MSE, and in containers in general, a bit more tricky. And I'll also explain how those representations inside container formats work.

Alicia Boya Garcia: Then I'll go through a lot of challenges and open questions regarding the implementation of text tracks in MSE, many of which are consequences of container representations, but not all.

Alicia Boya Garcia: So let's start with WebVTT. WebVTT is the Web Video Text Tracks format. It has been supported in browsers for a long time, and it's a reasonable first target, because it's been quite successful in adoption. So it would make sense that, if we already support it in regular playback, and we also support it as an out-of-band text track format, including when the video is MSE, we could also support it in-band, when it's part of the data being downloaded and pushed into appendBuffer.

Alicia Boya Garcia: The basic syntax of WebVTT is something like this. It's intentionally inspired by the SubRip format, or SRT. We have a file header that identifies that this is indeed WebVTT, and then we have a series of so-called cues, which have a start timestamp, an end timestamp, and content, which can be not only text, but can also include markup for styling.
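
A minimal WebVTT file along these lines could look like this (an illustrative sketch, not taken from the slides):

    WEBVTT

    00:00:01.000 --> 00:00:04.000
    Good evening.

    00:00:05.000 --> 00:00:08.000
    Is <i>anyone</i> there?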

Alicia Boya Garcia: Each cue can also have settings, and here we have an example where we have customized the position of the cue and the alignment. Cues can also have IDs. These IDs can be used for styling, and they are also accessible to JavaScript.
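
A cue with an ID (the line before the timestamps) and a settings line (after the end timestamp) might look like this (again an illustrative sketch):

    WEBVTT

    intro-1
    00:00:01.000 --> 00:00:04.000 position:10% align:start
    Good evening.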

Alicia Boya Garcia: This is a very important difference with video, in that when you load text tracks in web APIs, you are able to access the list of cues and their contents, and you have access to certain metadata like these IDs, so they can be retrieved and controlled programmatically. WebVTT also supports comment blocks. These are not rendered, but they can be very important for people authoring text tracks, so container formats will generally try to preserve them.
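
As a sketch of that programmatic access, assuming a page with a video element that already has a loaded text track:

    // Reading cue metadata from a <video> element's text tracks (TypeScript sketch).
    const video = document.querySelector('video') as HTMLVideoElement;
    const track = video.textTracks[0];
    track.mode = 'showing'; // make sure cues are loaded and rendered
    for (const cue of Array.from(track.cues ?? [])) {
      const vttCue = cue as VTTCue;
      console.log(vttCue.id, vttCue.startTime, vttCue.endTime, vttCue.text);
    }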

Alicia Boya Garcia: WebVTT documents can contain a style sheet. The style sheet is global for the entire document, and must be placed at the beginning of the file before any cues.
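
A STYLE block, placed before the first cue, could look like this (illustrative sketch):

    WEBVTT

    STYLE
    ::cue {
      color: yellow;
      background-color: rgba(0, 0, 0, 0.8);
    }

    00:00:01.000 --> 00:00:04.000
    Good evening.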

Alicia Boya Garcia: Another feature that is structured similarly is regions. Regions allow you to define specific portions, or regions, of the video where cues will appear.
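
A REGION definition block, also placed before the cues, and a cue assigned to it might look like this (an illustrative sketch using the region settings defined in the WebVTT spec):

    WEBVTT

    REGION
    id:speaker
    width:40%
    lines:3
    regionanchor:0%,100%
    viewportanchor:10%,90%
    scroll:up

    00:00:01.000 --> 00:00:04.000 region:speaker
    Good evening.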

Alicia Boya Garcia: WebVTT allows timestamps... allows cues to overlap in time. So here we have an example where, imagine we are doing closed captions, so we don't have only the dialogue, but also textual representations of sound effects, and those often happen simultaneously. So we need to be able to show 2 cues at the same time. The format natively supports this. As we'll see, container representations can be more complicated.
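
Two overlapping cues of that kind might look like this (illustrative sketch):

    WEBVTT

    00:00:10.000 --> 00:00:14.000
    [door creaks open]

    00:00:11.000 --> 00:00:13.000
    Who's there?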

Alicia Boya Garcia: And maybe the other format... the other feature, is that inside a cue we can have delayed parts. For instance, in the first cue here, "good evening" and "is anyone there" form part of the same cue. However, the second part is only shown at the specific timestamp between angle brackets.
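
A cue with a delayed part uses an inner timestamp between angle brackets, something like this (illustrative sketch):

    WEBVTT

    00:00:01.000 --> 00:00:08.000
    Good evening. <00:00:05.000>Is anyone there?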

Alicia Boya Garcia: That's it for the features of WebVTT we are going to consider, the ones that are particularly of interest for containers, and therefore of consequence to MSE. If anyone has any questions at this point, you are very welcome to ask.

Alicia Boya Garcia: Seeing no questions, I will continue. Now let's look at in-band WebVTT. That's when we place it inside a container format, and at the moment I am aware of 2 formats that support this. We have the ISO Base Media File Format, from which MP4 is derived, and which I'm just going to call MP4 for the rest of the presentation to make it easier to pronounce, even though it's technically not correct.

Alicia Boya Garcia: And then there is also WebM and Matroska. Similarly to the ISO Base Media File Format and MP4, WebM is derived from Matroska, but this time it is the opposite relationship: WebM is a subset of Matroska. We'll get to them. Unfortunately, I'm not aware of any representation for MPEG-2 TS.

Alicia Boya Garcia: Let's look at how WebVTT in MP4 works. We have things split into 2 different specs. On one hand we have the base spec for the ISO Base Media File Format, but the WebVTT encoding is specified in a different part, Part 30, along with TTML.

Alicia Boya Garcia: The codec string should contain 'wvtt' to represent that this file will contain a WebVTT track, and in MSE this is the string that you would also use in addSourceBuffer.
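
As a sketch of what that could look like, assuming a user agent that supported the 'wvtt' codec in MSE (which is precisely the open question of this talk), and with hypothetical segment URLs init.mp4 and seg1.mp4:

    // Hypothetical: feeding MP4-encapsulated WebVTT to MSE (TypeScript sketch).
    const video = document.querySelector('video') as HTMLVideoElement;
    const mediaSource = new MediaSource();
    video.src = URL.createObjectURL(mediaSource);

    mediaSource.addEventListener('sourceopen', async () => {
      const mime = 'video/mp4; codecs="wvtt"';
      if (!MediaSource.isTypeSupported(mime)) {
        console.warn('In-band WebVTT in MP4 is not supported by this user agent');
        return;
      }
      const sb = mediaSource.addSourceBuffer(mime);
      for (const url of ['init.mp4', 'seg1.mp4']) {   // hypothetical segments
        const data = await (await fetch(url)).arrayBuffer();
        sb.appendBuffer(data);
        await new Promise(resolve =>
          sb.addEventListener('updateend', resolve, { once: true }));
      }
    });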

Alicia Boya Garcia: For the initialization segment, inside the moov box, we get a WebVTTSampleEntry, which again, despite its name, is more like codec configuration, and it contains 2 boxes inside. The first one, WebVTTConfigurationBox, contains the file header: this is the first line with WEBVTT and any style sheets, regions, and any other similar features that get introduced in the future, in textual format. Then, optionally, but as we'll see importantly, we have SourceLabelBox. This is meant to be an opaque URI that uniquely identifies this WebVTT document. The idea is that 2 different WebVTT documents will have different source labels, and the same WebVTT document in different MP4 files, if we are cutting it into pieces or doing whatever other manipulation, would still have the same source label.

Alicia Boya Garcia: For the media segment, the timing of the cues is handled entirely by the container, so the timestamps we saw in the textual WebVTT format are not stored inside the frames, or samples. Frame and sample mean the same thing in the context of MP4. Instead, timing is handled like it is for regular audio and video frames.

Alicia Boya Garcia: However, the biggest difference comes from the fact that WebVTT allows several overlapping frames... several overlapping cues, but MP4 is not meant to be used that way. Normally, the cues are split into continuous, non-overlapping frames. The frames have ISO BMFF boxes inside, and we'll see what they contain.

Alicia Boya Garcia: We'll have 2 types of frames. We'll have gaps (VTTEmptyCueBox), which represent the absence of a cue for the duration of that frame, and we have non-gaps, where we will have one or more VTTCueBoxes and, optionally, VTT additional text boxes. These are currently used for notes, for comments, and they are inserted alongside the cue after which they appear. Now let's look inside VTTCueBox.

Alicia Boya Garcia: Remember, we can have multiple of these in a frame. For each one of them we have, optionally, a source ID box, and together with the source label this allows us to uniquely identify a cue. Because cues are split into continuous, non-overlapping frames, the same cue may appear in multiple frames, and therefore we need a way to tell that this cue box and that cue box actually represent the same cue, and not 2 different cues that happen to have the same contents. This is what the source ID box allows us to do. But the ID of a cue is not just the source ID: it's the combination of the source label and the source ID, so that if we mix and match different WebVTT documents that have been muxed to MP4, we don't accidentally treat 2 different cues from 2 different documents as the same just because they happen to have the same source ID.

Alicia Boya Garcia: And other than that, we mostly have different boxes with the different parts of a WebVTT cue. So the payload box contains the text of the cue with any markup; the settings box contains the line of settings that we saw in an example before, if there is any. And CueTimeBox is used in the case of having cues with delayed parts. The way it works in MP4 is that you write there the original start time of the cue, and the decoder is supposed to use this time as a reference to compute the actual times of the delayed parts, so that if there is an edit list, or the timestamps of the frames are changed in any other way, delayed parts still work.
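
Putting those pieces together, one non-gap frame might be sketched roughly like this (an informal outline based on the box names above; the exact definitions are in ISO/IEC 14496-30):

    WebVTT sample (one MP4 frame / one time interval)
      VTTCueBox                -- first cue active in this interval
        CueSourceIDBox         -- optional; with SourceLabelBox, uniquely identifies the cue
        CueTimeBox             -- optional; original start time, used for delayed parts
        cue settings box       -- optional; the settings line (position, alignment, ...)
        cue payload box        -- the cue text, including markup
      VTTCueBox                -- a second cue overlapping the same interval, if any
      VTT additional text box  -- optional; e.g. a NOTE/comment block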

Alicia Boya Garcia: Now for WebM, the first problem we find is that there are 2 competing representations. There is one from the WebM project that was proposed around 2012, at a point where WebVTT was still relatively immature, but it still got adoption by software. This is the one currently supported by FFmpeg, for instance. As a consequence of coming so early, it didn't account for the file header in particular. This means that we cannot support CSS or regions. It also doesn't account for what to do with cues with delayed parts.

Alicia Boya Garcia: A later standard came from Matroska, which is the parent file format of WebM, and the first draft I found is from 2018. It actually addresses those 2 problems of the other representation. So the header goes in the CodecPrivate, which is similar to what we saw in MP4, and it specifies how cues with delayed parts should be encoded. It uses a different method: instead of writing the original timestamp, for the time of the delayed part you write the offset from the start of the frame.

Alicia Boya Garcia: But that's it, just a reasonable way of doing the same thing. Now, here are the important differences between the WebM representations and the MP4 representation; all of these are common to both WebM representations.

Alicia Boya Garcia: Here, one cue equals one frame. Overlapping cues are encoded as overlapping frames. Matroska uses a different structure than MP4, and this is easily handled, at least by the container format. For similar reasons, gaps are not explicitly encoded.

Alicia Boya Garcia: Also, because we don't have any equivalent of cue IDs, there is no provision for how to join cues across split segment boundaries.

Alicia Boya Garcia: If you use, for instance, MKVToolNix, which is a typical piece of software for splitting Matroska files, it will just try to cut in a way that avoids cutting any cues.

Alicia Boya Garcia: And that's it for in-band representations of WebVTT. Are there any questions so far? It's a lot to tackle, and I'm not expecting it to be easy to process this quickly.

Alicia Boya Garcia: For the rest of this talk I'm going to go through several questions related to MSE and text tracks, with WebVTT as an example, but also touching on other topics that are tangentially related.

Alicia Boya Garcia: The questions I'm asking that are a bit open-ended, and of particular interest, are both bold and highlighted in green.

Alicia Boya Garcia: First one: how many coded frames is a WebVTT cue? A coded frame in the MSE spec roughly corresponds to a frame in a container. But, as we have seen, different containers handle this differently for the same format, in this case WebVTT.

Alicia Boya Garcia: What should be the case for a WebVTT cue? Should we also have a one-to-one mapping, one MSE coded frame to one WebVTT cue, like WebM and Matroska? Should it depend on the container format? Should it be an implementation detail of the particular user agent? Or maybe something else: maybe having them overlap as a consequence is not something we want, and then we actually want something more like MP4, but for all formats. And how could that happen? There are many open questions here, and I don't expect easy answers. Any answer to this question is going to touch many other questions.

Alicia Boya Garcia: I'm going to continue. The next big question is about gaps and sparse streams. Let's consider the case of WebVTT in MP4: you might remember we saw a VTTEmptyCueBox. That's what we put in a frame for a period where there is no cue. Is that an MSE coded frame? The answer to that could be very dependent on the previous question.

Alicia Boya Garcia: And thinking more about gaps, we might also want to consider that there are other formats that work differently. For instance, looking at text track formats, there is 3GPP Timed Text, which is very commonly used in MP4, way more common than WebVTT, where gaps are actually encoded as cues with empty text.

Alicia Boya Garcia: But since the format is a bit simpler, this is an acceptable solution for that particular format. If a browser wanted to support 3GPP Timed Text in MP4, which might not be that unreasonable given that it's not that uncommon a format, could those gaps be cues? What would they be?

Alicia Boya Garcia: And of course we also have container formats.

Alicia Boya Garcia: MP4, due to its particular structure, makes it easier to encode gaps explicitly than not to, but in Matroska that's not a problem, and the representations that currently exist don't do it. Is that a problem for MSE? It definitely causes some difficulties.

Alicia Boya Garcia: And talking more about gaps: gaps are not just a useful concept for text, they also generalize to audio and video. An audio gap is an intentionally silent section because there is no data to be played, and a video gap is just that no new frame is played: either you would see the continuation of the last frame or some replacement image. And here I'm talking in the abstract, not about particular implementations.

Alicia Boya Garcia: Now, there are already some use cases for MSE where gaps could potentially be useful, and we have talked about them in previous W3C Media and Entertainment Interest Group meetings. One of them is live playback. For both of these, imagine you have audio and video in separate source buffers. For the first case, imagine that you are watching a sports event. In these kinds of events you often prioritize having the latest information over getting all the information. For a player application, it's desirable that if for some reason you are unable to download a chunk of video on time, but you have the audio, at least you can play the audio. This is something that could be implemented with gaps, but it is currently something that does not exist in the MSE spec. So I'm bringing this up because, if it's relevant for text, it might also be relevant for other types.

Alicia Boya Garcia: The other use case is, imagine you again have separate audio and video, and you want to insert an ad or any other kind of interlude where you only have one of the 2. You need to transition from video and audio into a segment that is only video, or potentially only audio, and then back. Gaps could also work for that potential case, in particular when tearing down the media resources is not desired by the user.

Alicia Boya Garcia: And then, more related to gaps, we have the little problem of a source buffer with only a text track. This is currently de facto unsupported, and it's just a consequence of this lack of support for explicit gaps.

Alicia Boya Garcia: Buffered ranges are computed only from video and audio, because the current MSE spec assumes that text streams will be sparse and will have unannounced gaps. To compensate for this, the buffered ranges algorithm doesn't include text tracks in the computation. As a consequence, if you have a source buffer that only has text, the buffered ranges never grow, and you have a source buffer that never buffers, so playback cannot start.
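
A sketch of that symptom, assuming hypothetically that a user agent accepted a SourceBuffer carrying only a WebVTT track (the MIME string is an assumption; no current user agent accepts it):

    // Hypothetical text-only SourceBuffer (TypeScript sketch): per the current spec,
    // the media element's buffered ranges are computed from audio and video only,
    // so with just a text SourceBuffer they stay empty and playback cannot start.
    const video = document.querySelector('video') as HTMLVideoElement;
    const ms = new MediaSource();
    video.src = URL.createObjectURL(ms);

    ms.addEventListener('sourceopen', () => {
      const textOnly = ms.addSourceBuffer('application/mp4; codecs="wvtt"');
      // ...append a text-only initialization segment and media segments here...
      console.log(video.buffered.length); // remains 0: nothing ever counts as buffered
    });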

Alicia Boya Garcia: We could talk about whether it could make sense to work around this by just not including source buffers that only contain text tracks in the computation. But even this could have some problems, because for many use cases, if you have not buffered text, you don't want to play. Imagine you don't understand the language, or you cannot hear it: you would want playback to wait until it actually gets the subtitles or captions, so it could actually make sense to block playback on them in many circumstances.

Alicia Boya Garcia: But because we don't have explicit gaps, we cannot do this at the moment, or at least we can only do it if we also have audio and video in the same stream to guide us.

Alicia Boya Garcia: Now, leaving aside gaps for a bit, let's consider cues that go across segment boundaries. Imagine we are splitting some MP4 file for adaptive streaming: we would be cutting the file into segments and, thanks to the source label and source ID, we can identify copies of a cue in different fragments. We can tell that this cue from the new fragment is the same as a cue from a previous fragment, and just extend the cue. However, the MSE spec does not specify this at the moment and, as a consequence, makes no mention of whether or how it should be handled, whether it's mandatory, or whether this is a quality-of-implementation issue.

Alicia Boya Garcia: And there is also a minor question of how that extension of the cue should be presented. I think the way that makes the most sense is to change the duration of the cue and/or the timestamps, and fire a cuechange event. But this is something the spec should clarify.
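
From script, such an update would presumably be observable along these lines (a sketch, assuming the user agent updates the existing VTTCue object in place):

    // Observing cue updates from JavaScript (TypeScript sketch).
    const video = document.querySelector('video') as HTMLVideoElement;
    const track = video.textTracks[0];
    track.addEventListener('cuechange', () => {
      for (const cue of Array.from(track.activeCues ?? [])) {
        const c = cue as VTTCue;
        // If the user agent extended the cue, endTime would reflect the new duration.
        console.log(`cue ${c.id}: ${c.startTime} -> ${c.endTime}`);
      }
    });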

Alicia Boya Garcia: In WebM we have these problems that we mentioned earlier. At the moment the MSE spec doesn't tackle them at all. We are not picking a side on whether we should be using the original WebM representation or the later Matroska one, which has fewer problems.

Alicia Boya Garcia: And it could be argued that maybe we should, but we could also discuss, possibly depending on conclusions from other discussions, whether the representations that WebM and Matroska give us are good enough: whether, for instance, we could want explicit gaps. These are changes that we could advocate for, because these are open formats. You can actually write to the IETF mailing list and, for instance, add optional features used in MSE, or, if they are found interesting or necessary for the spec, even require them, as we have done before.

Alicia Boya Garcia: Then there is a whole discussion about embedded text tracks. There are 2 that I'm giving here as examples, and they are also the common ones. One is CEA-608 and 708 captions. Generally, the problem with embedded text tracks is that we don't know we have them in advance; they are a bit of a surprise. In the case of CEA they appear inside the video stream, and even though this is not the only way to represent these types of tracks, it's often done for compatibility reasons.

Alicia Boya Garcia: There is also ID3 timed text, which has a similar problem. It's not structured as a track, but instead as interleaved chunks: between fragments in the case of MP4, or within the transport stream in the case of MPEG-2 TS. This has been discussed before, but support has not arrived in MSE. It might be interesting to keep these things in mind.

Alicia Boya Garcia: And that's it. That was a lot. I'm sure it was very overwhelming. But these are the problems and open questions that I have identified so far that could be interesting to discuss as we try to mature the support of timed text tracks in Media Source Extensions.

Alicia Boya Garcia: We can go back and discuss!