10:19:35 RRSAgent has joined #mse-text-tracks
10:19:40 logging to https://www.w3.org/2025/03/26-mse-text-tracks-irc
10:19:40 RRSAgent, do not leave
10:19:41 RRSAgent, this meeting spans midnight
10:19:41 RRSAgent, make logs public
10:19:42 Meeting: Revisiting in-band text tracks in MediaSource Extensions
10:19:42 Chair: ntrrgc
10:19:42 Agenda: https://github.com/w3c/breakouts-day-2025/issues/14
10:19:42 Zakim has joined #mse-text-tracks
10:19:43 Zakim, clear agenda
10:19:43 agenda cleared
10:19:43 Zakim, agenda+ Pick a scribe
10:19:44 agendum 1 added
10:19:44 Zakim, agenda+ Reminders: code of conduct, health policies, recorded session policy
10:19:44 agendum 2 added
10:19:44 Zakim, agenda+ Goal of this session
10:19:45 agendum 3 added
10:19:45 Zakim, agenda+ Discussion
10:19:45 agendum 4 added
10:19:45 Zakim, agenda+ Next steps / where discussion continues
10:19:46 agendum 5 added
10:19:46 Zakim, agenda+ Adjourn / Use IRC command: Zakim, end meeting
10:19:46 agendum 6 added
10:19:46 breakout-bot has left #mse-text-tracks
10:27:05 tidoust has joined #mse-text-tracks
19:32:02 alicia has joined #mse-text-tracks
20:57:29 cpn has joined #mse-text-tracks
21:11:24 present+ Alicia_Boya_Garcia, Chris_Needham, Francois_Daoust
21:11:40 scribe+ cpn
21:12:22 Alicia: Open questions about text tracks in MSE
21:13:10 ... I assume you're familiar with MSE, but not necessarily with text track formats
21:13:48 ... Out-of-band formats like SRT and WebVTT, or in-band formats in the media container such as MP4, WebM, Matroska
21:14:28 ... I'll introduce WebVTT and the features that make implementation of in-band tracks tricky
21:14:45 ... Challenges and open questions on implementation of text tracks in MSE
21:14:50 Topic: WebVTT
21:15:31 Alicia: This has been supported in browsers for a long time. It's a reasonable first target. We support it as an out-of-band text track format. We could also support it in-band when you call appendBuffer()
21:16:09 ... The syntax is based on SRT.
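[For reference, a minimal WebVTT file sketching the syntax discussed in this session: a header, an optional STYLE block and REGION definition, a comment block, cue settings, inline markup, overlapping cues, and timestamp tags for delayed parts. The contents are illustrative, not taken from the session.]

```
WEBVTT

STYLE
::cue { color: yellow }

NOTE This comment block is ignored by players.

REGION
id:bottom
lines:2

1
00:00:01.000 --> 00:00:04.000 align:center line:90%
Hello <i>world</i>

2
00:00:02.000 --> 00:00:05.000 region:bottom
[door slams]

3
00:00:06.000 --> 00:00:09.000
These words appear <00:00:07.000>one <00:00:08.000>by one
```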
There are cues with start and end timestamps and content, which can include markup for styling
21:16:26 ... Cues can have settings, to allow customisation of the cue, such as its position or alignment
21:17:05 ... When you load the cues, you have APIs to retrieve the cues and control them programmatically
21:17:15 ... Comment blocks are useful when authoring text tracks
21:17:35 ... WebVTT documents can contain stylesheets, which must come before any cues in the file
21:17:50 ... Regions allow you to define specific portions of the video where cues will appear
21:18:00 ... WebVTT allows cues to overlap in time
21:18:32 ... An example: closed captions and the textual representation of sound effects, both of which can happen at the same time
21:19:09 ... You can have delayed parts in a cue, using angle brackets
21:20:39 ... Now let's look at in-band WebVTT, when you put it into a container format. Two formats: ISO BMFF, which I'll call MP4, and WebM / Matroska. WebM is a subset of Matroska
21:20:57 ... I'm not aware of any representation in MP2TS
21:21:22 ... For ISO BMFF, we have two specs. WebVTT is in Part 30, along with TTML
21:22:01 ... In the init segment, in the moov box, there's a WebVTT sample entry with the codec configuration
21:22:53 ... This has two boxes inside: one for the file header and stylesheets, then (optionally) the source label box, a URI that uniquely identifies the WebVTT document
21:24:19 ... For media segments, the timing of the cues is handled by the container. Cues are handled like regular video frames. The difference is that WebVTT allows overlapping cues, but MP4 isn't normally meant to be used that way, so the cues are split into non-overlapping frames
21:24:51 ... Two types of frames: gaps, which represent the absence of a cue, and non-gaps, where you have VTT cue boxes and VTT additional boxes
21:25:43 ... In the VTT cue box, you have an optional source id box, which together with the source label box allows a cue to be uniquely identified
21:26:50 ...
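[The split of overlapping cues into non-overlapping frames mentioned at 21:24:19 can be sketched in Python. This is a simplified model, not the Part 30 algorithm: each output interval carries the set of cues active in it, and an empty set corresponds to a gap sample.]

```python
def split_into_samples(cues):
    """Split possibly-overlapping cues into non-overlapping samples.

    cues: list of (start, end, payload) tuples, times in seconds.
    Returns a list of (start, end, payloads), where payloads is the
    tuple of cues active in that interval; an empty tuple models a
    gap (an empty cue sample) between cues.
    """
    # Every cue start/end is a potential sample boundary.
    boundaries = sorted({t for s, e, _ in cues for t in (s, e)})
    samples = []
    for start, end in zip(boundaries, boundaries[1:]):
        # A cue is active in the interval if it fully covers it.
        active = tuple(p for s, e, p in cues if s <= start and end <= e)
        samples.append((start, end, active))
    return samples
```

For two overlapping cues (a caption over 0-4s and a sound effect over 2-3s) this yields three samples, and a later cue at 6-8s produces an explicit gap sample at 4-6s.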
If you mix and match different WebVTT documents that have been muxed into MP4, you can still uniquely identify the cues
21:27:45 ... A cue time box is used for cues with delayed parts. You write the original start time of the cue, used as a reference to compute the time of the delayed parts. If there's an edit list, the delayed parts still work
21:27:56 subtopic: WebM
21:28:24 Alicia: There are two kinds of representations. One is from when it was less mature, but it has adoption, e.g., in ffmpeg
21:28:50 ... As a consequence of being early, it doesn't support the file header, so we can't include stylesheets, and there are no delayed parts
21:29:46 ... The later draft, from 2018, on Matroska, addresses both those problems. Delayed parts are defined as the offset from the start of the frame
21:30:53 ... Commonalities and differences between the MP4 and WebM representations: timing is handled by both containers, but gaps aren't explicitly encoded
21:31:16 Topic: MSE
21:31:46 i/... For ISO BMFF/subtopic: ISO BMFF/
21:32:15 Alicia: Several questions with MSE and text tracks, and other related topics
21:32:51 ... How many coded frames is a WebVTT cue?
21:33:37 ... Should it depend on the container format, be an implementation detail, or something else?
21:34:15 ... The answer touches on the other questions
21:34:42 ... The next question is about gaps and sparse streams
21:35:02 ... Is an empty cue box an MSE coded frame? The answer could depend on the previous question
21:35:33 ... Other formats work differently. For example, there's 3GPP timed text, which is commonly used in MP4, where gaps are encoded as cues with empty text
21:36:05 ... If a browser wanted to support 3GPP timed text in MP4 (not unreasonable), could those gaps be cues?
21:36:44 ... Also container formats: MP4 makes it easier to encode gaps than not to. In Matroska, that's not a problem, and implementations don't do it. Is that a problem for MSE? It causes some difficulties
21:37:17 ... Gaps are also useful for audio and video.
An audio gap is an intentionally silent section, and for video no new frame is played
21:37:44 ... There are some use cases for MSE where gaps can be useful. We've talked about those before in MEIG meetings
21:38:22 ... One is live playback, where you have audio and video in separate SourceBuffers. For live streams you want to prioritise getting the latest information
21:38:52 ... If you can't download the video in time, but you have the audio, you could continue playing the audio. This is not covered in the MSE spec
21:39:30 ... Another use case: if you want to insert an ad where you only have either audio or video, so you transition from audio+video to only audio or only video, and back. Gaps could also work in this case
21:40:28 ... There's also the problem of a buffer with only a text track. Buffered ranges are computed from audio and video only. They assume text streams are sparse and have unannounced gaps
21:41:11 ... With the current algorithms, the buffered range never grows, so playback cannot start
21:42:00 ... In many cases, if you haven't buffered text, you don't want to play. Without explicit gaps, you can't do this, or only if there's also audio and video in the stream
21:42:28 ... Now, consider cues that go across segment boundaries
21:43:02 ... If we're splitting an MP4 file for adaptive streaming, using the source label and source id we can identify copies of the cue in different fragments
21:43:36 ... The MSE spec doesn't specify extending cues at the moment, so it doesn't describe how this should be handled, or whether it's mandatory or a quality of implementation issue
21:44:03 ... And how should it be presented to the user? Update the cue and emit an oncuechange event? The spec should clarify
21:44:29 ... In WebM, the MSE bytestream spec doesn't describe it at all
21:45:11 ... Are the representations that WebM and Matroska give us good enough?
21:45:25 ... We could advocate changes in the IETF
21:46:20 ... Next, embedded text tracks; common examples are CEA-608 and 708.
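[The buffered-range problem described at 21:40:28-21:41:11 can be sketched with a simplified model: a SourceBuffer's buffered ranges are (roughly) the intersection of its audio and video track buffers, with text tracks excluded, so a buffer holding only a text track never reports anything buffered. This is a sketch, not the full MSE algorithm, which also handles the highest end time and the ended state.]

```python
def intersect_ranges(a, b):
    """Intersect two sorted lists of (start, end) ranges."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        s = max(a[i][0], b[j][0])
        e = min(a[i][1], b[j][1])
        if s < e:
            out.append((s, e))
        # Advance whichever range ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out

def source_buffer_buffered(av_track_ranges):
    """av_track_ranges: buffered ranges per audio/video track only.

    Returns their intersection. With no audio/video tracks (a
    text-only SourceBuffer in this model) the result is empty,
    so playback can never start.
    """
    if not av_track_ranges:
        return []
    result = av_track_ranges[0]
    for r in av_track_ranges[1:]:
        result = intersect_ranges(result, r)
    return result
```

For example, audio buffered over [0, 10) and video over [2, 8) intersect to [2, 8), while a text-only buffer contributes no audio/video track ranges at all and yields an empty list.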
Generally the problem is that we don't know in advance that we have them. They appear inside SEI messages
21:46:55 ... There's also ID3 timed text, which has a similar problem: it's in interleaved chunks between fragments
21:47:16 ... It's been discussed before, but there's no support in MSE. Interesting to keep in mind
21:47:58 ... Those are the questions I've identified so far that would be interesting to discuss as we try to mature the support for timed text tracks in MSE
21:48:45 Francois: It resonates with past experience in the Multi-Device Timing CG, where we worked on synchronising things on a timeline
21:49:14 ... We realised there are scenarios where you want play/pause/seek, but no audio and video. The only way to do that now is to create silent audio and attach a text track to it
21:49:23 ... You can't just play a text track
21:51:13 Alicia: I'm working on the WebKit implementation. It's not working yet
21:51:38 ... (media containers with only text track content)
21:52:14 Francois: So that part is under-specified in MSE. No-one was implementing it at the time
21:53:20 Alicia: I work on the GStreamer port. Apple also has support, in the technical preview
21:54:38 scribe+
21:55:11 cpn: For the emsg box, we were working with DASH-IF. They had an abstracted processing model for these event message tracks.
21:55:29 ... There did not seem to be a real push from media industries to get that into browsers.
21:56:49 https://dashif.org/docs/EventTimedMetadataProcessing-v1.0.2.pdf
21:57:09 Alicia: Also see timed text in the CMAF spec from AOM
21:57:29 cpn: I think the processing was very similar to the processing you describe for WebVTT cues.
21:57:56 ... When you create your media segments for downloading, they are duplicated across segments, and identifiers help to relate them.
21:58:59 ... We never went as far as defining the processing in MSE. That was the initial plan though. Just not enough push from other people at the time. That was a few years ago.
I don't know if the situation has changed. There may be more interest today.
21:59:13 ... I also don't know the standardization progress in MPEG.
21:59:20 ... I can certainly follow up on that.
22:00:27 ... My suggestion is to take this to the Media WG, where you may assume familiarity with WebVTT. Let's figure out if there's interest to do it!
22:00:28 RRSAgent, draft minutes
22:00:29 I have made the request to generate https://www.w3.org/2025/03/26-mse-text-tracks-minutes.html tidoust
23:00:16 RRSAgent, bye
23:00:16 I see no action items