10:19:35 RRSAgent has joined #mse-text-tracks
10:19:40 logging to https://www.w3.org/2025/03/26-mse-text-tracks-irc
10:19:40 RRSAgent, do not leave
10:19:41 RRSAgent, this meeting spans midnight
10:19:41 RRSAgent, make logs public
10:19:42 Meeting: Revisiting in-band text tracks in MediaSource Extensions
10:19:42 Chair: ntrrgc
10:19:42 Agenda: https://github.com/w3c/breakouts-day-2025/issues/14
10:19:42 Zakim has joined #mse-text-tracks
10:19:43 Zakim, clear agenda
10:19:43 agenda cleared
10:19:43 Zakim, agenda+ Pick a scribe
10:19:44 agendum 1 added
10:19:44 Zakim, agenda+ Reminders: code of conduct, health policies, recorded session policy
10:19:44 agendum 2 added
10:19:44 Zakim, agenda+ Goal of this session
10:19:45 agendum 3 added
10:19:45 Zakim, agenda+ Discussion
10:19:45 agendum 4 added
10:19:45 Zakim, agenda+ Next steps / where discussion continues
10:19:46 agendum 5 added
10:19:46 Zakim, agenda+ Adjourn / Use IRC command: Zakim, end meeting
10:19:46 agendum 6 added
10:19:46 breakout-bot has left #mse-text-tracks
10:27:05 tidoust has joined #mse-text-tracks
19:32:02 alicia has joined #mse-text-tracks
20:57:29 cpn has joined #mse-text-tracks
21:11:24 present+ Alicia_Boya_Garcia, Chris_Needham, Francois_Daoust
21:11:40 scribe+ cpn
21:12:22 Alicia: Open questions about text tracks in MSE
21:13:10 ... I assume you're familiar with MSE, but not necessarily with text track formats
21:13:48 ... Out-of-band formats like SRT and WebVTT, or in-band formats in the media container such as MP4, WebM, Matroska
21:14:28 ... I'll introduce WebVTT and the features that make implementation of in-band tracks tricky
21:14:45 ... Challenges and open questions on implementation of text tracks in MSE
21:14:50 Topic: WebVTT
21:15:31 Alicia: This has been supported in browsers for a long time. It's a reasonable first target. We support it as an out-of-band text track format. We could also support it in-band when you call appendBuffer()
21:16:09 ... The syntax is based on SRT.
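[For reference, a minimal WebVTT file sketching the syntax discussed in this session: a header, an optional STYLE block and REGION definition, a comment block, cue settings, inline markup, overlapping cues, and timestamp tags for delayed parts. The contents are illustrative, not taken from the session.]

```
WEBVTT

STYLE
::cue { color: yellow }

NOTE This comment block is ignored by players.

REGION
id:bottom
lines:2

1
00:00:01.000 --> 00:00:04.000 align:center line:90%
Hello <i>world</i>

2
00:00:02.000 --> 00:00:05.000 region:bottom
[door slams]

3
00:00:06.000 --> 00:00:09.000
These words appear <00:00:07.000>one <00:00:08.000>by one
```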
There are cues with start and end timestamps and content, which can include markup for styling
21:16:26 ... Cues can have settings, to allow customisation of the cue, such as its position or alignment
21:17:05 ... When you load the cues, you have APIs to retrieve the cues and control them programmatically
21:17:15 ... Comment blocks are useful when authoring text tracks
21:17:35 ... WebVTT documents can contain stylesheets, which must come before any cues in the file
21:17:50 ... Regions allow you to define specific portions of the video where cues will appear
21:18:00 ... WebVTT allows cues to overlap in time
21:18:32 ... An example: closed captions and the textual representation of sound effects, both of which can happen at the same time
21:19:09 ... You can have delayed parts in a cue, using angle brackets
21:20:39 ... Now let's look at in-band WebVTT, when you put it into a container format. Two formats: ISO BMFF, which I'll call MP4, and WebM / Matroska. WebM is a subset of Matroska
21:20:57 ... I'm not aware of any representation in MP2TS
21:21:22 ... For ISO BMFF, we have two specs. WebVTT is in Part 30, along with TTML
21:22:01 ... In the init segment, in the moov box, there's a WebVTT sample entry with the codec configuration
21:22:53 ... This has two boxes inside: one for the file header and stylesheets, then (optionally) the source label box, a URI that uniquely identifies the WebVTT document
21:24:19 ... For media segments, the timing of the cues is handled by the container. Cues are handled like regular video frames. The difference is that WebVTT allows overlapping cues, but MP4 isn't normally meant to be used that way, so the cues are split into non-overlapping frames
21:24:51 ... Two types of frames: gaps, which represent the absence of a cue, and non-gaps, where you have VTT cue boxes and VTT additional boxes
21:25:43 ... In the VTT cue box, you have an optional source id box, which together with the source label box allows a cue to be uniquely identified
21:26:50 ...
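[The split of overlapping cues into non-overlapping frames mentioned at 21:24:19 can be sketched in Python. This is a simplified model, not the Part 30 algorithm: each output interval carries the set of cues active in it, and an empty set corresponds to a gap sample.]

```python
def split_into_samples(cues):
    """Split possibly-overlapping cues into non-overlapping samples.

    cues: list of (start, end, payload) tuples, times in seconds.
    Returns a list of (start, end, payloads), where payloads is the
    tuple of cues active in that interval; an empty tuple models a
    gap (an empty cue sample) between cues.
    """
    # Every cue start/end is a potential sample boundary.
    boundaries = sorted({t for s, e, _ in cues for t in (s, e)})
    samples = []
    for start, end in zip(boundaries, boundaries[1:]):
        # A cue is active in the interval if it fully covers it.
        active = tuple(p for s, e, p in cues if s <= start and end <= e)
        samples.append((start, end, active))
    return samples
```

For two overlapping cues (a caption over 0-4s and a sound effect over 2-3s) this yields three samples, and a later cue at 6-8s produces an explicit gap sample at 4-6s.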
If you mix and match different WebVTT documents that have been muxed into MP4, you can still uniquely identify the cues
21:27:45 ... A cue time box is used for cues with delayed parts. You write the original start time of the cue, used as a reference to compute the time of the delayed parts. If there's an edit list, the delayed parts still work
21:27:56 subtopic: WebM
21:28:24 Alicia: There are two kinds of representations. One is from when it was less mature, but it has adoption, e.g., in ffmpeg
21:28:50 ... As a consequence of being early, it doesn't support the file header, so we can't include stylesheets, and there are no delayed parts
21:29:46 ... The later draft, from 2018, on Matroska, addresses both those problems. Delayed parts are defined as the offset from the start of the frame
21:30:53 ... Commonalities and differences between the MP4 and WebM representations: timing is handled by both containers, but gaps aren't explicitly encoded
21:31:16 Topic: MSE
21:31:46 i/... For ISO BMFF/subtopic: ISO BMFF/
21:32:15 Alicia: Several questions with MSE and text tracks, and other related topics
21:32:51 ... How many coded frames is a WebVTT cue?
21:33:37 ... Should it depend on the container format, be an implementation detail, or something else?
21:34:15 ... The answer touches on the other questions
21:34:42 ... The next question is about gaps and sparse streams
21:35:02 ... Is an empty cue box an MSE coded frame? The answer could depend on the previous question
21:35:33 ... Other formats work differently. For example, there's 3GPP timed text, which is commonly used in MP4, where gaps are encoded as cues with empty text
21:36:05 ... If a browser wanted to support 3GPP timed text in MP4 (not unreasonable), could those gaps be cues?
21:36:44 ... Also container formats: MP4 makes it easier to encode gaps than not to. In Matroska, that's not a problem, and implementations don't do it. Is that a problem for MSE? It causes some difficulties
21:37:17 ... Gaps are also useful for audio and video.
An audio gap is an intentionally silent section, and for video no new frame is played
21:37:44 ... There are some use cases for MSE where gaps can be useful. We've talked about those before in MEIG meetings
21:38:22 ... One is live playback, where you have audio and video in separate SourceBuffers. For live streams you want to prioritise getting the latest information
21:38:52 ... If you can't download the video in time, but you have the audio, you could continue playing the audio. This is not covered in the MSE spec
21:39:30 ... Another use case: if you want to insert an ad where you only have either audio or video, so you transition from audio+video to only audio or only video, and back. Gaps could also work in this case
21:40:28 ... There's also the problem of a buffer with only a text track. Buffered ranges are computed from audio and video only. They assume text streams are sparse and have unannounced gaps
21:41:11 ... With the current algorithms, the buffered range never grows, so playback cannot start
21:42:00 ... In many cases, if you haven't buffered text, you don't want to play. Without explicit gaps, you can't do this, or only if there's also audio and video in the stream
21:42:28 ... Now, consider cues that go across segment boundaries
21:43:02 ... If we're splitting an MP4 file for adaptive streaming, using the source label and source id we can identify copies of the cue in different fragments
21:43:36 ... The MSE spec doesn't specify extending cues at the moment, so it doesn't describe how this should be handled, or whether it's mandatory or a quality of implementation issue
21:44:03 ... And how should it be presented to the user? Update the cue and emit an oncuechange event? The spec should clarify
21:44:29 ... In WebM, the MSE bytestream spec doesn't describe it at all
21:45:11 ... Are the representations that WebM and Matroska give us good enough?
21:45:25 ... We could advocate changes in the IETF
21:46:20 ... Next, embedded text tracks; common examples are CEA-608 and 708.
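[The buffered-range problem described at 21:40:28-21:41:11 can be sketched with a simplified model: a SourceBuffer's buffered ranges are (roughly) the intersection of its audio and video track buffers, with text tracks excluded, so a buffer holding only a text track never reports anything buffered. This is a sketch, not the full MSE algorithm, which also handles the highest end time and the ended state.]

```python
def intersect_ranges(a, b):
    """Intersect two sorted lists of (start, end) ranges."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        s = max(a[i][0], b[j][0])
        e = min(a[i][1], b[j][1])
        if s < e:
            out.append((s, e))
        # Advance whichever range ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out

def source_buffer_buffered(av_track_ranges):
    """av_track_ranges: buffered ranges per audio/video track only.

    Returns their intersection. With no audio/video tracks (a
    text-only SourceBuffer in this model) the result is empty,
    so playback can never start.
    """
    if not av_track_ranges:
        return []
    result = av_track_ranges[0]
    for r in av_track_ranges[1:]:
        result = intersect_ranges(result, r)
    return result
```

For example, audio buffered over [0, 10) and video over [2, 8) intersect to [2, 8), while a text-only buffer contributes no audio/video track ranges at all and yields an empty list.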
Generally the problem is that we don't know in advance that we have them. They appear inside SEI messages
21:46:55 ... There's also ID3 timed text, which has a similar problem: it's in interleaved chunks between fragments
21:47:16 ... It's been discussed before, but there's no support in MSE. Interesting to keep in mind
21:47:58 ... Those are the questions I've identified so far that would be interesting to discuss as we try to mature the support for timed text tracks in MSE
21:48:45 Francois: It resonates with past experience in the Multi-Device Timing CG, where we worked on synchronising things on a timeline
21:49:14 ... We realised there are scenarios where you want play/pause/seek, but no audio and video. The only way to do that now is to create silent audio and attach a text track to it
21:49:23 ... You can't just play a text track
21:51:13 Alicia: I'm working on the WebKit implementation. It's not working yet
21:51:38 ... (media containers with only text track content)
21:52:14 Francois: So that part is under-specified in MSE. No-one was implementing it at the time
21:53:20 Alicia: I work on the GStreamer port. Apple also has support, in the technical preview
21:54:38 scribe+
21:55:11 cpn: For the emsg box, we were working with DASH-IF. They had an abstracted processing model for these event message tracks.
21:55:29 ... There did not seem to be a real push from media industries to get that into browsers.
21:56:49 https://dashif.org/docs/EventTimedMetadataProcessing-v1.0.2.pdf
21:57:09 Alicia: Also see timed text in the CMAF spec from AOM
21:57:29 cpn: I think the processing was very similar to the processing you describe for WebVTT cues.
21:57:56 ... When you create your media segments for downloading, they are duplicated across segments, and identifiers help to relate them.
21:58:59 ... We never went as far as defining the processing in MSE. That was the initial plan though. Just not enough push from other people at the time. That was a few years ago.
I don't know if the situation has changed. There may be more interest today.
21:59:13 ... I also don't know the standardization progress in MPEG.
21:59:20 ... I can certainly follow up on that.
22:00:27 ... My suggestion is to take this to the Media WG, where you may assume familiarity with WebVTT. Let's figure out if there's interest to do it!
22:00:28 RRSAgent, draft minutes
22:00:29 I have made the request to generate https://www.w3.org/2025/03/26-mse-text-tracks-minutes.html tidoust
23:00:16 RRSAgent, bye
23:00:16 I see no action items