16:01:39 RRSAgent has joined #mediawg
16:01:43 logging to https://www.w3.org/2025/06/17-mediawg-irc
16:01:47 Meeting: Media WG
16:02:00 Chair: Chris_Needham
16:02:35 Present+ Chris_Needham, Eugene_Zemtsov, Alicia_Boya_Garcia, Scott_Kidder, Nigel_Megitt
16:02:54 Present+ Nishitha_Dey
16:04:33 markafoltz has joined #mediawg
16:04:39 Present+ Mark_Foltz, Francois_Daoust
16:04:42 alicia has joined #mediawg
16:05:38 scribe+ nigel
16:05:45 scribe+ cpn
16:06:03 Agenda: https://github.com/w3c/media-wg/blob/main/meetings/2025-06-17-Media_Working_Group_Teleconference-agenda.md
16:06:39 Topic: Accidental trimming of overlapping text cues
16:06:55 https://github.com/w3c/media-source/issues/363 -> GitHub issue 363
16:07:28 -> https://www.w3.org/2025/03/breakouts-day-2025/recordings/recording-14.html 2025 Breakouts Day recording
16:08:12 Slides: https://ntrrgc.github.io/w3c-breakouts-2025-mse-text-tracks/
16:08:30 alicia: It's unclear how many coded frames a WebVTT cue corresponds to.
16:08:44 .. In WebVTT, cues can overlap, and depending on the container format a cue might not equal one frame.
16:08:48 .. This is the case for MP4.
16:09:05 .. In WebM, for example, a cue is one frame, because in WebM frames can overlap - the container allows it.
16:09:54 .. The issue I reported, #363, is that depending on the interpretation a cue could get lost.
16:10:00 .. I explained a case in which that would happen.
16:10:22 .. Then Jer proposed a change to one of the frame processing steps to solve that particular issue.
16:10:45 .. Then in the slides I had many other open questions.
16:11:01 .. One of them has an open issue - about embedded text tracks, how we could support them if we
16:11:04 .. ever wanted to.
16:11:13 cpn: Was that issue #358?
16:11:46 alicia: It might be related, but not the one I had in mind.
16:11:55 .. #258 is old, from 2016!
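As a minimal sketch of the overlap alicia describes (the cue timings and text below are invented for illustration), two WebVTT cues can be active at the same time:

```javascript
// Two hypothetical WebVTT cues, e.g.:
//   00:00:01.000 --> 00:00:05.000  "First speaker"
//   00:00:03.000 --> 00:00:07.000  "Second speaker"
// Both are on screen from t=3s to t=5s. In WebM each cue can be one
// frame because the container allows overlapping frames; in MP4 the
// cue-to-sample mapping is less direct.
const cues = [
  { start: 1, end: 5, text: "First speaker" },
  { start: 3, end: 7, text: "Second speaker" },
];
const overlaps = (a, b) => a.start < b.end && b.start < a.end;
console.log(overlaps(cues[0], cues[1])); // → true
```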
16:12:04 https://github.com/w3c/media-source/issues/58
16:12:40 s/358/58
16:12:44 s/258/58
16:12:52 https://github.com/w3c/media-source/issues/58 describes how we could handle CEA-608
16:12:56 alicia: This issue describes how we could handle CEA-608
16:13:16 .. The one you mentioned, Chris...
16:13:36 Chris mentioned https://github.com/w3c/media-source/issues/358, which is a more generic issue about text track formats in MSE
16:13:36 cpn: #358 was actually Nigel's issue - the overall question of MSE and text track handling and interop.
16:14:08 s/Was that issue #58/Was that issue #358
16:14:20 alicia: We can discuss whichever of our many issues you want!
16:15:16 Nigel: I'm familiar with timed text formats, but it's more complicated in this context. Cyril would be good to include
16:15:28 Nigel: Cyril Concolato from Netflix
16:15:57 i/Nigel: I/alicia: Who do we know who knows about text tracks in wrappers?
16:16:35 alicia: I also talk about this a bit in the @@@ hackfest
16:16:46 .. Regarding what should be considered a coded frame, it would make sense to have
16:16:51 .. consistency amongst containers.
16:17:04 .. Defining that, e.g., one cue is one frame, if possible, would make sense.
16:17:55 Nigel: Define "coded frame"? Containers talk about samples and segments etc.
16:18:19 Alicia: So far one coded frame in MSE seems to correspond to one sample in MP4 or one Matroska block
16:18:31 .. The question is whether we still want that to be the case.
16:18:32 scott_kidder has joined #mediawg
16:19:46 Nigel: If you think about a sample of a TTML payload, it wouldn't correspond to a single cue, it would be multiple cues. So imposing one sample = one coded frame = one cue would be the wrong way to go
16:20:19 ... One thing that could theoretically be done is delivering the TTML payload as multiple ISDs, but that would be a new spec. There's no spec for ISDs in containers
16:20:35 ... The nice thing is they don't overlap
16:20:49 Alicia: What you describe resembles how WebVTT works in MP4
16:21:19 Nigel: I thought with WebVTT in MP4 you deliver each cue once with begin/end times, but that can have overlaps
16:21:48 Alicia: Each sample has the data for all the cues, so cues are repeated when there are overlaps. So this can work fine
16:21:57 ... Not sure how similar that is in TTML, I haven't studied that
16:22:38 Nigel: The TTML model in MP4 is that the timeline is divided into a series of chunks, and there's a sample for each of those. Each is a document that describes what happens in the sample period
16:22:47 ... So you have everything you need for the sample period.
16:23:21 ... If there's any change to the appearance of the cues, that's one ISD
16:23:57 cpn: What would we like to focus on: 608/708, cues, or a broader conversation like TTML etc.?
16:24:15 alicia: I'm okay with whatever, but one of the most important things is the definition of coded frame
16:24:24 .. for text tracks, because it has lots of implications.
16:24:53 .. Nigel, to clarify: the way things work with TTML in MP4, there are no changes of cues within one MP4 sample?
16:25:02 Nigel: No, there are changes
16:26:24 ... I need to understand MSE coded frames better. TTML samples in MP4 don't overlap each other
16:26:41 Alicia: That's also the case for WebVTT samples in MP4. The one container where they do overlap is WebM
16:27:02 ... I looked at source code, and found there isn't much support for overlapping cues in practice
16:27:14 cpn: Which implementations have you looked at?
16:27:27 alicia: WebKit / macOS, because it is the one that I know has shipped
16:27:38 .. I was also given a hint that Opera might have shipped it for TV players
16:27:45 .. I have not been able to confirm that.
16:28:16 cpn: I think WebKit is the only one of the mainstream desktop and mobile browsers that has this.
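Alicia's description of WebVTT carriage in MP4 above (each sample carries the data for all cues active during it, so cues are repeated across overlaps) can be sketched as follows. This is an illustrative reconstruction, not the normative ISO 14496-30 algorithm; `cuesToSamples` and the sample data are invented for the example:

```javascript
// Sketch: turn overlapping cues into non-overlapping sample intervals,
// where each sample repeats every cue active during it.
function cuesToSamples(cues) {
  // Interval boundaries are every distinct cue start/end time, sorted.
  const times = [...new Set(cues.flatMap(c => [c.start, c.end]))].sort((a, b) => a - b);
  const samples = [];
  for (let i = 0; i < times.length - 1; i++) {
    const [start, end] = [times[i], times[i + 1]];
    const active = cues.filter(c => c.start < end && c.end > start);
    // A real muxer would also emit explicitly empty samples for gaps
    // with no active cues; omitted here for brevity.
    if (active.length) samples.push({ start, end, cues: active.map(c => c.text) });
  }
  return samples;
}

const samples = cuesToSamples([
  { start: 1, end: 5, text: "A" },
  { start: 3, end: 7, text: "B" },
]);
console.log(samples);
// Resulting samples: 1-3 carries ["A"], 3-5 carries ["A", "B"]
// (cue "A" repeated), 5-7 carries ["B"] - no sample overlaps another.
```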
16:28:22 alicia: And it is very recent
16:28:34 cpn: I can imagine there are parties that take it and do custom TV implementations that
16:28:36 .. may have this.
16:28:48 alicia: Yes, which is why finding out about them would be very useful.
16:29:10 cpn: Have you studied the MSE algorithms?
16:29:18 .. It starts from the segment parser loop.
16:29:26 alicia: Many times, but I can't talk about them from memory!
16:29:38 cpn: That's where it starts, then it goes into coded frames, then the byte stream formats,
16:29:53 .. and from the byte stream format spec it refers to the sourcing in-band tracks document.
16:30:27 alicia: [shares screen showing ISO BMFF Byte Stream Format]
16:30:40 cpn: Nothing to mention in particular, I was just following the flow of specifications.
16:30:56 .. There's a note in the MSE spec that says that text track handling is handled through the format
16:31:11 .. registrations, and then here in this rather old document it explains how it works.
16:31:25 alicia: There was an effort, I haven't had time to go through it.
16:31:41 cpn: This contains a mixture of things that may be implemented and things that have never been implemented.
16:31:52 .. It's not clear to me which parts of this are still accurate, and which parts are not.
16:32:16 alicia: [Mapping Text Track content into text track cues] It mentions the 3GPP timed text format,
16:32:25 .. one of the most common text track formats for MP4, oddly enough.
16:32:49 .. It has a particularity where, if you have no cue showing up, that is coded as a cue with empty content,
16:32:58 .. and I was wondering how that would be coded into MSE.
16:33:10 cpn: You may well find a gap - reading this document and the byte stream document,
16:33:18 .. I don't know how well defined that is.
16:33:34 .. Can you trace through all the specs from MSE down to get to the exact steps you need to follow?
16:33:37 alicia: I should be able to
16:33:43 cpn: That's the goal, we should be able to do that.
16:33:59 alicia: Looking at this section because it looks like the most promising one to answer
16:34:03 .. the question about coded frames.
16:34:41 .. It mentions the "yet to be defined TTMLCue"!
16:34:55 cpn: That's what I meant, I'm not sure what's been implemented or even defined.
16:35:29 alicia: [Reads about TTML subtitle samples]
16:36:25 Nigel: The idea that you specify the start/end time of the intermediate document makes sense, but I'm not sure that maps to a single cue. The doc seems to assume that within that interval there's one cue defined, which is not a valid assumption
16:36:36 Alicia: That would be problematic, yes
16:36:48 s/intermediate document/TTML document
16:37:23 Present+ Eric_Carlson
16:38:39 Nigel: This document is old and doesn't match my current understanding of how things would work, especially for TTML
16:39:20 cpn: Coming back to Jer's very specific suggestion about a tweak to the coded frame algorithm.
16:39:32 .. He suggests amending step 14.
16:39:44 alicia: That would still be relevant no matter what we choose here, for the WebM case.
16:39:58 .. And it would be relevant for all formats if we decide that one cue is one coded frame.
16:40:24 Nigel: At the moment, coded frames never overlap temporally?
16:40:45 Alicia: The MSE spec does seem to assume the possibility of overlap. I found it a bit ambiguous, though
16:41:35 alicia: [opens the MSE draft]
16:41:41 cpn: There is a definition in MSE:
16:41:46 Chris: "A unit of media data that has a presentation timestamp, a decode timestamp, and a coded frame duration."
16:42:21 ... For video and text, the duration indicates how long the video frame or text SHOULD be displayed
16:42:45 alicia: That doesn't answer the question
16:42:48 cpn: Agree, it doesn't
16:43:03 .. The duration of how long a piece of text should be displayed is independent of that
16:43:15 alicia: You could imagine 3 coded frames, each with the same text and the same duration,
16:43:23 .. and that definition would still work the same.
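The ambiguity alicia points out can be shown with a toy example (all values invented): the MSE coded frame definition, which only requires a timestamp and a duration, cannot distinguish one cue carried as a single coded frame from the same cue split into three consecutive frames with identical text:

```javascript
// One cue as a single coded frame...
const asOneFrame = [{ pts: 0, duration: 3, text: "Hello" }];
// ...or the same cue as three consecutive frames repeating the text.
const asThreeFrames = [
  { pts: 0, duration: 1, text: "Hello" },
  { pts: 1, duration: 1, text: "Hello" },
  { pts: 2, duration: 1, text: "Hello" },
];
// Both satisfy "a presentation timestamp and a coded frame duration",
// and both describe the same on-screen result: "Hello" from t=0 to t=3.
function renderedInterval(frames) {
  const start = Math.min(...frames.map(f => f.pts));
  const end = Math.max(...frames.map(f => f.pts + f.duration));
  return [start, end];
}
console.log(renderedInterval(asOneFrame));    // → [0, 3]
console.log(renderedInterval(asThreeFrames)); // → [0, 3]
```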
16:43:56 alicia: It doesn't seem that the ISO BMFF byte stream spec mentions coded frames
16:44:03 cpn: No, it talks about segments
16:44:07 alicia: That's different
16:44:09 cpn: Yes
16:45:39 Nigel: It seems like we need a mapping of concepts or terms that can be applied across different specs. That people use different terms in different specs isn't helping
16:46:00 alicia: Yes, we're discussing this because the definition of MSE coded frames is insufficient.
16:46:11 Alicia: The definition of MSE coded frame and WebVTT cue in MP4 isn't clear
16:46:12 .. It doesn't explain the relationship to e.g. WebVTT cues in MP4.
16:48:21 cpn: I think step 2 is okay because it doesn't talk about the payload type, i.e. video/audio/text,
16:48:30 .. and it doesn't say if one coded frame is 1 cue or not
16:48:35 alicia: That makes sense for the base MSE spec
16:48:47 .. The problem is that the ISO BMFF Byte Stream Format doc doesn't tell us
16:49:04 .. what a coded frame is, but sends us to the unofficial draft, and that doesn't seem to answer it either.
16:49:23 cpn: The ISO BMFF doc also doesn't describe a coded frame for audio or video
16:49:32 alicia: ISO BMFF has the concept of sample
16:49:36 s/sample/samples
16:49:57 .. As far as I know coded frame is specifically an MSE term
16:50:02 cpn: Yes, I believe so
16:50:17 .. Does MSE define what a sample is?
16:50:22 alicia: I don't think so, but let me check
16:50:36 cpn: The ISO BMFF doc talks about sample - do we have a definition?
16:51:00 alicia: I see the word sample used in the ISO BMFF spec and it's all about PCM samples
16:51:09 Nigel: That doesn't sound right
16:51:18 cpn: In the MSE spec that's what sample refers to
16:51:22 Nigel: Oh, I see
16:51:39 cpn: It's correct for the MSE spec, but the ISO BMFF spec talks about samples in a different way
16:52:05 alicia: Yes, that's why I call the audio ones PCM samples and I might refer to the MP4 ones as MP4 samples.
16:52:07 cpn: Yes
16:52:27 Nigel: Makes sense to me to qualify the usage of the terms
16:53:27 cpn: [Reads from the byte stream spec]
16:54:01 .. Do VTTCues and TTMLCues always have a start time and an end time when they're encoded into the MP4,
16:54:14 .. or is there a case where the end time is not known, and gets set later?
16:55:29 Nigel: A sample in MP4 always has a start time and an end time. If you have a cue that lasts over many MP4 samples, it's repeated, then you have an MP4 sample that shows that it ends
16:56:03 ... For TTML, if every sample is 1 second and you have a single piece of text lasting 3.5 seconds, you'd see that text in seconds 0-1, 1-2, 2-3, 3-3.5
16:56:33 ... Then the document for the last one would effectively have more than one snapshot presentation in it: the thing you show for the first 0.5 seconds, then the latter
16:57:26 ... This means that if you have a situation where text is created in real time, your encoder has to deal with that, and there's a latency involved
16:58:11 ... For low latency applications there are schemes that let you deliver video frames before the end of the sample. But nobody is doing that for timed timed AFAIK
16:58:23 s/timed timed/timed text
16:58:41 cpn: I'm asking because there's this notion of a coded frame with a known duration.
16:58:50 .. Where do we go? We have 2 minutes of the call left.
16:58:57 .. I want to help you get to the bottom of all of this!
16:59:17 .. Given some people have dropped off, we are in a position where e.g. Chrome has no handling
16:59:31 .. of timed text in media containers that I know of, and I don't know that they want to develop or
16:59:34 .. implement that.
16:59:47 .. As a WG there's a question of interop that we want to get to, ideally, with this.
16:59:56 .. I think there are multiple questions.
17:00:10 .. One is: can we figure out the detail to make all of this consistent within your own implementation work,
17:00:20 .. and then separately, how do we bring this to more implementations in general.
17:00:30 .. That's something that my organisation is interested in getting us to,
17:00:39 .. we would love to get this more widely supported.
17:00:58 .. The difficulty is that engineers aren't motivated to figure it out if they're not implementing it.
17:01:19 alicia: Another question is whether it would be feasible to write a polyfill for appendBuffer
17:01:39 eric: I think she was saying that it might be possible to do a polyfill to make it easier
17:01:45 .. when there isn't native support.
17:01:53 alicia: Yes, that's what I was trying to say
17:01:57 cpn: Yes, that would help
17:02:28 alicia: One of the problems with the polyfill approach is how to make sure you don't get
17:02:37 .. both the polyfill text track and the browser text track
17:02:51 eric: We would need to figure out a way to feature detect, though I can't imagine
17:02:59 .. how we would support feature detection for text tracks in MSE.
17:03:16 alicia: The more general problem for this type of polyfill is something I noticed a few weeks ago
17:03:33 .. with some trailers from the iTunes store: the MP4 files have CEA-608 captions,
17:03:46 .. but the player in the page assumed they couldn't be rendered and instead it fetched a
17:03:52 .. separate out-of-band WebVTT track
17:04:09 .. You could imagine that if we get in-band text tracks working, then the many polyfills that
17:04:19 .. exist could conflict with the browser implementation.
17:04:34 eric: Right, a polyfill could detect when text tracks are added and disable them itself,
17:04:46 .. which is what the controls for the iTunes trailers should be doing but obviously aren't.
17:04:55 alicia: I don't think you support CEA-608 in MSE, do you?
17:05:15 eric: No, we do in MP4 and in transport streams, but not in MSE
17:05:27 .. That's via AVFoundation.
17:05:48 cpn: We are over time, but this is quite a valuable conversation.
17:05:53 .. What do people want to do?
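The appendBuffer polyfill idea alicia raises could be wired up roughly as below. This is a sketch only: `MockSourceBuffer`, `extractTextSamples`, and `polyfillTextTracks` are all invented names, the scanner is a stub (a real polyfill would demux the ISO BMFF segment), and as eric notes it would still need some form of feature detection to avoid duplicating a native text track:

```javascript
// Mock stand-in for a real MSE SourceBuffer (illustration only).
class MockSourceBuffer {
  constructor() { this.appended = []; }
  appendBuffer(data) { this.appended.push(data); }
}

// Stub scanner: a real polyfill would parse the container segment here
// and pull out text-track samples (e.g. WebVTT cues) before forwarding.
function extractTextSamples(segment) {
  return segment.textSamples ?? [];
}

// Wrap appendBuffer so every appended segment is also scanned for text
// samples, which are handed to a callback for out-of-band rendering.
function polyfillTextTracks(sourceBuffer, onCue) {
  const original = sourceBuffer.appendBuffer.bind(sourceBuffer);
  sourceBuffer.appendBuffer = (segment) => {
    for (const cue of extractTextSamples(segment)) onCue(cue);
    original(segment); // the media data still reaches the real buffer
  };
}

// Usage with the mock:
const sb = new MockSourceBuffer();
const captured = [];
polyfillTextTracks(sb, (cue) => captured.push(cue));
sb.appendBuffer({ textSamples: ["Hello"], media: "<bytes>" });
console.log(captured);           // → the intercepted cue, "Hello"
console.log(sb.appended.length); // → 1
```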
17:05:58 alicia: We've been talking for long enough
17:06:06 cpn: I would like concrete next steps
17:06:27 eric: Sounds like another meeting is needed, hopefully when I have a stable connection
17:06:45 cpn: We mentioned asking Cyril. Is there anyone else? DASH-IF may have some expertise here.
17:06:54 .. (Iraj at the time)
17:07:07 eric: Gary might, because of the polyfill work he did in his previous job
17:07:59 cpn: Let's try to get the right people together and reconvene then.
17:08:24 .. Having all this better defined would be a good thing
17:08:32 .. Happy to keep the conversation going to let us do that
17:09:44 nigel: Suggest offline planning.
17:09:55 .. Alicia, or anyone: if you have questions about TTML, please do ask me
17:10:07 eric: Apologies for missing most of the meeting
17:10:17 cpn: That's okay, these things happen. Glad you're here.
17:10:28 .. Meeting adjourned, thank you. Bye bye!
17:10:32 rrsagent, make minutes
17:10:33 I have made the request to generate https://www.w3.org/2025/06/17-mediawg-minutes.html nigel
17:15:49 rrsagent, make log public
18:55:40 Zakim has left #mediawg