16:01:39 RRSAgent has joined #mediawg
16:01:43 logging to https://www.w3.org/2025/06/17-mediawg-irc
16:01:47 Meeting: Media WG
16:02:00 Chair: Chris_Needham
16:02:35 Present+ Chris_Needham, Eugene_Zemtsov, Alicia_Boya_Garcia, Scott_Kidder, Nigel_Megitt
16:02:54 Present+ Nishitha_Dey
16:04:33 markafoltz has joined #mediawg
16:04:39 Present+ Mark_Foltz, Francois_Daoust
16:04:42 alicia has joined #mediawg
16:05:38 scribe+ nigel
16:05:45 scribe+ cpn
16:06:03 Agenda: https://github.com/w3c/media-wg/blob/main/meetings/2025-06-17-Media_Working_Group_Teleconference-agenda.md
16:06:39 Topic: Accidental trimming of overlapping text cues
16:06:55 https://github.com/w3c/media-source/issues/363 -> GitHub issue 363
16:07:28 -> https://www.w3.org/2025/03/breakouts-day-2025/recordings/recording-14.html 2025 Breakouts Day recording
16:08:12 Slides: https://ntrrgc.github.io/w3c-breakouts-2025-mse-text-tracks/
16:08:30 alicia: It's unclear how many coded frames a WebVTT cue corresponds to.
16:08:44 .. In WebVTT, cues can overlap, and depending on the container format a cue might not equal one frame.
16:08:48 .. This is the case for MP4.
16:09:05 .. In WebM, for example, a cue is one frame, because in WebM frames can overlap - the container allows it.
16:09:54 .. The issue I reported, #363, is that depending on the interpretation a cue could get lost.
16:10:00 .. I explained a case in which that would happen.
16:10:22 .. Then Jer proposed a change to one of the frame processing steps to solve that particular issue.
16:10:45 .. Then in the slides I had many other open questions.
16:11:01 .. One of them has an open issue - about embedded text tracks, how we could support them if we
16:11:04 .. ever wanted to.
16:11:13 cpn: Was that issue #358?
16:11:46 alicia: It might be related, but not the one I had in mind.
16:11:55 .. #258 is old, from 2016!
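As a minimal sketch of the overlap alicia describes (the cue timings and text below are invented for illustration), two WebVTT cues can be active at the same time:

```javascript
// Two hypothetical WebVTT cues, e.g.:
//   00:00:01.000 --> 00:00:05.000  "First speaker"
//   00:00:03.000 --> 00:00:07.000  "Second speaker"
// Both are on screen from t=3s to t=5s. In WebM each cue can be one
// frame because the container allows overlapping frames; in MP4 the
// cue-to-sample mapping is less direct.
const cues = [
  { start: 1, end: 5, text: "First speaker" },
  { start: 3, end: 7, text: "Second speaker" },
];
const overlaps = (a, b) => a.start < b.end && b.start < a.end;
console.log(overlaps(cues[0], cues[1])); // → true
```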
16:12:04 https://github.com/w3c/media-source/issues/58
16:12:40 s/358/58
16:12:44 s/258/58
16:12:52 https://github.com/w3c/media-source/issues/58 describes how we could handle CEA-608
16:12:56 alicia: This issue describes how we could handle CEA-608
16:13:16 .. The one you mentioned, Chris...
16:13:36 Chris mentioned https://github.com/w3c/media-source/issues/358, which is a more generic issue about text track formats in MSE
16:13:36 cpn: #358 was actually Nigel's issue - the overall question of MSE and text track handling and interop.
16:14:08 s/Was that issue #58/Was that issue #358
16:14:20 alicia: We can discuss whichever of our many issues you want!
16:15:16 Nigel: I'm familiar with timed text formats, but it's more complicated in this context. Cyril would be good to include
16:15:28 Nigel: Cyril Concolato from Netflix
16:15:57 i/Nigel: I/alicia: Who do we know who knows about text tracks in wrappers?
16:16:35 alicia: I also talk about this a bit in the @@@ hackfest
16:16:46 .. Regarding what should be considered a coded frame, it would make sense to have
16:16:51 .. consistency amongst containers.
16:17:04 .. Defining that, e.g., one cue is one frame, if possible, would make sense.
16:17:55 Nigel: Define "coded frame"? Containers talk about samples and segments etc.
16:18:19 Alicia: So far one coded frame in MSE seems to correspond to one sample in MP4 or one Matroska block
16:18:31 .. The question is whether we still want that to be the case.
16:18:32 scott_kidder has joined #mediawg
16:19:46 Nigel: If you think about a sample of a TTML payload, it wouldn't correspond to a single cue, it would be multiple cues. So imposing one sample = one coded frame = one cue would be the wrong way to go
16:20:19 ... One thing that could theoretically be done is delivering the TTML payload as multiple ISDs, but that would be a new spec. There's no spec for ISDs in containers
16:20:35 ... The nice thing is they don't overlap
16:20:49 Alicia: What you describe resembles how WebVTT works in MP4
16:21:19 Nigel: I thought with WebVTT in MP4 you deliver each cue once with begin/end times, but that can have overlaps
16:21:48 Alicia: Each sample has the data for all the cues, so cues are repeated when there are overlaps. So this can work fine
16:21:57 ... Not sure how similar that is in TTML, I haven't studied that
16:22:38 Nigel: The TTML model in MP4 is that the timeline is divided into a series of chunks, and there's a sample for each of those. Each is a document that describes what happens in the sample period
16:22:47 ... So you have everything you need for the sample period.
16:23:21 ... If there's any change to the appearance of the cues, that's one ISD
16:23:57 cpn: What would we like to focus on: 608/708, cues, or a broader conversation like TTML etc.?
16:24:15 alicia: I'm okay with whatever, but one of the most important things is the definition of coded frame
16:24:24 .. for text tracks, because it has lots of implications.
16:24:53 .. Nigel, to clarify: the way things work with TTML in MP4, there are no changes of cues within one MP4 sample?
16:25:02 Nigel: No, there are changes
16:26:24 ... I need to understand MSE coded frames better. TTML samples in MP4 don't overlap each other
16:26:41 Alicia: That's also the case for WebVTT samples in MP4. The one container where they do overlap is WebM
16:27:02 ... I looked at source code, and found there isn't much support for overlapping cues in practice
16:27:14 cpn: Which implementations have you looked at?
16:27:27 alicia: WebKit / macOS, because it is the one that I know has shipped
16:27:38 .. I was also given a hint that Opera might have shipped it for TV players
16:27:45 .. I have not been able to confirm that.
16:28:16 cpn: I think WebKit is the only one of the mainstream desktop and mobile browsers that has this.
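Alicia's description of WebVTT carriage in MP4 above (each sample carries the data for all cues active during it, so cues are repeated across overlaps) can be sketched as follows. This is an illustrative reconstruction, not the normative ISO 14496-30 algorithm; `cuesToSamples` and the sample data are invented for the example:

```javascript
// Sketch: turn overlapping cues into non-overlapping sample intervals,
// where each sample repeats every cue active during it.
function cuesToSamples(cues) {
  // Interval boundaries are every distinct cue start/end time, sorted.
  const times = [...new Set(cues.flatMap(c => [c.start, c.end]))].sort((a, b) => a - b);
  const samples = [];
  for (let i = 0; i < times.length - 1; i++) {
    const [start, end] = [times[i], times[i + 1]];
    const active = cues.filter(c => c.start < end && c.end > start);
    // A real muxer would also emit explicitly empty samples for gaps
    // with no active cues; omitted here for brevity.
    if (active.length) samples.push({ start, end, cues: active.map(c => c.text) });
  }
  return samples;
}

const samples = cuesToSamples([
  { start: 1, end: 5, text: "A" },
  { start: 3, end: 7, text: "B" },
]);
console.log(samples);
// Resulting samples: 1-3 carries ["A"], 3-5 carries ["A", "B"]
// (cue "A" repeated), 5-7 carries ["B"] - no sample overlaps another.
```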
16:28:22 alicia: And it is very recent
16:28:34 cpn: I can imagine there are parties that take it and do custom TV implementations that
16:28:36 .. may have this.
16:28:48 alicia: Yes, which is why finding out about them would be very useful.
16:29:10 cpn: Have you studied the MSE algorithms?
16:29:18 .. It starts from the segment parser loop.
16:29:26 alicia: Many times, but I can't talk about them from memory!
16:29:38 cpn: That's where it starts, then it goes into coded frames, then the byte stream formats,
16:29:53 .. and from the byte stream format spec it refers to the sourcing in-band tracks document.
16:30:27 alicia: [shares screen showing ISO BMFF Byte Stream Format]
16:30:40 cpn: Nothing to mention in particular, I was just following the flow of specifications.
16:30:56 .. There's a note in the MSE spec that says that text track handling is handled through the format
16:31:11 .. registrations, and then here in this rather old document it explains how it works.
16:31:25 alicia: There was an effort, I haven't had time to go through it.
16:31:41 cpn: This contains a mixture of things that may be implemented and things that have never been implemented.
16:31:52 .. It's not clear to me which parts of this are still accurate, and which parts are not.
16:32:16 alicia: [Mapping Text Track content into text track cues] It mentions the 3GPP timed text format,
16:32:25 .. one of the most common text track formats for MP4, oddly enough.
16:32:49 .. It has a particularity where, if you have no cue showing up, that is coded as a cue with empty content,
16:32:58 .. and I was wondering how that would be coded into MSE.
16:33:10 cpn: You may well find a gap - reading this document and the byte stream document,
16:33:18 .. I don't know how well defined that is.
16:33:34 .. Can you trace through all the specs from MSE down to get to the exact steps you need to follow?
16:33:37 alicia: I should be able to
16:33:43 cpn: That's the goal, we should be able to do that.
16:33:59 alicia: Looking at this section because it looks like the most promising one to answer
16:34:03 .. the question about coded frames.
16:34:41 .. It mentions the "yet to be defined TTMLCue"!
16:34:55 cpn: That's what I meant, I'm not sure what's been implemented or even defined.
16:35:29 alicia: [Reads about TTML subtitle samples]
16:36:25 Nigel: The idea that you specify the start/end time of the intermediate document makes sense, but I'm not sure that maps to a single cue. The doc seems to assume that within that interval there's one cue defined, which is not a valid assumption
16:36:36 Alicia: That would be problematic, yes
16:36:48 s/intermediate document/TTML document
16:37:23 Present+ Eric_Carlson
16:38:39 Nigel: This document is old and doesn't match my current understanding of how things would work, especially for TTML
16:39:20 cpn: Coming back to Jer's very specific suggestion about a tweak to the coded frame algorithm.
16:39:32 .. He suggests amending step 14.
16:39:44 alicia: That would still be relevant no matter what we choose here, for the WebM case.
16:39:58 .. And it would be relevant for all formats if we decide that one cue is one coded frame.
16:40:24 Nigel: At the moment, coded frames never overlap temporally?
16:40:45 Alicia: The MSE spec does seem to assume the possibility of overlap. I found it a bit ambiguous, though
16:41:35 alicia: [opens the MSE draft]
16:41:41 cpn: There is a definition in MSE:
16:41:46 Chris: "A unit of media data that has a presentation timestamp, a decode timestamp, and a coded frame duration."
16:42:21 ... For video and text, the duration indicates how long the video frame or text SHOULD be displayed
16:42:45 alicia: That doesn't answer the question
16:42:48 cpn: Agree, it doesn't
16:43:03 .. The duration of how long a piece of text should be displayed is independent of that
16:43:15 alicia: You could imagine 3 coded frames, each with the same text and the same duration,
16:43:23 .. and that definition would still work the same.
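The ambiguity alicia points out can be shown with a toy example (all values invented): the MSE coded frame definition, which only requires a timestamp and a duration, cannot distinguish one cue carried as a single coded frame from the same cue split into three consecutive frames with identical text:

```javascript
// One cue as a single coded frame...
const asOneFrame = [{ pts: 0, duration: 3, text: "Hello" }];
// ...or the same cue as three consecutive frames repeating the text.
const asThreeFrames = [
  { pts: 0, duration: 1, text: "Hello" },
  { pts: 1, duration: 1, text: "Hello" },
  { pts: 2, duration: 1, text: "Hello" },
];
// Both satisfy "a presentation timestamp and a coded frame duration",
// and both describe the same on-screen result: "Hello" from t=0 to t=3.
function renderedInterval(frames) {
  const start = Math.min(...frames.map(f => f.pts));
  const end = Math.max(...frames.map(f => f.pts + f.duration));
  return [start, end];
}
console.log(renderedInterval(asOneFrame));    // → [0, 3]
console.log(renderedInterval(asThreeFrames)); // → [0, 3]
```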
16:43:56 alicia: It doesn't seem that the ISO BMFF byte stream spec mentions coded frames
16:44:03 cpn: No, it talks about segments
16:44:07 alicia: That's different
16:44:09 cpn: Yes
16:45:39 Nigel: It seems like we need a mapping of concepts or terms that can be applied across different specs. That people use different terms in different specs isn't helping
16:46:00 alicia: Yes, we're discussing this because the definition of MSE coded frames is insufficient.
16:46:11 Alicia: The definition of MSE coded frame and WebVTT cue in MP4 isn't clear
16:46:12 .. It doesn't explain the relationship to e.g. WebVTT cues in MP4.
16:48:21 cpn: I think step 2 is okay because it doesn't talk about the payload type, i.e. video/audio/text,
16:48:30 .. and it doesn't say if one coded frame is 1 cue or not
16:48:35 alicia: That makes sense for the base MSE spec
16:48:47 .. The problem is that the ISO BMFF Byte Stream Format doc doesn't tell us
16:49:04 .. what a coded frame is, but sends us to the unofficial draft, and that doesn't seem to answer it either.
16:49:23 cpn: The ISO BMFF doc also doesn't describe a coded frame for audio or video
16:49:32 alicia: ISO BMFF has the concept of sample
16:49:36 s/sample/samples
16:49:57 .. As far as I know coded frame is specifically an MSE term
16:50:02 cpn: Yes, I believe so
16:50:17 .. Does MSE define what a sample is?
16:50:22 alicia: I don't think so, but let me check
16:50:36 cpn: The ISO BMFF doc talks about sample - do we have a definition?
16:51:00 alicia: I see the word sample used in the ISO BMFF spec and it's all about PCM samples
16:51:09 Nigel: That doesn't sound right
16:51:18 cpn: In the MSE spec that's what sample refers to
16:51:22 Nigel: Oh, I see
16:51:39 cpn: It's correct for the MSE spec, but the ISO BMFF spec talks about samples in a different way
16:52:05 alicia: Yes, that's why I call the audio ones PCM samples and I might refer to the MP4 ones as MP4 samples.
16:52:07 cpn: Yes
16:52:27 Nigel: Makes sense to me to qualify the usage of the terms
16:53:27 cpn: [Reads from the byte stream spec]
16:54:01 .. Do VTTCues and TTMLCues always have a start time and an end time when they're encoded into the MP4,
16:54:14 .. or is there a case where the end time is not known, and gets set later?
16:55:29 Nigel: A sample in MP4 always has a start time and an end time. If you have a cue that lasts over many MP4 samples, it's repeated, then you have an MP4 sample that shows that it ends
16:56:03 ... For TTML, if every sample is 1 second and you have a single piece of text lasting 3.5 seconds, you'd see that text in seconds 0-1, 1-2, 2-3, 3-3.5
16:56:33 ... Then the document for the last one would effectively have more than one snapshot presentation in it: the thing you show for the first 0.5 seconds, then the latter
16:57:26 ... This means that if you have a situation where text is created in real time, your encoder has to deal with that, and there's a latency involved
16:58:11 ... For low latency applications there are schemes that let you deliver video frames before the end of the sample. But nobody is doing that for timed timed AFAIK
16:58:23 s/timed timed/timed text
16:58:41 cpn: I'm asking because there's this notion of a coded frame with a known duration.
16:58:50 .. Where do we go? We have 2 minutes of the call left.
16:58:57 .. I want to help you get to the bottom of all of this!
16:59:17 .. Given some people have dropped off, we are in a position where e.g. Chrome has no handling
16:59:31 .. of timed text in media containers that I know of, and I don't know that they want to develop or
16:59:34 .. implement that.
16:59:47 .. As a WG there's a question of interop that we want to get to, ideally, with this.
16:59:56 .. I think there are multiple questions.
17:00:10 .. One is: can we figure out the detail to make all of this consistent within your own implementation work,
17:00:20 .. and then separately, how do we bring this to more implementations in general.
17:00:30 .. That's something that my organisation is interested in getting us to,
17:00:39 .. we would love to get this more widely supported.
17:00:58 .. The difficulty is that engineers aren't motivated to figure it out if they're not implementing it.
17:01:19 alicia: Another question is whether it would be feasible to write a polyfill for appendBuffer
17:01:39 eric: I think she was saying that it might be possible to do a polyfill to make it easier
17:01:45 .. when there isn't native support.
17:01:53 alicia: Yes, that's what I was trying to say
17:01:57 cpn: Yes, that would help
17:02:28 alicia: One of the problems with the polyfill approach is how to make sure you don't get
17:02:37 .. both the polyfill text track and the browser text track
17:02:51 eric: We would need to figure out a way to feature detect, though I can't imagine
17:02:59 .. how we would support feature detection for text tracks in MSE.
17:03:16 alicia: The more general problem for this type of polyfill is something I noticed a few weeks ago
17:03:33 .. with some trailers from the iTunes store: the MP4 files have CEA-608 captions,
17:03:46 .. but the player in the page assumed they couldn't be rendered and instead it fetched a
17:03:52 .. separate out-of-band WebVTT track
17:04:09 .. You could imagine that if we get in-band text tracks working, then the many polyfills that
17:04:19 .. exist could conflict with the browser implementation.
17:04:34 eric: Right, a polyfill could detect when text tracks are added and disable them itself,
17:04:46 .. which is what the controls for the iTunes trailers should be doing but obviously aren't.
17:04:55 alicia: I don't think you support CEA-608 in MSE, do you?
17:05:15 eric: No, we do in MP4 and in transport streams, but not in MSE
17:05:27 .. That's via AVFoundation.
17:05:48 cpn: We are over time, but this is quite a valuable conversation.
17:05:53 .. What do people want to do?
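The appendBuffer polyfill idea alicia raises could be wired up roughly as below. This is a sketch only: `MockSourceBuffer`, `extractTextSamples`, and `polyfillTextTracks` are all invented names, the scanner is a stub (a real polyfill would demux the ISO BMFF segment), and as eric notes it would still need some form of feature detection to avoid duplicating a native text track:

```javascript
// Mock stand-in for a real MSE SourceBuffer (illustration only).
class MockSourceBuffer {
  constructor() { this.appended = []; }
  appendBuffer(data) { this.appended.push(data); }
}

// Stub scanner: a real polyfill would parse the container segment here
// and pull out text-track samples (e.g. WebVTT cues) before forwarding.
function extractTextSamples(segment) {
  return segment.textSamples ?? [];
}

// Wrap appendBuffer so every appended segment is also scanned for text
// samples, which are handed to a callback for out-of-band rendering.
function polyfillTextTracks(sourceBuffer, onCue) {
  const original = sourceBuffer.appendBuffer.bind(sourceBuffer);
  sourceBuffer.appendBuffer = (segment) => {
    for (const cue of extractTextSamples(segment)) onCue(cue);
    original(segment); // the media data still reaches the real buffer
  };
}

// Usage with the mock:
const sb = new MockSourceBuffer();
const captured = [];
polyfillTextTracks(sb, (cue) => captured.push(cue));
sb.appendBuffer({ textSamples: ["Hello"], media: "<bytes>" });
console.log(captured);           // → the intercepted cue, "Hello"
console.log(sb.appended.length); // → 1
```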
17:05:58 alicia: We've been talking for long enough
17:06:06 cpn: I would like concrete next steps
17:06:27 eric: Sounds like another meeting is needed, hopefully when I have a stable connection
17:06:45 cpn: We mentioned asking Cyril. Is there anyone else? DASH-IF may have some expertise here.
17:06:54 .. (Iraj at the time)
17:07:07 eric: Gary might, because of the polyfill work he did in his previous job
17:07:59 cpn: Let's try to get the right people together and reconvene then.
17:08:24 .. Having all this better defined would be a good thing
17:08:32 .. Happy to keep the conversation going to let us do that
17:09:44 nigel: Suggest offline planning.
17:09:55 .. Alicia, or anyone: if you have questions about TTML, please do ask me
17:10:07 eric: Apologies for missing most of the meeting
17:10:17 cpn: That's okay, these things happen. Glad you're here.
17:10:28 .. Meeting adjourned, thank you. Bye bye!
17:10:32 rrsagent, make minutes
17:10:33 I have made the request to generate https://www.w3.org/2025/06/17-mediawg-minutes.html nigel
17:15:49 rrsagent, make log public
18:55:40 Zakim has left #mediawg