Media WG – 17 June 2025

Meeting minutes

Accidental trimming of overlapping text cues

2025 Breakouts day recording

Slideset: https://ntrrgc.github.io/w3c-breakouts-2025-mse-text-tracks/

alicia: Unclear how many coded frames is a WebVTT Cue
… In WebVTT cues can overlap, and depending on the container format a cue might not equal one frame.
… This is the case for MP4.
… In WebM for example a cue is one frame because in WebM frames can overlap - the container allows it.
… The issue I reported, #363, is that depending on the interpretation a cue could get lost.
… I explained a case in which that would happen.
… Then Jer proposed to make a change to one of the frame processing steps to solve that particular issue.
… Then in the slides I had many other open questions.
… One of them has an open issue: how we could support embedded text tracks if we
… ever wanted to.

cpn: Was that issue #358?

alicia: It might be related but not the one I had in mind.
… #58 is old, from 2016!

w3c/media-source#58

<alicia> w3c/media-source#58 describes how we could handle CEA-608

alicia: This issue describes how we could handle CEA-608
… The one you mentioned Chris...

<alicia> Chris mentioned https://github.com/w3c/media-source/issues/358, which is a more generic issue about text track formats in MSE

cpn: #358 was actually Nigel's issue - the overall question of MSE and Text Track handling and interop.

alicia: We can discuss whichever of the many issues we have you want!

alicia: Who do we know who knows about text tracks in wrappers?

Nigel: I'm familiar with timed text formats, but it's more complicated in this context. Cyril would be good to include

Nigel: Cyril Concolato from Netflix

alicia: I also talk about this a bit in the @@@ hackfest
… regarding what should be considered a coded frame it would make sense to have
… consistency amongst containers.
… Defining that e.g. one cue is one frame if possible would make sense.

Nigel: Define "coded frame"? Containers talk about samples and segments etc

Alicia: So far one coded frame in MSE seems to correspond to one sample in MP4 or one Matroska block
… The question is if we still want that to be the case?

Nigel: If you think about a sample of a TTML payload, it wouldn't correspond to a single cue, it would be multiple cues. So imposing one sample = one coded frame = one cue would be the wrong way to go
… One thing that could theoretically be done is if the TTML payload is delivered as multiple ISDs, but that would be a new spec. There's no spec for ISDs in containers
… The nice thing is they don't overlap

Alicia: What you describe resembles how WebVTT works in MP4

Nigel: I thought with WebVTT in MP4 you deliver each cue once with begin/end times, but that can have overlaps

Alicia: Each sample has the data for all the cues, so cues are repeated when there are overlaps. So this can work fine
… Not sure how similar that is in TTML, I haven't studied that
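
A minimal sketch of the layout Alicia describes for WebVTT in MP4, where samples never overlap and each sample repeats every cue active in its interval. The type and function names are hypothetical, the cue timings are invented, and real files carry cue payloads rather than ids.

```typescript
// Hypothetical shapes for illustration; the names are not from any spec.
interface CueInterval {
  id: string;
  start: number; // seconds, inclusive
  end: number; // seconds, exclusive
}

interface FlattenedSample {
  start: number;
  end: number;
  cueIds: string[]; // every cue active during [start, end)
}

// Cut the timeline at every cue boundary, then list the cues active in each slice.
function flattenCues(cues: CueInterval[]): FlattenedSample[] {
  const boundaries: number[] = [];
  for (const cue of cues) {
    for (const t of [cue.start, cue.end]) {
      if (boundaries.indexOf(t) === -1) boundaries.push(t);
    }
  }
  boundaries.sort((a, b) => a - b);

  const samples: FlattenedSample[] = [];
  for (let i = 0; i + 1 < boundaries.length; i++) {
    const start = boundaries[i];
    const end = boundaries[i + 1];
    const cueIds = cues
      .filter((cue) => cue.start < end && cue.end > start)
      .map((cue) => cue.id);
    // A slice with no active cue is kept as an empty sample (a choice of this sketch).
    samples.push({ start, end, cueIds });
  }
  return samples;
}

// Cue A spans 0-4 s and cue B spans 2-6 s, so they overlap between 2 s and 4 s.
const flattenedDemo = flattenCues([
  { id: "A", start: 0, end: 4 },
  { id: "B", start: 2, end: 6 },
]);
// flattenedDemo: [0,2) carries A; [2,4) carries A and B; [4,6) carries B.
```

Under this layout a cue is no longer one sample: cue A appears in two samples, which is why "one sample = one coded frame = one cue" cannot hold for MP4 as it can for WebM.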

Nigel: The TTML model in MP4 is that the timeline is divided into a series of chunks, and there's a sample for each of those. Each is a document that describes what happens in the sample period
… So you have everything you need for the sample period.
… If there's any change to the appearance of the cues, that's one ISD

cpn: What would we like to focus on, 608/708, cues, or a broader conversation like TTML etc?

alicia: I'm okay with whatever, but one of the most important things is the definition of coded frame
… for text tracks, because it has lots of implications.
… Nigel, to clarify, the way things work in TTML in MP4, there are no changes of cues within one MP4 sample?

Nigel: No, there are changes
… I need to understand MSE coded frames better. TTML samples in MP4 don't overlap each other

Alicia: That's also the case for WebVTT samples in MP4. The one container where they do overlap is WebM
… I looked at source code, and found there isn't much support for overlapping cues in practice

cpn: Which implementations have you looked at?

alicia: WebKit / macOS, because it is the one that I know has shipped
… I also was given the hint that Opera might have shipped for TV players
… I have not been able to confirm that.

cpn: I think WebKit in terms of the mainstream desktop and mobile browsers is the only one that has this.

alicia: And it is very recent

cpn: I can imagine that there are implementations that take it and do custom TV implementations that
… may have this.

alicia: Yes, which is why finding out about them would be very useful.

cpn: Have you studied the MSE algorithms?
… It starts from the segment parser loop.

alicia: Many times but I can't talk about them from memory!

cpn: That's where it starts, then goes into coded frames, then the bytestream formats,
… and from the bytestream format spec it refers to the sourcing inband tracks document.

alicia: [shares screen showing ISO BMFF Byte Stream Format]

cpn: Nothing to mention in particular, I was just following the flow of specifications.
… There's a note in the MSE spec that says that Text Track handling is handled through the format
… registrations, and then here in this rather old document it explains how it works.

alicia: There was an effort, I haven't had time to go through it.

cpn: This contains a mixture of things that may be implemented and things that have never been implemented.
… It's not clear to me which parts of this are still accurate, and which parts are not.

alicia: [Mapping Text Track content into text track cues] It mentions the 3GPP timed text format,
… one of the most common track formats for MP4 oddly enough.
… It has a particularity where, if you have no cue showing up, that is coded as a cue with empty content,
… and I was wondering how that would be coded into MSE.

cpn: You may well find a gap - reading this document and the byte stream document,
… I don't know how well defined that is.
… Can you trace through all the specs from MSE down to get to the exact steps you need to follow?

alicia: I should be able to

cpn: That's the goal, we should be able to do that.

alicia: Looking at this section because it looks like the most promising to answer
… the question about coded frames.
… Mentions the "yet to be defined TTMLCue"!

cpn: That's what I meant, I'm not sure what's been implemented or even defined.

alicia: Reads about TTML subtitle samples

Nigel: The idea that you specify the start/end time of the TTML document makes sense, but I'm not sure that maps to a single cue. The doc seems to assume that within that interval there's one cue defined, which is not a valid assumption

Alicia: That would be problematic, yes

Nigel: This document is old and doesn't match my current understanding of how things would work, especially for TTML

cpn: Coming back to Jer's very specific suggestion about a tweak to the coded frame algorithm.
… He suggests amending step 14.

alicia: That would still be relevant no matter what we choose here, for the WebM case.
… And it would be relevant for all formats if we decide that one cue is one coded frame.
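
For context, a simplified model of the frame-removal rule in MSE coded frame processing, under the assumption that one cue is one coded frame. The rule is paraphrased from memory rather than copied from the spec, so check it against the spec text. The class and method names are hypothetical and the cue timings are invented. It is neither the specific case reported in #363 nor Jer's proposed amendment.

```typescript
// Hypothetical model of one track buffer; times are in seconds.
interface BufferedFrame {
  label: string;
  pts: number; // presentation timestamp
  end: number; // frame end timestamp (pts + duration)
}

class TrackBufferModel {
  buffered: BufferedFrame[] = [];
  highestEnd: number | null = null; // null models "highest end timestamp is unset"

  append(frame: BufferedFrame): void {
    if (this.highestEnd === null) {
      // Unset: remove buffered frames whose pts falls inside the new frame's interval.
      this.buffered = this.buffered.filter(
        (f) => !(f.pts >= frame.pts && f.pts < frame.end)
      );
    } else if (this.highestEnd <= frame.pts) {
      // Set and not after the new pts: remove frames with pts in [highestEnd, frame.end).
      const lower = this.highestEnd;
      this.buffered = this.buffered.filter(
        (f) => !(f.pts >= lower && f.pts < frame.end)
      );
    }
    // Otherwise (new frame starts before highestEnd) nothing is removed.
    this.buffered.push(frame);
    if (this.highestEnd === null || frame.end > this.highestEnd) {
      this.highestEnd = frame.end;
    }
  }

  // Models the spec unsetting the value, for example when the parser state is reset.
  startNewGroup(): void {
    this.highestEnd = null;
  }

  labels(): string[] {
    return this.buffered.map((f) => f.label);
  }
}

const bufferDemo = new TrackBufferModel();
bufferDemo.append({ label: "long cue", pts: 0, end: 10 });
bufferDemo.append({ label: "short cue", pts: 2, end: 5 }); // overlaps, nothing removed
const beforeReappend = bufferDemo.labels(); // both cues are buffered

bufferDemo.startNewGroup();
bufferDemo.append({ label: "long cue", pts: 0, end: 10 });
const afterReappend = bufferDemo.labels(); // only the long cue remains in this model
```

In this model, re-appending the long cue removes the short cue it overlaps, because the rule treats any frame starting inside the new frame's interval as replaced. That is sensible for video frames but not for cues that are meant to be shown at the same time.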

Nigel: At the moment, coded frames never overlap temporally?

Alicia: MSE spec does seem to assume possibility of overlap. I found it a bit ambiguous though

alicia: [opens the MSE draft]

cpn: There is a definition in MSE:

Chris: A unit of media data that has a presentation timestamp, a decode timestamp, and a coded frame duration.
… For video and text, the duration indicates how long the video frame or text SHOULD be displayed

alicia: That doesn't answer the question

cpn: Agree, it doesn't
… The duration of how long a piece of text should be displayed is independent of that

alicia: You could imagine 3 coded frames each with the same text and the same duration
… and that definition would still work the same.
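
One possible reading of that point, as a sketch: the quoted definition only fixes the timing fields, so a single 3-second frame and three consecutive 1-second frames with the same text both satisfy it and display the same thing. The type names, the `text` field and the timings are assumptions made for illustration.

```typescript
// Fields taken from the quoted definition, plus a text payload assumed for this sketch.
interface CodedFrameModel {
  presentationTimestamp: number; // seconds
  decodeTimestamp: number; // seconds
  duration: number; // seconds
  text: string;
}

interface ShownText {
  text: string;
  from: number;
  to: number;
}

// Work out what a viewer sees, merging back-to-back frames that repeat the same text.
function shownText(codedFrames: CodedFrameModel[]): ShownText[] {
  const sorted = codedFrames
    .slice()
    .sort((a, b) => a.presentationTimestamp - b.presentationTimestamp);
  const result: ShownText[] = [];
  for (const f of sorted) {
    const from = f.presentationTimestamp;
    const to = from + f.duration;
    const last = result[result.length - 1];
    if (last !== undefined && last.text === f.text && last.to === from) {
      last.to = to; // contiguous repeat of the same text: extend the interval
    } else {
      result.push({ text: f.text, from, to });
    }
  }
  return result;
}

const singleFrameDemo: CodedFrameModel[] = [
  { presentationTimestamp: 0, decodeTimestamp: 0, duration: 3, text: "Hello" },
];
const repeatedFramesDemo: CodedFrameModel[] = [
  { presentationTimestamp: 0, decodeTimestamp: 0, duration: 1, text: "Hello" },
  { presentationTimestamp: 1, decodeTimestamp: 1, duration: 1, text: "Hello" },
  { presentationTimestamp: 2, decodeTimestamp: 2, duration: 1, text: "Hello" },
];
// Both encodings show "Hello" from 0 s to 3 s.
```

Both inputs are valid sequences of coded frames under the definition, yet only the first has one frame per cue, so the definition alone does not settle how cues map to frames.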

alicia: It doesn't seem that the ISO BMFF byte stream spec mentions coded frames

cpn: No, it talks about segments

alicia: That's different

cpn: Yes

Nigel: It seems like we need a mapping of concepts or terms that can be applied across different specs. That people use different terms in different specs isn't helping

alicia: Yes, we're discussing this because the definition of MSE Coded Frames is insufficient.

Alicia: The definition of MSE coded frame and WebVTT cue in MP4 isn't clear

alicia: It doesn't explain the relationship to e.g. WebVTT Cues in MP4.

cpn: I think step 2 is okay because it doesn't talk about the payload type i.e. video/audio/text
… and it doesn't say if one coded frame is 1 cue or not

alicia: That makes sense for the base MSE spec
… The problem is that the ISO BMFF Byte Stream Format doc doesn't tell us
… what a coded frame is but sends us to the unofficial draft, and that doesn't seem to answer it either.

cpn: The ISO BMFF doc also doesn't describe a coded frame for audio or video

alicia: ISO BMFF has the concept of samples
… As far as I know coded frame is specifically an MSE term

cpn: Yes I believe so
… Does MSE define what a sample is?

alicia: I don't think so but let me check

cpn: The ISO BMFF doc talks about sample - do we have a definition?

alicia: I see the word sample used in the ISO BMFF spec and it's all about PCM samples

Nigel: That doesn't sound right

cpn: In the MSE spec that's what sample refers to

Nigel: Oh, I see

cpn: It's correct for the MSE spec but the ISO BMFF spec talks about samples in a different way

alicia: Yes, that's why I call the audio ones PCM samples and I might refer to the MP4 ones as MP4 samples.

cpn: Yes

Nigel: Makes sense to me to qualify the usage of the terms

cpn: [Reads from the byte stream spec]
… Do VTTCues and TTMLCues always have a start time and an end time when they're encoded into the MP4
… or is there a case where the end time is not known, and gets set later

Nigel: A sample in MP4 always has a start time and an end time. If you have a cue that lasts over many MP4 samples, it's repeated, then you have an MP4 sample that shows that it ends
… For TTML if every sample is 1 second and you have a single piece of text lasting 3.5 seconds, you'd see that text in seconds 0-1, 1-2, 2-3, 3-3.5
… Then the document for the last one would have effectively more than one snapshot presentation in it, the thing you show for the first 0.5 seconds, then the latter
… This means that if you have a situation where text is created in real time, your encoder has to deal with that, and there's a latency involved
… For low latency applications there are schemes that let you deliver video frames before the end of the sample. But nobody is doing that for timed text AFAIK
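
Nigel's arithmetic can be sketched as follows, assuming fixed-duration samples aligned to zero. The function and field names are hypothetical, and a real TTML sample is a whole document rather than a single interval.

```typescript
// One piece of a cue as it would appear inside one fixed-duration sample.
interface CueSlice {
  sampleStart: number; // seconds
  sampleEnd: number; // seconds
  visibleFrom: number; // when the text starts showing within this sample
  visibleTo: number; // when the text stops showing within this sample
}

// Split the interval [cueStart, cueEnd) across samples of length sampleDuration.
function sliceCueIntoSamples(
  cueStart: number,
  cueEnd: number,
  sampleDuration: number
): CueSlice[] {
  const slices: CueSlice[] = [];
  const first = Math.floor(cueStart / sampleDuration);
  for (let k = first; k * sampleDuration < cueEnd; k++) {
    const sampleStart = k * sampleDuration;
    const sampleEnd = sampleStart + sampleDuration;
    slices.push({
      sampleStart,
      sampleEnd,
      visibleFrom: Math.max(cueStart, sampleStart),
      visibleTo: Math.min(cueEnd, sampleEnd),
    });
  }
  return slices;
}

// A 3.5 s piece of text with 1 s samples lands in four samples.
const ttmlSlicesDemo = sliceCueIntoSamples(0, 3.5, 1);
// Visible intervals: 0-1, 1-2, 2-3 and 3-3.5; the last sample still spans 3-4.
```

The last slice is the case Nigel highlights: the sample covers 3 s to 4 s, but the text is visible only for its first half second, so that sample's document has to describe two presentation states.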

cpn: I'm asking because there's this notion of coded frame with a known duration.
… Where do we go, we have 2 minutes of the call left?
… I want to help you get to the bottom of all of this!
… Given some people have dropped off, we are in a position where e.g. Chrome has no handling
… of timed text in media containers that I know of, and I don't know that they want to develop or
… implement that.
… As a WG there's a question of interop that we want to get to ideally with this.
… I think there are multiple questions.
… One is can we figure out the detail to make all of this consistent within your own implementation work
… and then separately how do we bring this to more implementations in general.
… That's something that my organisation is interested in getting us to,
… we would love to get this more widely supported.
… The difficulty is that engineers aren't motivated to figure it out if they're not implementing it.

alicia: Another question is if it would be feasible to write a polyfill for appendBuffer

eric: I think she was saying that it might be possible to do a polyfill to make it easier
… when there isn't native support.

alicia: yes that's what I was trying to say

cpn: Yes that would help

alicia: One of the problems with the polyfill approach is how to make sure you don't get
… both the polyfill text track and the browser text track

eric: We would need to figure out a way to feature detect, though I can't imagine
… how we would support feature detect for text tracks in MSE.

alicia: A more general problem for this type of polyfill is something I noticed a few weeks ago
… with some trailers from the iTunes store: the MP4 files have CEA-608 captions,
… but the player in the page assumed they couldn't be rendered and instead fetched a
… separate out-of-band WebVTT track.
… You could imagine that if we get in band text tracks working then the many polyfills that
… exist could conflict with the browser implementation.

eric: Right, a polyfill could detect when text tracks are added and disable them itself,
… which is what the controls for the iTunes trailers should be doing but obviously aren't.
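
A minimal sketch of the approach Eric describes, in which a polyfill stands down once another caption or subtitle track shows up. `TrackLike` and `TrackListLike` are hypothetical stand-ins so the example runs without a browser. In a page, the video element's `textTracks` list and its `addtrack` event would play these roles. The rule for deciding when to yield is an assumption.

```typescript
// Minimal stand-ins for a text track and a track list; not the real DOM types.
interface TrackLike {
  kind: string; // e.g. "captions" or "subtitles"
  mode: string; // "disabled", "hidden" or "showing"
}

interface TrackListLike {
  addEventListener(
    type: "addtrack",
    listener: (ev: { track: TrackLike | null }) => void
  ): void;
}

// Disable the polyfill's own track as soon as any other caption/subtitle track is added.
function yieldToOtherTracks(list: TrackListLike, polyfillTrack: TrackLike): void {
  list.addEventListener("addtrack", (ev) => {
    const added = ev.track;
    if (added === null || added === polyfillTrack) {
      return; // ignore the polyfill's own track being added
    }
    if (added.kind === "captions" || added.kind === "subtitles") {
      polyfillTrack.mode = "disabled";
    }
  });
}
```

This only reacts after a track has been added. It does not answer the feature-detection question raised above, since a page still cannot know in advance whether appended text data will produce a track.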

alicia: I don't think you support CEA-608 in MSE, do you?

eric: No, we do in MP4 and in transport streams but not in MSE
… That's via AVFoundation.

cpn: We are over time, but this is quite a valuable conversation.
… What do people want to do?

alicia: We've been talking for long enough

cpn: I would like concrete next steps

eric: Sounds like another meeting is needed, hopefully when I have a stable connection

cpn: We mentioned asking Cyril. Is there anyone else? DASH-IF may have some expertise here.
… (Iraj at the time)

eric: Gary might because of the polyfill work he did in his previous job

cpn: Let's try to get the right people together and reconvene then.
… Having all this better defined would be a good thing
… Happy to keep the conversation going to let us do that

nigel: Suggest offline planning.
… Alicia, anyone, if you have questions about TTML, please do ask me

eric: Apologies for missing most of the meeting

cpn: That's okay these things happen. Glad you're here.
… Meeting adjourned, thank you. Bye bye!

– DRAFT –