W3C

– DRAFT –
Media Timed Events

20 September 2021

Attendees

Present
Alicia_Boya, Chris_Needham, Gary_Katsevman, Iraj_Sodagar, Louay_Bassbouss, Nigel_Megitt
Regrets
-
Chair
Chris
Scribe
cpn

Meeting minutes

<gkatsev> https://github.com/w3c/webvtt/issues/496#issuecomment-921999893

Unbounded cues in WebVTT

Gary: I posted some comments on the issue. For unbounded cues, one of the things we have concerns about is backwards compatibility
… If you have a new unbounded cue, what do old parsers do?
… Thinking about it, I'm leaning towards what Rob was saying, that there isn't a good way to represent unbounded cues in the old way
… You may not know ahead of time what the unbounded cues represent - e.g., a cue will never get an end time
… Or if it will get an end time, but don't know when
… Could be represented differently, for example: if you know it will never end, you could put 99 hours as the end time. Or if it does have an end time, copy the cue in small duration increments
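
[Scribe's note: the two fallbacks Gary describes could be sketched roughly as below, emitting plain WebVTT cue timings. The helper names and the 6-second chunk size are illustrative assumptions, not anything from the spec.]

```python
# Sketch of the two fallback representations for an "unbounded" cue
# using today's WebVTT syntax: a far-future sentinel end time, or
# re-emitting the cue in small fixed-duration increments.

def sentinel_cue(start: str, text: str) -> str:
    """Fallback 1: a sentinel end time (e.g. 99 hours) stands in for
    'no known end time'."""
    return f"{start} --> 99:00:00.000\n{text}"

def chunked_cues(start_s: float, text: str, chunk_s: float, count: int) -> list:
    """Fallback 2: repeat the cue in short increments until a real
    end time is known."""
    def ts(t: float) -> str:
        h, rem = divmod(t, 3600)
        m, s = divmod(rem, 60)
        return f"{int(h):02d}:{int(m):02d}:{s:06.3f}"
    return [
        f"{ts(start_s + i * chunk_s)} --> {ts(start_s + (i + 1) * chunk_s)}\n{text}"
        for i in range(count)
    ]
```
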

Nigel: Is that an argument for not needing anything more than what we have now?

Gary: For this use case (unbounded cues), it's fine if there's no way for them to show up in old parsers

Nigel: Looking at the use cases document?

Gary: I'm talking specifically about just being able to represent unbounded cues rather than a particular use case
… David asked about how likely this would go into WebVTT as MPEG would need to decide whether to keep their changes in or not
… Is it possible to have this in a narrow enough scope that this feature can be added now and expanded more later

https://github.com/w3c/media-and-entertainment/blob/master/media-timed-events/unbounded-cues.md

Chris: We haven't completed the use case list. Do we need them all to make progress on representation?

Gary: I don't think we need an exhaustive list of use cases; we can decide whether to ship a constrained feature, then expand it to cover all the uses we'd want
… If we agree it can be constrained enough without blocking other use cases, we could reasonably ship it
… Otherwise go back to MPEG and say we'll be unlikely to ship within their timeframe

Chris: What dependency does MPEG have on our work?

Gary: They're using the API, adding support for unbounded cues, but they realise there's no defined representation yet
… So they're relying on us to implement that before they ship their next spec.
… If we don't have a representation they'll pull their changes

Chris: Summarise Rob's proposal?

Gary: If a cue is unbounded, just update that end time, and not allow anything else to be updated
… That could be constrained enough that it doesn't prevent doing the other things should we want to
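
[Scribe's note: a minimal sketch of the constraint in Rob's proposal as summarised here: once created, an unbounded cue may only have its end time set, once. The class and method names are illustrative, not the VTTCue interface.]

```python
# Sketch: an unbounded cue where the only permitted mutation is
# assigning an end time (and only once); everything else is frozen.

class UnboundedCue:
    def __init__(self, start, text):
        self.start = start
        self.text = text
        self.end = None  # None means unbounded

    def set_end(self, end):
        if self.end is not None:
            raise ValueError("end time already set")
        if end < self.start:
            raise ValueError("end before start")
        self.end = end

    def update_text(self, text):
        # Disallowed under the constrained proposal.
        raise NotImplementedError("only the end time may be updated")
```
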

Chris: Use cases include updating a cue end time from unknown to known
… Also change cue end time from some known time to another known time
… And updating other cue attributes

Chris: Consistency across existing implementations?

Gary: They'd ignore the first cue with missing end time

Chris: Is that an acceptable fallback behaviour? Because if not, you need a marker value such as 99 hours to represent unbounded?

<Zakim> nigel, you wanted to ask if there's been any development on the data model, in terms of "does a VTTCue represent state or presentation?"

Nigel: Looking at the document, returning to the data model topic: what is a cue?
… In segmented delivery, where you keep sending small chunks and there may be repetition, you're not representing state; it's more "here's what to do for a period of time"
… In the updating a sports score use cases, it changes the use of a cue significantly. The cue payload includes some state, and the cue timing relates to the state
… So it seems fundamental that we should be clear about what it is. Any more consideration from that point of view? Is that a helpful way to think about it?

Gary: It's worth considering, but I haven't thought from that perspective that much
… One thing is that WebVTT currently has karaoke mode, though it's not implemented anywhere. In authoring you can say "show these words at this time for this cue"
… This seems to fall along those lines

Nigel: Is that the syntactical way of updating cues?

Gary: Yes, but no browser implements it, so may be removed
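
[Scribe's note: for reference, WebVTT karaoke styling uses inline cue timestamps inside one cue's payload, as in this sketch. The lyric text is made up; the `<hh:mm:ss.mmm>` timestamp tag syntax is from the WebVTT spec.]

```python
import re

# A single cue whose words are revealed progressively via inline
# cue timestamps (WebVTT "karaoke" styling).
KARAOKE_CUE = """\
00:00:01.000 --> 00:00:05.000
Never <00:00:02.000>gonna <00:00:03.000>give <00:00:04.000>you up"""

# Pull out the per-word activation times from the payload.
inline_times = re.findall(r"<(\d{2}:\d{2}:\d{2}\.\d{3})>", KARAOKE_CUE)
```
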

Nigel: It's a syntactic niceness. You could get that effect by repeating cues instead

Gary: Yes, but you get large VTT files. Would be nice to get karaoke mode for that reason

Nigel: It seems orthogonal. You're not updating an end time. The parser could update the payload

Gary: Live chapterisation is a question. We know when we change scenes we change the chapter. The sports score use case helped me better understand that
… The old marker needs to be updated

Nigel: In the chapterisation model, do you need to repeat the chapter information? In segmented media delivery you need to repeat it so you don't have to search back

Gary: Yes. I'm not sure that needs to be in the spec, but we'd want to have an answer for that
… It is possible for segmented media, you'd want to copy the unbounded cues over each time, potentially every segment

Chris: Segmented delivery has this issue regardless of unbounded cues. Is that defined anywhere?

Gary: It may not be defined anywhere, but it generally is repeated. We talked at FOMS about making a WebVTT Note saying that cues spanning multiple segments should be copied through as many segments as necessary until the cue ends
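
[Scribe's note: the copy-through-segments approach could be sketched as below: a cue that spans, or, if unbounded, outlasts, several segments is re-emitted in each segment, clipped to the segment boundaries. The function shape is an assumption for illustration.]

```python
# Sketch: which segments should carry a copy of a cue, and with what
# clipped timings. cue_end is None for an unbounded cue.

def cues_per_segment(cue_start, cue_end, segments):
    """segments: list of (seg_start, seg_end) pairs.
    Returns {segment index: (start, end)} for every segment that
    overlaps the cue."""
    out = {}
    for i, (s, e) in enumerate(segments):
        if cue_start < e and (cue_end is None or cue_end > s):
            out[i] = (max(cue_start, s), e if cue_end is None else min(cue_end, e))
    return out
```
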

Nigel: I don't think there's any presentation defined for chapters. So if you're providing content you'd have to understand the user experience you want.
… If you want a reasonable acquisition time for chapters you'd have to repeat it often enough. The client would have to understand it's the same chapter
… To my previous question: what is the VTTCue modelling? It seems to be enough information for the player to do what it needs to do, but not really modelling data changing over time
… So if you're referring to some data entity consistently from cue to cue, there'd need to be an external way to identify it.
… If the client needs a running view of the scoreline, it could. A VTT cue in your MP4 payload could be a "score" metadata cue with an id so you know it's a score
… and the cue with that id can be updated
… It's not the cue that represents the data that's changing, the cue is painting a state that's modelled in the application which updates the score
… So you don't need to set the cue as unbounded. For segmented delivery you know what's in the segment when you create it, so can use the segment timing for cue duration
… Then the application keeps its own model of the data and does whatever it needs to do
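
[Scribe's note: in the model Nigel describes, the cue only paints state that lives in the application. A rough sketch: "score" metadata cues carry an id and a payload, and the app refreshes its own model on each one. The "score" id and JSON payload shape are assumptions for illustration.]

```python
import json

# Sketch: the application, not the cue, holds the evolving state;
# each cue with a known id just refreshes the model for its time span.

class ScoreModel:
    def __init__(self):
        self.state = {}

    def on_cue(self, cue_id, payload):
        if cue_id == "score":
            self.state = json.loads(payload)
```
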

Gary: To support jumping into a live stream we have to chop the cue into multiple parts anyway, which is one of the fallback approaches for missing end time

Chris: So in the model Nigel described, we don't need unbounded cues as the cue timing is defined by the segment timing, and "unboundedness" is up to the application

Nigel: Yes

Gary: Would knowing that a cue is going to be repeated through multiple segments be useful to clients?

Nigel: As a data point, with VTT for captions if there's a requirement to teardown captions and rebuild them, a key UX point is that people don't want to see flicker
… Two ways to achieve it: maintain an identifier so there's a contract between the data provider and consumer, so a cue with ID 43 in one segment is promised to be the same as the cue with that ID in another segment
… The other technique is comparison of the payload, merge together if they're the same
… With the maintaining of IDs approach, need to define how that works across segments
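
[Scribe's note: the two flicker-avoidance techniques could be sketched as a single continuity check: match by id when both sides have one, otherwise fall back to payload comparison. The dict shape is illustrative, not the VTTCue interface.]

```python
# Sketch: decide whether a cue in the next segment is a continuation
# of one already displayed, so it isn't torn down and rebuilt.

def is_same_cue(prev, new):
    """prev/new: dicts with an optional 'id' and a 'text' payload."""
    if prev.get("id") and new.get("id"):
        return prev["id"] == new["id"]   # id contract across segments
    return prev["text"] == new["text"]   # payload-comparison fallback
```
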

Gary: Not defined in WebVTT

Nigel: We have the same problem with TTML, it's not defined

Gary: It could be a backwards compatibility issue. Although people don't use IDs in practice it may not be an issue

Nigel: So players would have to do comparisons and then do updates to the future state

Gary: A question is: in the live caption use case, if the content or cue settings differ but the id is the same, do you update the cue?

Nigel: Is the id scoped to the document? Use the begin and end time to work out what's visible at a given time
… The equivalent in the TTML model, where ids are scoped to the containing document, no claims are made about ids across segments
… In that case, do a model comparison between now and next (discounting ids). I favour having similar data models if we can

Gary: Makes sense to me

Chris: Would two segments be considered two different documents?

Gary: Right now, yes
… It sounds like what we're saying is that, for segmented WebVTT, if you have unbounded timed events that you want to represent, you copy the cues between segments so the user doesn't have to load all the captions from the beginning of time
… It's not that different from having a bunch of short cues; a signal that tells us a cue is going to be repeated isn't necessarily required
… The signal would be the cue has unbounded end time, that indicates it will be repeated, until it ends and is assigned an end time

Nigel: This reminds me of something related. In VTT and TTML, things have beginning and end times, so everything has a duration.
… Do we need a concept of a "moment" in time?

Gary: An example is ID3 metadata. Right now the way those are done is you get a start time, and the end time is the duration of the video or the start time of the next cue in the ID3 cue points track
… That's a workaround for representing a moment in time that uses a duration
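
[Scribe's note: the ID3 workaround Gary describes could be sketched as below: each "moment" has only a start time, so its cue's end is borrowed from the next moment's start, or from the media duration for the last one. The function shape is an assumption for illustration.]

```python
# Sketch: turn instantaneous "moments" into bounded cues by using the
# next moment's start (or the media duration) as each end time.

def moments_to_cues(moment_times, media_duration):
    """moment_times: sorted start times; returns (start, end) pairs."""
    ends = list(moment_times[1:]) + [media_duration]
    return list(zip(moment_times, ends))
```
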

Nigel: [looking at WebVTT karaoke mode]
… The idea is to indicate a per-word start time with no duration. I think that's missing from the general timed text data model?
… People have suggested use cases where that could be useful
… You could have a metadata tag repeated in each segment, duration matches the segment. But what there isn't is an initial definition of the state, and then at a given moment here's a new state
… Reminded of a demo where the idea was to change the number of words shown at once based on the dynamic display. Using timestamps instead of durations would allow you to customise that from a user perspective
… Could be more interesting than the idea of having unbounded cues, but it's a very different way of doing things. In the same way TTML doesn't support timed metadata without being attached to something in the document such as a div

Chris: Could do a separate out of band query to get the complete state at a moment in time?

Chris: Next steps?

Gary: A follow up meeting when Rob can join.
… We need to decide what to tell David on timescale

Nigel: I don't think we've identified a strong need to make a change from today's discussion, it just moves the logic elsewhere

Gary: Saying we don't think it'll go in right now is reasonable?

Chris: Do we miss an opportunity?

Gary: It may be another year or so. We shouldn't close the door forever, but it will severely delay it

Nigel: But that's on the basis that nothing is missing now. If we don't have use cases that can't be fulfilled now, less change is good

Chris: We've talked about this for timed metadata, what about also for captions?

Gary: Immediate use case doesn't require a spec change it seems

Nigel: If there is another model for live captions that people need, where it's firing a caption directly and minimising repetition, that would need more discussion
… Possibly in an RTC scenario

Gary: We did get a question around it for live captioning recently https://github.com/w3c/webvtt/issues/320#issuecomment-917386887

Next meeting

Gary: Next week if possible, assuming Rob can make it

Chris: OK

[adjourned]

Minutes manually created (not a transcript), formatted by scribe.perl version 136 (Thu May 27 13:50:24 2021 UTC).
