W3C

– DRAFT –
Media Timed Events / Unbounded VTT Cues

27 September 2021

Attendees

Present
Chris_Lorenzo, Chris_Needham, Gary_Katsevman, Kaz_Ashimura, Nigel_Megitt, Rob_Smith, Xabier_Rodriguez_Calvar
Regrets
-
Chair
Chris_Needham
Scribe
cpn

Meeting minutes

Unbounded cues in WebVTT

Gary: At the last meeting, we concluded that the way things are now, there's no benefit to having unbounded cues in the text format

Gary: The reason is seeking to the middle of a stream: as far as we can tell, you'd have to copy each unbounded cue into every VTT segment. Otherwise you'd have to load all the text tracks since the beginning of time, which isn't reasonable

Rob: I've read the minutes from last time, and the discussion about updating cue attributes
… There's a requirements document

https://github.com/w3c/media-and-entertainment/blob/master/media-timed-events/unbounded-cues.md

<gkatsev> pr with updates to unbounded cues

Rob: From an unbounded cues point of view, requirement 1a, is what unbounded cues do
… There was some discussion about changing other cue attributes, but I don't think we had any use cases for that. Has that changed?

Gary: I think the main thing with other attributes is that it's probably fine if narrowly scoped, but the worry is about preventing extension in the future to allow cues to be updated
… If we narrow the use case to updating unbounded cues to be bounded, so only the end time is set and nothing else changes, that could be restrictive enough to not be an issue

Rob: I'd generally agree. The scope should be limited as to what can be changed. There aren't use cases for changing other attributes
… We shouldn't rule it out

Gary: There are some for live captioning, but it's too early

Rob: Can be done with unbounded cues, in a different way where the update is done as a new cue, linked at a higher level. It's implementation-specific how to do that
… This comes back to matching by start time and content, which would allow content to be updated
… I'd argue that changing the start time or content should be a new cue, rather than changing an existing cue
… from the point of view of the VTT file format
… There isn't a mechanism to change existing cues. I don't think the syntax supports updating cues currently

Nigel: The discussion last time didn't identify a reason to do it
… We don't have a delivery mechanism in WebVTT. For video we have segments and we can bound the VTT cue time to the segment interval
… If necessary repeat the cues. Then you don't run into acquisition issues doing that
… Having some kind of external model, if you need to update state of a metadata entity, you can do that in segmented delivery in the same way
… Updating from chapter 1 to chapter 2 without needing to hunt back for old cues

Rob: I agree, we don't want to have to look back. I didn't understand what you meant by metadata; do you mean a (time, value) pair?

Nigel: The entity you're modelling has a lifecycle, which is application specific

Rob: Are you treating changes as instantaneous events at a point in time, or as a value with duration?

Nigel: The information you send can be bounded to an interval, but what it's about can be changing in time
… The cue has a duration, but the entity it describes may not have the same duration. It's a model maintained in the client application

Rob: For a single segment that starts at chapter 1 then chapter 2. Is there an instantaneous event that says "chapter 2 starts now"?

Nigel: In that application, I'd probably build it by saying there's always an active cue that describes the current chapter

Rob: WebVMT supports that, values can be set in an interval and unset at a later time
… Unbounded cues allow that to be solved

Chris: Would it help to write this down?
… A worked example could be helpful
… There's an open PR to the use case document: https://github.com/w3c/media-and-entertainment/pull/77/files

Gary: This describes the sports score example, and live captioning

Chris: The requirements in the document might not be useful

Gary: You should always represent unbounded cues as multiple bounded cues; the unbounded-ness isn't in the cues themselves
… You might overrun by a second when the cue becomes bounded
… WebVTT gets delivered a segment at a time. You can make cues be the duration of the segment. By the time you're ready to deliver the segment you can set the actual end time rather than the end-of-segment time
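
As a sketch of the delivery model Gary describes (the times and the chapter payload are illustrative assumptions, not from the meeting): a logically unbounded cue is carried in each segment as a bounded cue whose end is provisionally the segment boundary.

```
WEBVTT

NOTE segment1.vtt, covering 00:00-00:06: the cue's end time is
set to the segment boundary because the real end is not yet known

00:00:02.000 --> 00:00:06.000
{"chapter": "Chapter 1"}
```

In the next segment the same cue would be repeated with the same start time; once the chapter actually ends, the copy in that segment carries the real end time (say, `00:00:02.000 --> 00:00:09.500`) instead of the boundary time, so a client never needs to look back past the current segment.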

Rob: I agree, the issue is that the unbounded cue has an unknown end time, so you're setting a bounded cue with a known end time
… so the problem comes if you set a bounded cue that extends beyond the current segment

Gary: Yes, you're not sending ahead of time

Nigel: For live, the content is segmented

Gary: The fragmented MP4 is basically the same, with small chunks

Rob: That differs from the measurement observation use case. If you take a temperature measurement now, but you don't know when the next one will be
… When the next one arrives, you can update it to the next value
… In a live case, just use the last known value. But when you re-play it, you can interpolate from the last to the next value
… Makes it simple for implementations, just take a sequence of time values

Gary: The way I'd represent a temperature measurement in WebVTT is each measurement covering the preceding period of time, looking back instead of looking forward

Rob: In the case I described, there's no need to look back

Gary: Some people use cues with same start and end time to represent an event, maybe that's the answer

Rob: That's the way WebVMT deals with discontinuities in the data, so if there's a break in the data, where there's no value, make an instantaneous cue to say there's no data
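
As an illustration of the zero-duration idea (the payload is an illustrative assumption, not from the meeting): a cue whose start and end times are equal marks an instantaneous event, here a break in the data.

```
WEBVTT

NOTE A zero-duration cue marking an instantaneous event:
no data is available from this point onwards

00:01:30.000 --> 00:01:30.000
{"data": null}
```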

Gary: If you seek to the middle of the video, how do you know the state? Do you need to parse all the history?

Rob: Yes

Gary: That's a requirement we're trying to avoid
… With segmented WebVTT it'll parse just the current segment rather than any previous segments

Rob: You could solve that with unbounded cues. If you're assembling segments retrospectively and there's an active unbounded cue, it's reasonable to assume it's still active in the next segment

Gary: Yes, and we concluded that you'd have to copy cues from segment to segment
… In the discussion, having that extra signal of unboundedness wasn't adding much as you have to copy the cues between segments

Gary: You'll either know the cue ends within the segment, or it ends at the same time as the segment

Rob: So it seems unbounded cues can be handled using bounded cues in segmented WebVTT

Chris: Does the client coalesce the cues into a contiguous long cue?

Gary: It doesn't. From a previous FOMS discussion, we talked about writing a note to describe avoiding flicker in rendering
… That may be something we want to do as part of this work

Rob: That could be on a per-use case basis

Gary: It's not specific to the format, it's about player implementations, so for a Note instead of in the spec

Rob: For timed metadata you wouldn't want it to repeat

Nigel: Good point. The ability to say that a cue is the same as a previous one is orthogonal but could be worth looking at
… You need a contract between producer and consumer of the files
… All of the specs at the moment only define well-formedness in a single file, not across multiple files

Chris: Rob, what was your understanding of live distribution?

Rob: For WebVMT, if you have recordings from a sensor on a resource limited device, send readings as they're taken. Unbounded cues help, because you don't know when the next reading will be
… So being able to supersede a value with a new value, recorded such that you can interpolate in playback
… If you record an unbounded cue at time A with a value, you can supersede it with another cue at time B, using an identity to link those two things together
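
One way to sketch Rob's supersede-by-identity idea is with WebVTT cue identifiers (the optional line before the timings). Everything here is hypothetical: the readings, the `supersedes` field, and the JSON payloads are illustrative assumptions, written in WebVTT rather than WebVMT syntax.

```
WEBVTT

NOTE Hypothetical: the second reading names the first via a
"supersedes" field, so a player can link the two cues and
interpolate between the values on replay

reading-1
00:00:10.000 --> 00:00:40.000
{"temperature": 18.5}

reading-2
00:00:40.000 --> 00:01:10.000
{"temperature": 19.1, "supersedes": "reading-1"}
```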

Chris: Is there an example we can look at?

Rob: It's an open item to add that. It's been discussed but not added to the document

Chris: Also with WebVMT, use WebSockets for live delivery?

Rob: Yes

Chris: Let's follow up on that on another call

Gary: So we can confirm to David that no syntax changes are to be made, and for the unbounded case you copy cues between segments, and they're bounded by segments

<RobSmith> WebVMT live interpolation examples: https://github.com/webvmt/community-group/issues/2#issuecomment-708529659

<Zakim> gkatsev, you wanted to ask about response to David and webvmt/webvtt alignment?

Chris: So we have:
… 1. A proposed model for delivering unbounded cues in segmented VTT
… 2. A client processing model to describe how cues are coalesced (write as a Note)
… 3. How to identify cues across segment boundaries?
… 4. (possibly) live delivery over WebSockets or other non-segmented media delivery

Chris: Would MPEG also need to have a solution for identifiers across segments?

Gary: I don't think so, at this stage

Chris: What next for this group?

Gary: Consider whether to adopt some WebVMT syntax changes into WebVTT. I'm unsure, but it's an interesting topic

Rob: I'll need to look into live streams

Gary: Is the idea that you merge documents client-side from multiple streams?

Rob: Yes, there's a video stream and a VMT stream. The way it currently works, a drone embeds metadata into the MPEG file, so there's a postprocessing step to export that into WebVMT
… That's the main case I've been looking at so far. The live case would also be interesting

Gary: Longer term, useful to think about live captioning and potential for updating cues, e.g., a stenographer who wants to correct text already sent

Rob: Or voice recognition. You can mis-hear things and then go back and correct, with additional later context

Gary: 608 captions have some ability to do that

Next meeting

Chris: TPAC is coming up. Could meet on 11th?

Kaz: I can't make the 11th, but you can go ahead

Chris: I'll send an invite

[adjourned]

Minutes manually created (not a transcript), formatted by scribe.perl version 136 (Thu May 27 13:50:24 2021 UTC).