25693 – Need an event to determine when cues have been added to a TextTrack

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.


Status:	RESOLVED WONTFIX

Alias:	None

Product:	WHATWG
Classification:	Unclassified
Component:	HTML
Version:	unspecified
Hardware:	PC All

Importance:	P2 enhancement
Target Milestone:	Unsorted
Assignee:	Ian 'Hixie' Hickson
QA Contact:	contributor

Reported:	2014-05-13 21:00 UTC by Brendan Long
Modified:	2017-07-21 11:02 UTC
CC List:	13 users

Description Brendan Long 2014-05-13 21:00:09 UTC

When streaming live media, cues can be added to a TextTrack mid-playback. In order to attach events like onenter and onexit, there needs to be a way to detect these cues.

Currently this is only important for in-band text tracks, but presumably live WebVTT will exist eventually.

I propose changing the TextTrack interface to add:

    attribute EventHandler oncueadd;

And also adding some sort of event interface ("CueEvent", "CueListEvent", whatever), with an IDL like:

    interface CueEvent : Event {
        readonly attribute TextTrackCue[] cues;
    };

Where "cues" is a list of added cues. Alternatively, we could just list one cue, but that might cause us to fire a bunch of events at once (not sure if that's considered bad?).

This event should be fired when a new cue is added by the UA. I'm not sure if it should be fired when a cue is added through JS (TextTrack.addCue()). We should do whatever would be consistent with the rest of the spec.

Use-cases:

  * Ad insertion: We receive an event saying, "Switch to this ad at this time". We want to know as soon as we receive these events so we can start buffering the ad, and set an onenter event to switch to it.
  * Any other cases where someone would use oncueenter or oncueexit, with a live stream.
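
A minimal sketch of how a page might consume the proposed event, assuming the list-based `CueEvent` shape described above. `MockTextTrack`, `uaAddCues`, and `prepareAdSwitch` are hypothetical stand-ins so the sketch runs outside a browser; they are not part of any spec or implementation.

```javascript
// Hypothetical event carrying the list of newly added cues.
class CueEvent extends Event {
  constructor(type, cues) {
    super(type);
    this.cues = cues;
  }
}

// Hypothetical stand-in for a TextTrack that fires the proposed "cueadd".
class MockTextTrack extends EventTarget {
  constructor() {
    super();
    this.cues = [];
  }

  // Simulates the UA discovering new in-band cues mid-playback.
  uaAddCues(newCues) {
    this.cues.push(...newCues);
    this.dispatchEvent(new CueEvent("cueadd", newCues));
  }
}

const track = new MockTextTrack();
const prepared = [];

// Hypothetical ad-insertion hook: start buffering now, switch on enter.
function prepareAdSwitch(cue) {
  prepared.push(cue.id);
  cue.onenter = () => console.log("switching to ad for cue " + cue.id);
}

track.addEventListener("cueadd", (event) => {
  for (const cue of event.cues) {
    prepareAdSwitch(cue);
  }
});

// Two cues arrive at different points in the live stream.
track.uaAddCues([{ id: "ad-1", startTime: 30, endTime: 60 }]);
track.uaAddCues([{ id: "ad-2", startTime: 90, endTime: 120 }]);
```

The handler never inspects `track.cues` directly, so it sees each cue exactly once, however the cues are grouped into events.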

Comment 1 Brendan Long 2014-05-13 21:04:36 UTC

Oh, and I'm not sure whether we should fire this when the track first loads, or only for new cues added after the "canplay" event. This should also be handled in whatever way would be consistent with the rest of the spec.

Comment 2 Ian 'Hixie' Hickson 2014-08-04 22:07:27 UTC

This seems reasonable to me. Implementors, do you wish to implement this?

Comment 3 Philip Jägenstedt 2014-08-05 08:46:30 UTC

Fine by me.

Comment 4 Brendan Long 2014-08-28 22:47:40 UTC

I'll do a WebKit implementation and see what their reviewers think.

Comment 5 Brendan Long 2014-09-04 22:04:42 UTC

Here's my WebKit implementation:

https://bugs.webkit.org/show_bug.cgi?id=136550

I'll try to get some feedback from WebKit reviewers.

I chose to fire an event every time a cue is added, even if it's before the "load" event has fired. This made the implementation (and testing) much easier.

Comment 6 Ian 'Hixie' Hickson 2014-09-05 17:11:04 UTC

Which 'load' event? Do you mean during parsing of the VTT file? I thought we specced that as atomic.

Comment 7 Brendan Long 2014-09-05 17:20:49 UTC

(In reply to Ian 'Hixie' Hickson from comment #6)
> Which 'load' event? Do you mean during parsing of the VTT file? I thought we
> specced that as atomic.

The HTMLTrackElement's "load" event. I was considering not firing "cueadd" events for cues added before the track is fully loaded, but it's less confusing to always fire them (so users don't need to listen to both "load" and "cueadd" to get all of the cues).

By "load" being atomic, do you mean that "cueadd" should only fire if the track loads without errors? We could potentially buffer these and only fire them once we reach the LOADED state.

Comment 8 Ian 'Hixie' Hickson 2014-09-08 17:01:57 UTC

I thought that as specced, a VTT file's cues were all added to the TextTrack at once, such that script could never observe an incomplete representation of a VTT file.

Comment 9 Brendan Long 2014-09-08 17:18:24 UTC

(In reply to Ian 'Hixie' Hickson from comment #8)
> I thought that as specced, a VTT file's cues were all added to the TextTrack
> at once, such that script could never observe an incomplete representation
> of a VTT file.

Hm, right now WebKit seems to add the cues in chunks. If we don't want that to happen, I can make it build up a "cueadd" event to fire for each VTT file.

Comment 10 Ian 'Hixie' Hickson 2014-09-19 20:41:35 UTC

No, incremental loading is fine I guess. The HTML spec is agnostic on the subject, and I guess the VTT spec does imply that they are added incrementally.

So what did you implement exactly? You fire a 'cueadd' event on a TextTrack object each time its 'cues' interface has a cue added? Do you coalesce? What did you go with on the event interface side? Do you queue a task to do this or do you do it synchronously with the add in the same task as the one handling the parse? Do you do it for addCue() also?

It looks like from your test file that you make assumptions about the TCP packet boundaries, which seems dodgy. If we're parsing VTT incrementally, I don't think we can assume that the whole VTT file will be parsed at once, with one cueadd event. We should probably just have one cueadd event per cue, so that we don't expose packet boundaries.

Comment 11 Brendan Long 2014-09-22 20:21:16 UTC

(In reply to Ian 'Hixie' Hickson from comment #10)
> So what did you implement exactly? You fire a 'cueadd' event on a TextTrack
> object each time its 'cues' interface has a cue added?

Yes.

> Do you coalesce?

No, but if I get a bunch of cues at once (like when cues are read from a WebVTT file), I send them all in one event. I was assuming the spec would allow but not require coalescing, but maybe having multiple cues per event is a bad idea anyway.

> What did you go with on the event interface side? Do you queue a task to do this
> or do you do it synchronously with the add in the same task as the one
> handling the parse?

The event is asynchronous.

> Do you do it for addCue() also?

Yes.

> It looks like from your test file that you make assumptions about the TCP
> packet boundaries, which seems dodgy. If we're parsing VTT incrementally, I
> don't think we can assume that the whole VTT file will be parsed at once,
> with one cueadd event.

Yes, the test is flawed. I think it's probably easier to just do one-event-per-cue than to fix the test (although it could be done, by listening for events until we see all of the cues).

> We should probably just have one cueadd event per
> cue, so that we don't expose packet boundaries.

I'm fine with this approach. It's much easier to implement, and probably easier to handle on the JS side too.

Comment 12 Brendan Long 2014-09-23 16:40:19 UTC

I updated my WebKit implementation to send one CueEvent per cue.

Comment 13 Philip Jägenstedt 2014-10-08 12:20:35 UTC

One event per added cue sounds like a lot of overhead. Especially if you "queue a task" to fire the cueadd event, you can't optimize that away since the event listener may be added after the task to fire it was enqueued.

Having the task that runs the WebVTT parser synchronously fire the event would be better in that regard, but if there is a listener that's still a lot of events to fire.

A per-TextTrack list of newly added cues seems simple in principle. If the list is empty and you add a cue, queue a task to fire the event, otherwise there must be a pending task already. If we want to be really lazy we could just expose it as readonly attribute TextTrackCueList? addedCues; to avoid a new event interface.
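
The pending-list idea could be sketched roughly as follows. This is only an illustration under stated assumptions: `CoalescingTrack` and the injected `queueTask` callback are hypothetical, and `addedCues` is modeled as a plain array rather than a `TextTrackCueList`.

```javascript
// One queued task per batch of added cues. `queueTask` is injected so the
// sketch is deterministic; a UA would use its own task queue instead.
class CoalescingTrack extends EventTarget {
  constructor(queueTask) {
    super();
    this.cues = [];
    this.addedCues = []; // hypothetical attribute from the comment above
    this.queueTask = queueTask;
  }

  addCue(cue) {
    this.cues.push(cue);
    const needsTask = this.addedCues.length === 0;
    this.addedCues.push(cue);
    if (needsTask) {
      // An empty list means no task is pending, so queue exactly one.
      this.queueTask(() => {
        this.dispatchEvent(new Event("cueadd"));
        this.addedCues = []; // reset so the next add queues a new task
      });
    }
  }
}

// Manual task queue standing in for the event loop.
const tasks = [];
const track = new CoalescingTrack((task) => tasks.push(task));

const batches = [];
track.addEventListener("cueadd", () => {
  batches.push(track.addedCues.map((cue) => cue.id));
});

track.addCue({ id: "a" });
track.addCue({ id: "b" }); // no second task: one is already pending
tasks.shift()(); // run the single pending task
track.addCue({ id: "c" });
tasks.shift()();

console.log(JSON.stringify(batches)); // [["a","b"],["c"]]
```

How many cues land in each batch depends on when the queued task happens to run relative to the adds.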

Sorry if we've already been over these options, it's been a while...

CC Eric Carlsson.

Comment 14 Ian 'Hixie' Hickson 2014-10-08 17:05:29 UTC

The problem with buffering them is it makes the exact events unpredictable (e.g. depends on network latency, bandwidth, and local CPU speed and load). That's a recipe for interop issues.

Comment 15 Brendan Long 2014-10-08 19:47:38 UTC

(In reply to Philip Jägenstedt from comment #13)
> One event per added cue sounds like a lot of overhead. Especially if you
> "queue a task" to fire the cueadd event, you can't optimize that away since
> the event listener may be added after the task to fire it was enqueued.

It seems like in most cases, this overhead wouldn't be very significant, since cues don't happen very often. For example, in a normal video with cues every few seconds, we would expect every CueEvent to only have one cue, since they're too far apart to coalesce.

The three cases I can think of where this might matter are:

 1. When initially loading a static WebVTT file. On my machine, firing a bunch of cueadd events doesn't seem to affect the loading time at all. Do we expect this to be significant, maybe on phones trying to play extremely long videos? Could we handle this better by letting websites split up WebVTT files, so clients only need to look at the next x minutes of cues at once?
 2. In CEA-708, the timing information doesn't map well to WebVTT, so one solution (firing a bunch of short WebVTT cues until the CEA-708 cue ends) could lead to several cues per second (4-10 probably, depending on how accurate we want the timing to be). This case seems unlikely to affect most people though, and even in the worst case, would 10 cues per second be significant compared to 24 frames per second of video?
 3. Cues could also contain data which is meant to be handled by JavaScript, so theoretically there could be a lot of cues if the JavaScript application needs them. Do we expect anyone to need tens or hundreds of cues per second?

Comment 16 Ian 'Hixie' Hickson 2014-10-14 23:30:31 UTC

Case #1 is the most common case, right?

One option would be to explicitly make this depend on timing rather than packet sizes: if it's been more than a second since the last time the event fired, then batch up all the cues received and send it. It'll still be unreliable from machine to machine, but it'll be based on performance rather than network packets, so mildly less likely to cause problems. The question is, are there use cases where you need less than a second of latency for cues?
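
A rough sketch of the time-based batching described above, assuming a one-second interval and an injected clock so the behavior is reproducible. `TimedBatcher`, `flushIfDue`, and `batchSizes` are hypothetical names, and the arrival times are made up for illustration.

```javascript
// Batches cues and fires at most once per interval.
class TimedBatcher {
  constructor(intervalMs, onBatch) {
    this.intervalMs = intervalMs;
    this.onBatch = onBatch;
    this.pending = [];
    this.lastFired = -Infinity;
  }

  // Called whenever the parser produces a cue; `now` is in milliseconds.
  receive(cue, now) {
    this.pending.push(cue);
    this.flushIfDue(now);
  }

  // Would also be called from a timer so trailing cues are not stranded.
  flushIfDue(now) {
    if (this.pending.length > 0 && now - this.lastFired >= this.intervalMs) {
      this.lastFired = now;
      this.onBatch(this.pending);
      this.pending = [];
    }
  }
}

// Returns the size of each batch for a given set of arrival times.
function batchSizes(arrivalTimesMs) {
  const sizes = [];
  const batcher = new TimedBatcher(1000, (cues) => sizes.push(cues.length));
  arrivalTimesMs.forEach((t, i) => batcher.receive({ id: "c" + i }, t));
  // Final timer tick one interval after the last cue arrived.
  batcher.flushIfDue(arrivalTimesMs[arrivalTimesMs.length - 1] + 1000);
  return sizes;
}

// The same five cues, parsed quickly versus slowly.
const fast = batchSizes([0, 10, 20, 30, 40]);
const slow = batchSizes([0, 600, 1200, 1800, 2400]);
console.log(JSON.stringify(fast), JSON.stringify(slow)); // [1,4] [1,2,2]
```

The same cues are grouped differently on the two simulated machines, which is the machine-to-machine unreliability mentioned above.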

Comment 17 Brendan Long 2014-10-15 14:47:55 UTC

(In reply to Ian 'Hixie' Hickson from comment #16)
> Case #1 is the most common case, right?

For now. If that's all we're concerned about, we could use the list-of-cues version of the event, and just require that an entire WebVTT file be exposed as one event, no matter how long it is. We'll presumably have to make an exception when/if WebVTT supports live streams though.

> One option would be to explicitly make this depend on timing rather than
> packet sizes: if it's been more than a second since the last time the event
> fired, then batch up all the cues received and send it. It'll still be
> unreliable from machine to machine, but it'll be based on performance rather
> than network packets, so mildly less likely to cause problems.

Why would this be any better? It's still arbitrary, so why not just let it be arbitrary? Really, the only difference between the one-event-per-cue and list-of-cues cases is:

    function handleCueAdd(event) {
        doSomethingWith(event.cue);
    }
    // or
    function handleCueAdd(event) {
        for (var i = 0; i < event.cues.length; ++i) {
            doSomethingWith(event.cues[i]);
        }
    }

From an author's perspective, I don't think it matters at all how long the list in a particular event is.

> The question
> is, are there use cases where you need less than a second of latency for
> cues?

For CEA-708, the cues can occur pretty close to when they're supposed to be displayed. I think we might have over a second though. I'd be more concerned about non-traditional uses of cue data (arbitrary data). Who knows how people will want to use it?

Comment 18 Ian 'Hixie' Hickson 2014-10-15 20:16:43 UTC

The problem is that authors will write code like:

   m.oncueadd = function (event) {
     m.oncueadd = null;
     doSomethingWith(event.cues[3]);
   }

...and this'll work fine in testing, and then one day the server sends only 2 cues in the first TCP packet, and it breaks.
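
To make the failure mode concrete, here is a small simulation under assumed conditions: a hypothetical UA that fires one list-based event per network read, with the same five cues delivered either in one chunk or in two. `deliver`, `brittle`, and `robust` are illustrative names only.

```javascript
// Hypothetical list-based event, as in the proposal.
class CueEvent extends Event {
  constructor(cues) {
    super("cueadd");
    this.cues = cues;
  }
}

const allCues = ["c0", "c1", "c2", "c3", "c4"].map((id) => ({ id }));

// Fires one event per chunk, mimicking a UA that batches by network read.
function deliver(chunks, makeHandler) {
  const track = new EventTarget();
  const seen = [];
  track.addEventListener("cueadd", makeHandler(seen));
  for (const chunk of chunks) {
    track.dispatchEvent(new CueEvent(chunk));
  }
  return seen;
}

// The brittle pattern: only the first event is read, by fixed index.
const brittle = (seen) => {
  let done = false;
  return (event) => {
    if (done) return;
    done = true;
    seen.push(event.cues[3] ? event.cues[3].id : "undefined");
  };
};

// A handler that does not depend on how cues are grouped into events.
const robust = (seen) => {
  let count = 0;
  return (event) => {
    for (const cue of event.cues) {
      if (count++ === 3) seen.push(cue.id);
    }
  };
};

const onePacket = [allCues];
const twoPackets = [allCues.slice(0, 2), allCues.slice(2)];

console.log(deliver(onePacket, brittle).join(), deliver(twoPackets, brittle).join());
console.log(deliver(onePacket, robust).join(), deliver(twoPackets, robust).join());
```

The brittle handler finds the fourth cue only when all cues arrive together; the counting handler finds it under both chunkings.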

Comment 19 Brendan Long 2014-10-15 22:24:15 UTC

(In reply to Ian 'Hixie' Hickson from comment #18)
> The problem is that authors will write code like:
> 
>    m.oncueadd = function (event) {
>      m.oncueadd = null;
>      doSomethingWith(event.cues[3]);
>    }
> 
> ...and this'll work fine in testing, and then one day the server sends only
> 2 cues in the first TCP packet, and it breaks.

Like you said though, wouldn't this problem be the same if we made it time-based, even if it's less likely?

Also, I can't imagine a case where someone would try to use a particular cue in the event's cue list. If you know the value of a particular cue, then why look it up at all?

Comment 20 Ian 'Hixie' Hickson 2014-10-16 16:26:03 UTC

You might not know the value, but the cues might still be in a predictable order, especially metadata cues.

I think if it was time-based it'd be more understandable to authors what was going on, but yes, it would still be problematic. I don't have a good answer. :-(

Comment 21 Brendan Long 2014-10-16 20:12:31 UTC

I'm still not convinced that firing individual events would be a performance issue. If parsing the initial WebVTT file is the only problem, then we could just not fire events for cues discovered before reaching the "loaded" state.

Comment 22 Ian 'Hixie' Hickson 2014-10-17 23:51:37 UTC

Philip?

Comment 23 Philip Jägenstedt 2014-10-29 15:13:18 UTC

Sorry for the hiatus. I also don't have a good answer :(

Of the ideas so far, not firing the event when the cue is added by the WebVTT parser (or scripts?) seems least bad to me... unless there are in-band tracks where all of the cues are at the very beginning, which would be the same kind of situation.

Note that in order to avoid the "doSomethingWith(event.cues[3])" problem the parser would have to fire events synchronously, otherwise scripts could still incorrectly assume that many cues are available when the first event is fired. That is a little bit annoying, because one has to consider (and write tests for) what happens when the event handler causes the <track> element to be garbage collected, destroying the parser.

Comment 24 Brendan Long 2014-10-29 15:35:18 UTC

(In reply to Philip Jägenstedt from comment #23)
> Note that in order to avoid the "doSomethingWith(event.cues[3])" problem the
> parser would have to fire events synchronously, otherwise scripts could
> still incorrectly assume that many cues are available when the first event
> is fired. That is a little bit annoying, because one has to consider (and
> write tests for) what happens when the event handler causes the <track>
> element to be garbage collected, destroying the parser.

Is it really that important to make sure that JS authors can't use the events incorrectly? It seems like it would be obvious how these events should be used, since they include the new cue as an attribute. If people want to ignore the event and guess which cues exist, that seems like a straightforward authoring mistake, not something we need to be concerned about.

Comment 25 Philip Jägenstedt 2014-10-29 19:29:37 UTC

Sure, it's possible to go overboard trying to prevent mistakes; I don't even know how to quantify these risks.

The solution which is hardest to make a mistake with is almost certainly one event per cue, fired synchronously when the cue is added to the track. It would be a bit atypical, since pretty much everything else around HTMLMediaElement queues async events.
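
The difference between firing synchronously and queuing a task can be sketched as below. This is a simplified model, not how any engine is implemented: `Track`, `queueTask`, and `cuesVisibleAtEachEvent` are hypothetical, and the manual task list stands in for the event loop.

```javascript
// Hypothetical one-cue-per-event shape.
class CueEvent extends Event {
  constructor(cue) {
    super("cueadd");
    this.cue = cue;
  }
}

class Track extends EventTarget {
  constructor(fireSynchronously, queueTask) {
    super();
    this.cues = [];
    this.fireSynchronously = fireSynchronously;
    this.queueTask = queueTask;
  }

  addCue(cue) {
    this.cues.push(cue);
    const fire = () => this.dispatchEvent(new CueEvent(cue));
    if (this.fireSynchronously) {
      fire(); // handler runs before the parser adds the next cue
    } else {
      this.queueTask(fire); // handler runs after the parser's current run
    }
  }
}

// Records how many cues a handler can observe on the track at each event.
function cuesVisibleAtEachEvent(fireSynchronously) {
  const tasks = [];
  const track = new Track(fireSynchronously, (task) => tasks.push(task));
  const visible = [];
  track.addEventListener("cueadd", () => visible.push(track.cues.length));

  // The parser adds three cues in a single run.
  for (const id of ["a", "b", "c"]) {
    track.addCue({ id });
  }
  // The event loop then runs any queued tasks.
  while (tasks.length > 0) {
    tasks.shift()();
  }
  return visible;
}

console.log(JSON.stringify(cuesVisibleAtEachEvent(true)));  // [1,2,3]
console.log(JSON.stringify(cuesVisibleAtEachEvent(false))); // [3,3,3]
```

With queued tasks, the first handler call already sees all three cues on the track, so a script could come to depend on how many cues one parser run happened to add.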

Comment 26 Ian 'Hixie' Hickson 2014-11-26 23:49:52 UTC

(In reply to Brendan Long from comment #24)
> 
> Is it really that important to make sure that JS authors can't use the
> events incorrectly?

There are two side effects of making mistakes easy: one obvious and easy for browser vendors to dismiss, and one less obvious and more expensive for browser vendors.

The first is simply that it makes it more likely that Web pages will be brittle, that the Web in general will be not a great experience for users. This is easy to dismiss because it's a bug in the page, so really authors should just know better, etc.

The second, though, is what happens next. Suppose we have this API, and a page assumes it can always get a hold of the second cue in the first event (or whatever). Now, assume this page works fine because their server always in fact sends the first two cues in one packet. Assume further that this page is hugely popular. It's the new facebook.com or something, except with authors who aren't as responsive. And now suppose that in a future revision of your browser, you want to change the way TCP packets are handled. Maybe they get routed through an accelerator proxy. Maybe you are shipping a new handset with a new radio that fragments TCP packets differently. Maybe you are internally changing your IPC system so that only one cue actually makes it to the browser process at a time. Either way, you are now stuck. You have to hack your browser to make sure that the first two cues are always bundled together, even though they're not really. You have to have hacks to handle timeouts in case there aren't two cues. And so on.

The risk isn't just that pages will be buggy. It's that browser vendors will be massively constrained in the future. This isn't academic, we end up in this situation all the time.

Comment 27 Brendan Long 2014-12-01 16:20:35 UTC

Ok, so maybe using a list of cues in each event is a bad idea.

(In reply to Philip Jägenstedt from comment #25)
> The solution which is hardest to make a mistake with is almost certainly one
> event per cue, fired synchronously when the cue is added to the track. It
> would be a bit atypical, since pretty much everything else around
> HTMLMediaElement queues async events.

I'm still not convinced that the cue events need to be synchronous.

> Note that in order to avoid the "doSomethingWith(event.cues[3])" problem
> the parser would have to fire events synchronously, otherwise scripts
> could still incorrectly assume that many cues are available when the first
> event is fired. That is a little bit annoying, because one has to consider
> (and write tests for) what happens when the event handler causes the <track>
> element to be garbage collected, destroying the parser.

Don't we have essentially the same problem with every other HTMLMediaElement-related event? I think writing a few more tests is worth it to keep these events asynchronous.

Comment 28 Ian 'Hixie' Hickson 2014-12-01 19:23:44 UTC

I'm not sure what "synchronous" and "asynchronous" mean in this context. Synchronous with what?

Comment 29 Brendan Long 2014-12-04 23:09:43 UTC

(In reply to Ian 'Hixie' Hickson from comment #28)
> I'm not sure what "synchronous" and "asynchronous" mean in this context.
> Synchronous with what?

I meant that we fire the event and then keep moving forward instead of waiting for JavaScript to handle it. I think this is how all media events currently work?

Comment 30 Ian 'Hixie' Hickson 2014-12-19 20:00:24 UTC

Firing an event is an operation that returns only once the event is handled. Do you mean queue a task to fire the event? If so, then it depends. Some of the tasks that fire events do other things (like set state, equivalent to adding the cues here) at the same time.

Comment 31 Brendan Long 2014-12-19 20:59:10 UTC

I meant we queue a task to fire the event.

Comment 32 Ian 'Hixie' Hickson 2014-12-22 23:58:24 UTC

Well the risk described in comment 26 would be a real risk if the cues are added in a different task than the task that fires the events.

Comment 33 Anne 2016-03-28 13:31:51 UTC

Brendan, is this still something you're interested in?

Comment 34 Brendan Long 2016-03-28 14:43:30 UTC

I still think having a useful "new cue" event is a good idea, but I don't have time to work on it anymore.

Comment 35 Anne 2017-07-21 11:02:30 UTC

If this is still desired, please file a new issue at https://github.com/whatwg/html/issues/new. (I suspect that if this ever becomes a thing it'll be big in libraries first.)