This is an archived snapshot of W3C's public Bugzilla bug tracker, decommissioned in April 2019.

Bug 18029 - <track> Live captioning
Summary: <track> Live captioning
Status: RESOLVED WONTFIX
Alias: None
Product: WHATWG
Classification: Unclassified
Component: HTML
Version: unspecified
Hardware: All / OS: All
Importance: P3 normal
Target Milestone: Needs Research
Assignee: Ian 'Hixie' Hickson
QA Contact: contributor
URL: http://www.whatwg.org/specs/web-apps/...
Whiteboard:
Keywords:
Depends on:
Blocks: 23414
Reported: 2012-07-18 15:59 UTC by contributor
Modified: 2016-03-17 05:30 UTC
CC List: 16 users

See Also:


Attachments
Naresh Dhiman (2 bytes, patch)
2014-04-03 22:49 UTC, Naresh Dhiman

Description contributor 2012-07-18 15:59:06 UTC
This bug was cloned from bug 14104 as part of operation convergence.
Originally filed: 2011-09-11 15:02:00 +0000
Original reporter: Silvia Pfeiffer <silviapfeiffer1@gmail.com>

================================================================================
 #0   Silvia Pfeiffer                                 2011-09-11 15:02:29 +0000 
--------------------------------------------------------------------------------
Several service providers currently provide live captioning through something called "streaming text". This means that a text file containing the captions is provided, and the file grows over time as new cues are appended.

This use case cannot currently be supported by the track element, because the video element does not advance its ready state until the complete WebVTT file has loaded, and it never re-checks whether anything has been added to the WebVTT file afterwards. This approach makes sense for the canned case, but it won't work for live streaming text.

It would be good to support this use case. There is a fundamental difference between streaming text and canned text track files: the former delivers cues incrementally (and possibly unreliably), while the latter has a fixed, pre-defined list of cues. It is therefore probably necessary to put some sort of flag onto streaming text tracks - maybe a @live attribute or something similar. This would remove the requirement to hold back the ready state until the WebVTT file has fully loaded, and would require the browser to keep adding cues as the file grows.
================================================================================
 #1   Ian 'Hixie' Hickson                             2011-09-14 22:23:06 +0000 
--------------------------------------------------------------------------------
In the streaming case, how do you know what timestamps to use?
================================================================================
 #2   Ralph Giles                                     2011-09-15 01:07:31 +0000 
--------------------------------------------------------------------------------
Playback is generally happening inside the context of a media element, so we would compare the timestamps on the cues against the currentTime attribute of that media element to determine which are active.

The way I envisioned this working is that a <track> element points to a src URL handled by a server which holds the connection open, sending new WebVTT cues as they become available. The media element would then make a best effort to display those cues at the appropriate time.

There is certainly an issue with knowing whether you have the next cue or not. The current spec addresses this by requiring that the entire caption file be loaded before playback can begin. While one can certainly implement the live case with XHR and mutable text tracks, I think it is preferable to allow live streaming of captions with just the <track> element.

The user agent does know when it has the next cue in the stream, so one possible solution is to trigger ready state on that. Or we could leave it to best effort in the live case and see how it works.
================================================================================
 #3   Philip J                                        2011-09-15 06:51:45 +0000 
--------------------------------------------------------------------------------
There are two problems here:

1. In the live case currentTime will start at 0 when you start watching, so a different WebVTT file has to be sent to every user.

2. There is currently no way to differentiate between a slow network and a live stream, so how would one know when to go to HAVE_METADATA?
================================================================================
 #4   Ian 'Hixie' Hickson                             2011-09-15 23:02:28 +0000 
--------------------------------------------------------------------------------
Problem #1 is what I was referring to. I don't understand #2; can you elaborate?
================================================================================
 #5   Philip J                                        2011-09-16 08:16:37 +0000 
--------------------------------------------------------------------------------
(In reply to comment #3)

> 2. There is currently no way to differentiate between a slow network and a live
> stream, so how would one know when to go to HAVE_METADATA?

Currently, loading the tracks blocks the media element's readyState: it will go to HAVE_METADATA only when the track is loaded. If the track doesn't finish loading because the connection is intentionally kept open, the video will never be able to play. I can't see a way to make an exception for the streaming case, because there's no difference at the HTTP level between streaming and the network being slow.
================================================================================
 #6   Silvia Pfeiffer                                 2011-09-17 15:20:39 +0000 
--------------------------------------------------------------------------------
(In reply to comment #3)
> There are two problems here:
> 
> 1. In the live case currentTime will start at 0 when you start watching, so a
> different WebVTT file has to be sent to every user.
> 
> 2. There is currently no way to differentiate between a slow network and a live
> stream, so how would one know when to go to HAVE_METADATA?

#1 is solved by initialTime IIUC.
http://www.whatwg.org/specs/web-apps/current-work/multipage/the-video-element.html#dom-media-initialtime

When the browser connects to the live video stream, the currentTime is 0, but the initialTime will be whatever time has passed since the stream started. The browser can then use that initialTime as an offset to subtract from the timestamps provided in the WebVTT file.
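For illustration, a minimal script sketch of that offset arithmetic, assuming the initialTime attribute discussed here (it was later dropped from the spec, so the property name is historical):

--------------8<--------------
// Sketch only: map cue times (which are relative to the original start of the
// stream) onto this client's timeline using initialTime as the offset.
const video = document.querySelector('video');
const offset = video.initialTime || 0;           // seconds of the stream missed before joining

function isCueActive(cue) {
  const streamTime = video.currentTime + offset; // position on the stream's own timeline
  return cue.startTime <= streamTime && streamTime < cue.endTime;
}
-------------->8--------------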
================================================================================
 #7   Philip J                                        2011-09-19 08:07:53 +0000 
--------------------------------------------------------------------------------
AFAICT, initialTime is for an initial offset that is actually on the timeline, e.g. given by a media fragment URI, not for a position that is before the stream began. There is also startOffsetTime, but for that to be usable the captions themselves would also need to have a start date.

In any case, do you mean that the browser will natively sync the captions of live streams to make up for the timeline difference, or that scripts will be able to do so?
================================================================================
 #8   Ian 'Hixie' Hickson                             2011-09-19 22:44:09 +0000 
--------------------------------------------------------------------------------
Without a better understanding of how this is intended to work, I don't know how to fix this.

Starting to play a video before the cues have all been received seems like a bad idea in the general case, since the cues might be out of order, there might be an arbitrarily large number of cues at the very first frame of video, etc. In fact, WEBVTT simply isn't set up to handle streaming — given a situation where the UA has received the first 5 minutes of video and has received 5MB of cues including one at 4 minutes and 55 seconds, even if the cues were assumed to be ordered, there'd be no way to know whether all the cues had been received yet or not. If they had not, you'd want to pause or you'd miss cues (possibly _all_ the cues, if the cue stream is just a few seconds behind where the video stream is at, and the user never pauses to let it catch up).

The timing issue is also pretty serious. Since a streaming video can be joined at an arbitrary time, and that time is considered t=0 unless the stream has explicit timestamps (a pretty advanced feature as far as I can tell — does anyone support this and have a video that they can demonstrate it with?), there's simply no way that I can see for the system to know what times to use in the cues except for the server to somehow recognise which video stream was associated with which user — and then for the system to handle restarts. This is especially problematic as presumably you wouldn't want unique timings for each user anyway.

WebVTT was designed for static resources, where the users creating the subtitles are as likely as not to be independent of the users creating the videos. For dynamic streams with captions, it seems highly unlikely that you'd have anyone doing the captions other than someone directly affiliated with the original stream, and in that case you really should just have the stream itself contain the titles. As far as I can tell that would solve all these problems neatly.
================================================================================
 #9   Silvia Pfeiffer                                 2011-09-27 06:56:24 +0000 
--------------------------------------------------------------------------------
I'm pretty sure that if we don't solve this, people will work around it in JavaScript, e.g. by continuously reloading a new @src into the active <track> element that points back to the same file, which has in the meantime grown and gained additional cues after the previously retrieved byte range.
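For what it's worth, a minimal sketch of that kind of workaround (the poll interval and the cache-busting query parameter are arbitrary choices, not part of any spec):

--------------8<--------------
// Sketch only: periodically point the active <track> back at the same file,
// which may have grown since the last fetch.
const trackEl = document.querySelector('track');
const baseSrc = trackEl.src;

setInterval(() => {
  // Re-setting src makes the user agent refetch the file; the query parameter
  // defeats caching so the grown file is actually retrieved.
  trackEl.src = baseSrc + '?t=' + Date.now();
}, 5000);
-------------->8--------------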

We can leave this for now and solve it at a later stage if we prefer to encourage people to use the JavaScript API rather than the <track> element for the live use case.
================================================================================
 #10  Ian 'Hixie' Hickson                             2011-10-01 00:15:06 +0000 
--------------------------------------------------------------------------------
The JS API was actually designed in part for this purpose, so you could stream cues in and add them through the API.

Note that doing it by constantly reloading the src="" wouldn't work, for the reasons given in comment 8 paragraph 3.
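For reference, a minimal sketch of streaming cues through that API, assuming some transport (not shown) delivers cue data to the page; VTTCue is used here, though at the time of this comment the interface was TextTrackCue:

--------------8<--------------
// Sketch only: create a script-backed track and push cues into it as they arrive.
const video = document.querySelector('video');
const track = video.addTextTrack('captions', 'Live captions', 'en');
track.mode = 'showing';

// Called by the page's transport layer (WebSocket, EventSource, polling, ...)
// whenever a new cue has been received.
function onLiveCue(startTime, endTime, text) {
  track.addCue(new VTTCue(startTime, endTime, text));
}
-------------->8--------------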
================================================================================
 #11  Silvia Pfeiffer                                 2011-10-01 08:42:49 +0000 
--------------------------------------------------------------------------------
(In reply to comment #10)
> The JS API was actually designed in part for this purpose, so you could stream
> cues in and add them through the API.

I appreciate that. I expect, though, that there will be two ways of dealing with "streaming text" - one that will be fully JS based, and one that will be file-based.


> Note that doing it by constantly reloading the src="" wouldn't work, for the
> reasons given in comment 8 paragraph 3.

That can be overcome by always providing the cues relative to the video's start time and giving the page information about how much time has passed since the video's original start time.


(In reply to comment #7)
> AFAICT, initialTime is for an initial offset that is actually on the timeline,
> e.g. given by a media fragment URI, not for a position that is before the
> stream began.

So initialTime and a media fragment URI's offset time are identical - I would think that we don't need initialTime then, since we can get it out of the URI.


> There is also startOffsetTime, but for that to be usable the
> captions themselves would also need to have a start date.

Yeah, that maps the video's zero time to a date, which isn't quite what we need.

What we need is basically a secondsMissed, which is the number of seconds that have passed since the start of the stream which the viewer has missed when joining this stream live. Given that the times in the WebVTT file would be relative to that original start time, you can calculate when the cues would need to be presented.

> In any case, do you mean that the browser will natively sync the captions of
> live streams to make up for the timeline difference, or that scripts will be
> able to do so?

Being able to use the native display would be the goal.

For scripts to be able to do so, they need the secondsMissed information, too, which they would need to get from a data-* attribute from the server. Then scripts would be able to do a custom caption display.

So, I guess what we would need to change to support this use case are the following:
* introduce a secondsMissed attribute for live streams
* introduce a reload mechanism for <track> elements
* introduce a "next" end time keyword in WebVTT
================================================================================
 #12  Ian 'Hixie' Hickson                             2011-10-25 04:40:54 +0000 
--------------------------------------------------------------------------------
Could you elaborate on those bullet points?
================================================================================
 #13  Silvia Pfeiffer                                 2011-10-29 08:56:28 +0000 
--------------------------------------------------------------------------------
Sure.

The @secondsMissed attribute would be an attribute on the media element that says how many seconds ago the stream started, i.e. what time offset currentTime=0 maps to. Given that the times in the WebVTT file would be relative to that original start time, you can calculate when the cues need to be presented by treating currentTime + secondsMissed as the video's playback time.

The reload mechanism on <track> elements would mean that when the currentSrc resource's last cue has been read and lies before the end of the video, and the video is still loading, the browser does an HTTP byte range request for bytes of the currentSrc resource beyond the previously retrieved end of file. E.g. if the currentSrc resource is 600,000 bytes long, do a GET request with Range: bytes=600000- . This would be repeated while more video is being downloaded and stopped otherwise. The repetition frequency would likely be tied to the request rate of the video.
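A rough script sketch of that reload mechanism, assuming a plain HTTP server that honours Range requests; the parsing step and the proposed secondsMissed value are placeholders:

--------------8<--------------
// Sketch only: request whatever has been appended since the last fetch.
let fetchedBytes = 0;

async function pollForNewCues(url) {
  const response = await fetch(url, {
    headers: { Range: 'bytes=' + fetchedBytes + '-' }   // e.g. "bytes=600000-"
  });
  if (response.status === 206) {                 // 206 Partial Content: the file has grown
    const appended = await response.text();
    fetchedBytes += new TextEncoder().encode(appended).length;
    // ...parse `appended` for complete cue blocks and add them to the track...
  }
  // A 416 (Range Not Satisfiable) response means nothing new has been written yet.
}

// With the proposed secondsMissed, a cue authored against the original start of the
// stream would become active once currentTime + secondsMissed reaches cue.startTime.
-------------->8--------------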


As for the change in WebVTT, the idea is that the following would be legal:

WEBVTT

00:00:00.000 --> next
This cue lasts until the next cue replaces it, i.e. 5.6 sec.

00:00:05.600 --> next
Same here, i.e. 4.4 sec.

00:00:10.000 --> next
If none follows, the cue lasts until the end of the video.


These three together would allow for live captioning through streaming text.
================================================================================
 #14  Ian 'Hixie' Hickson                             2011-10-30 18:02:16 +0000 
--------------------------------------------------------------------------------
I don't understand how anyone would be able to fill in secondsMissed="".
================================================================================
 #15  Silvia Pfeiffer                                 2011-10-31 02:02:07 +0000 
--------------------------------------------------------------------------------
The server knows how long the video has already been running, so it can fill it in. It does require a server component to keep track of how long the video has been streaming, but that information is typically available to the server, in particular when the server is recording the stream at the same time, as ustream and livestream do. For example, this ustream video starts at 15:45 rather than 0: http://www.ustream.tv/recorded/18146941/highlight/212855
================================================================================
 #16  Ian 'Hixie' Hickson                             2011-11-02 19:47:57 +0000 
--------------------------------------------------------------------------------
That doesn't work, because it doesn't take into account the delay between the server writing the HTML page and the client parsing it.

Also, if you can update the media server such that it has such close control over the HTTP server, then why not just put the text tracks in the media directly?
================================================================================
 #17  Silvia Pfeiffer                                 2011-11-02 21:04:37 +0000 
--------------------------------------------------------------------------------
OK, I guess we can leave the attribute for now.

The idea of introducing a "next" or similar special value for end times into WebVTT is still a useful one. Could we still explore this, then?
================================================================================
 #18  Ian 'Hixie' Hickson                             2011-11-03 16:25:57 +0000 
--------------------------------------------------------------------------------
"next" seems like merely syntactic sugar, so I don't know how helpful it would be (writing subtitles is, I hope, mostly done using editing tools). But please file a separate bug for that if it has good use cases.

As far as this bug goes, I find myself back at comment 8. I don't see how to do this in a sane way.
================================================================================
 #19  Ian 'Hixie' Hickson                             2011-11-11 20:02:58 +0000 
--------------------------------------------------------------------------------
Status: Did Not Understand Request
Change Description: no spec change
Rationale: Not sure how to resolve this with out-of-band text tracks.
================================================================================
 #20  Silvia Pfeiffer                                 2011-11-12 21:58:35 +0000 
--------------------------------------------------------------------------------
If you go to any of the videos on http://www.youtube.com/live , the time they show at the beginning of the video is the actual time that has passed since the video started streaming. These times are easily matched to the time stamps of a streaming text service such as http://streamtext.net/.

In fact, YouTube have used the services of a streaming text provider before to caption live video: http://gigaom.com/video/youtube-launches-live-captions-at-google-io/. This worked exactly as I described: when a viewer started watching the video, they also received the streaming text, with time stamps counted from the beginning of the transmission, which the video player was then able to synchronize.

It is possible to implement this fully in JavaScript with the current specifications so I am not going to reopen this. But I expect that with the progress that we make for WebRTC we may also need to revisit this issue and wanted to leave more details on this bug.
================================================================================
 #21  Ian 'Hixie' Hickson                             2011-11-28 23:56:52 +0000 
--------------------------------------------------------------------------------
For those live streams, the video seems to include an internal time, which the captions presumably use as well. So that's rather different than what you were proposing.

For that kind of case, what we'd really want is not a static file to download, it would be a stream. You'd want to tell the server around when to start (presumably automatically), and you'd want to update the cues in real time, presumably throwing cues away that are before the earliest start time.

That doesn't seem too unreasonable.

To support things like inline live corrections, though, we'd probably want a different format than WebVTT, or at least some variant on it. e.g.:

--------------8<--------------
WEBVTT

00:00.000 --> 00:05.000
captions that were available before the user connected

01:00:00.000 --> 01:02:00.000
bla bla bla

LIVE--> align:middle rollup-size:3
<01:03:11.004> Bla bla <01:03:11.353> bla <rollup> <01:03:11.653> bal <01:03:11.710> <redoline> bla <01:03:12.004> bla bla...
-------------->8--------------

...where in a LIVE block, timestamps indicate when the following text should appear, <rollup> indicates when to go to the next line, <redoline> indicates that the current line should be deleted in favour of new text... This is just a strawman, I don't know what the right solution is here.

In particular, what should happen if you seek backwards to a point between when a line was rolled up and a correction was made? Should we show the incorrect text, or should the incorrect text be dropped in favour of the new text? When should the new text appear, should it appear at the time of the incorrect text? How should corrections be made? Should anything be allowed after a LIVE block, or is a LIVE block forcibly the last block of a file?
================================================================================
 #22  Silvia Pfeiffer                                 2011-12-04 01:53:49 +0000 
--------------------------------------------------------------------------------
(In reply to comment #21)
> For those live streams, the video seems to include an internal time, which the
> captions presumably use as well. So that's rather different than what you were
> proposing.

It's the time since the stream was started, and that's exactly what I was referring to. I don't understand how that makes a difference.


> For that kind of case, what we'd really want is not a static file to download,
> it would be a stream.

Agreed. That's what I meant by a "streaming text" file.


> You'd want to tell the server around when to start
> (presumably automatically), and you'd want to update the cues in real time,
> presumably throwing cues away that are before the earliest start time.

The streaming text file is a separate resource from the video and it contains cues with times synchronized with the beginning of the video. New cues are added at the end of the file. It can be either the server throwing away captions that are from before the earliest start time, or it can be the browser which knows the start time of the video and can tell which cues are in the past.


> That doesn't seem too unreasonable.

Cool.

> To support things like inline live corrections, though, we'd probably want a
> different format than WebVTT, or at least some variant on it. e.g.:
> 
> --------------8<--------------
> WEBVTT
> 
> 00:00.000 --> 00:05.000
> captions that were available before the user connected
> 
> 01:00:00.000 --> 01:02:00.000
> bla bla bla
> 
> LIVE--> align:middle rollup-size:3
> <01:03:11.004> Bla bla <01:03:11.353> bla <rollup> <01:03:11.653> bal
> <01:03:11.710> <redoline> bla <01:03:12.004> bla bla...
> -------------->8--------------
> 
> ...where in a LIVE block, timestamps indicate when the following text should
> appear, <rollup> indicates when to go to the next line, <redoline> indicates
> that the current line should be deleted in favour of new text... 

I'd like to keep the rollup and redo-line problems separate. The rollup problem applies not only to live captioning; it is a general problem. We have a discussion about it in the Text Tracks Community Group right now with different options, so I'd like to defer that problem there. Also, the redo-line problem is a new one that again should be solved independently of live captioning.

So, I just want to focus on the timing part of this problem, which is also a <track>-related problem, not just a WebVTT problem.

Your suggestion of introducing a "LIVE" cue without timing has one big problem: all captions for a video end up being in a single cue. That's not readable, not easy to edit, and hardly easy to re-stream: it would be difficult to determine what is still active when seeking to a specific offset.

My approach was to allow cues to be active until the next cue appears. (Incidentally, for rollup captions this could be adapted to being active until the next three cues appear.)

For example instead of this (endless) cue:

--
> LIVE--> align:middle rollup-size:3
> <01:03:11.004> Bla bla <01:03:11.353> bla <rollup> <01:03:11.653> bal
> <01:03:11.710> <redoline> bla <01:03:12.004> bla bla...
--

you would have something like:

--
01:03:11.004 --> NEXT(3) align:middle
<01:03:11.004> Bla bla <01:03:11.353> bla

01:03:11.653 --> NEXT(3)
bal <01:03:11.710> <redoline> bla <01:03:12.004> bla bla...
--

The third start time of a cue after the current cue should be easy to determine in code.


> This is just a
> strawman, I don't know what the right solution is here.

Yeah, I am not 100% sure what is best either, but finding advantages/disadvantages with some markup is certainly good.


> In particular, what should happen if you seek backwards to a point between when
> a line was rolled up and a correction was made? Should we show the incorrect
> text, or should the incorrect text be dropped in favour of the new text?


During live streaming, no seeking should be possible. So, that problem would not occur. Usually for captions that were done live, there is some post-production. The post-production would typically remove all redone characters.

Also, events are handled as they are reached, so if the redone stuff is still there, then playback would exactly replicate the original changes again, which it should.
================================================================================
 #23  Philip J                                        2011-12-05 10:12:00 +0000 
--------------------------------------------------------------------------------
(In reply to comment #22)

> During live streaming, no seeking should be possible.

A live stream is basically just a resource which is not byte range seekable, but that doesn't mean that the client can't seek in the data they have buffered. I think Firefox already does this and we (Opera) want to do it.
================================================================================
 #24  Ian 'Hixie' Hickson                             2011-12-05 22:20:30 +0000 
--------------------------------------------------------------------------------
Roll-up captions are, as far as I can tell, just one cue, that happens to have internal state (much like a karaoke cue, actually).

The problem with using the current cue concept for live cues is that you don't know when the cue will end (which it may well do before the end of the stream, e.g. if the live stream has prerecorded segments spliced in, e.g. ads), yet our cue format puts the end time before the cue text. Hence the desire for a different format for live cues.

Streaming definitely doesn't preclude seeking. Live cues definitely don't preclude streaming. The question of what to do with edited cues when seeking back seems quite valid to me.
================================================================================
 #25  Ian 'Hixie' Hickson                             2011-12-05 22:38:15 +0000 
--------------------------------------------------------------------------------
Another issue we have to consider is what to do when the latency on the subtitle stream is such that it is several seconds behind the video stream. If we don't have a mechanism whereby the current time is transmitted continuously even when no captions are to be shown, there's not really any way I can see for the UA to know whether it is missing captions (and should stall the video) or not.
================================================================================
 #26  Ian 'Hixie' Hickson                             2011-12-05 23:29:50 +0000 
--------------------------------------------------------------------------------
Interestingly, the thread to which I replied here:
   http://lists.w3.org/Archives/Public/public-texttracks/2011Dec/0033.html
...includes several anecdotal data points (and one reference to some research) suggesting that even for live captioning, we might want to focus on pop-up captions and not support roll-up captions.
================================================================================
 #27  Silvia Pfeiffer                                 2011-12-07 23:30:42 +0000 
--------------------------------------------------------------------------------
(In reply to comment #23)
> (In reply to comment #22)
> 
> > During live streaming, no seeking should be possible.
> 
> A live stream is basically just a resource which is not byte range seekable,
> but that doesn't mean that the client can't seek in the data they have
> buffered. I think Firefox already does this and we (Opera) want to do it.

That's ok. It's client-side seeking only. In this case, I would simply replay everything exactly how it was received before, including the edits.
================================================================================
 #28  Silvia Pfeiffer                                 2011-12-08 00:31:08 +0000 
--------------------------------------------------------------------------------
(In reply to comment #24)
> Roll-up captions are, as far as I can tell, just one cue, that happens to have
> internal state (much like a karaoke cue, actually).

Roll-up is a means of displaying lines of text. It doesn't matter whether they are in one cue or in many cues. Text from several cues should be able to be added to a previous cue and make it roll up.

 
> The problem with using the current cue concept for live cues is that you don't
> know when the cue will end (which it may well do before the end of the stream,
> e.g. if the live stream has prerecorded segments spliced in, e.g. ads), yet our
> cue format puts the end time before the cue text. Hence the desire for a
> different format for live cues.

Again: let me decouple rollup from this problem, since rollup is a means of display, not a timing means. I want to focus on the timing issues.

Even for cues whose end time we don't know at the time of their creation, there is a point at which the end time becomes known. This is typically the appearance of another cue. Thus, a cue's end time can be set in relation to the start time of a future cue.

This is the problem that I tried to solve. It is independent of rollup, because this may happen with pop-on captions, too.
================================================================================
 #29  Philip J                                        2011-12-09 10:03:04 +0000 
--------------------------------------------------------------------------------
(In reply to comment #27)
> (In reply to comment #23)
> > (In reply to comment #22)
> > 
> > > During live streaming, no seeking should be possible.
> > 
> > A live stream is basically just a resource which is not byte range seekable,
> > but that doesn't mean that the client can't seek in the data they have
> > buffered. I think Firefox already does this and we (Opera) want to do it.
> 
> That's ok. It's client-side seeking only. In this case, I would simply replay
> everything exactly how it was received before, including the edits.

I don't understand; when we seek, we invalidate which cues are visible and start fresh. It has to be defined what happens with the changes you propose, since if the cues are mutated by later cues' arrival one can't "simply replay everything exactly how it was received".
================================================================================
 #30  Silvia Pfeiffer                                 2011-12-11 09:40:54 +0000 
--------------------------------------------------------------------------------
OK, this is way off topic from what the bug was originally registered for. But I'll run with it.

Ian suggested introducing an "editing" command into cues called <redoline>. This is a command that, along the timeline, changes something that has been displayed before. For example, a line that starts with the text "I ma hnugry" would be erased a few seconds later by the <redoline> command and replaced with the text "I am hungry". As I understand it, the markup would look something like this:

LIVE--> rollup-size:3
<00:00:00.000> I ma hnugry <00:00:05.000> <redoline> I am hungry

This specifies a clear display order, even with the changes.

So, when you seek back to any time between 0 and 5 sec, you display the "I ma hnugry" text again, and for any time after 5 sec, you display the "I am hungry" text. This is what I mean by "replay everything exactly how it was received".

Hope that clarifies what I meant.
================================================================================
 #31  Philip J                                        2011-12-11 11:02:49 +0000 
--------------------------------------------------------------------------------
(In reply to comment #30)

> Hope that clarifies what I meant.

It does, although what makes sense here strongly depends on what the in-memory model for the cues is supposed to be. If <redoline> mutates the existing cues then it'd be (much) simpler to just show the correct version. The alternative is that we have special rendering rules for live cues that collapse lines.
================================================================================
 #32  Ian 'Hixie' Hickson                             2012-01-31 22:26:26 +0000 
--------------------------------------------------------------------------------
(In reply to comment #30)
> 
> LIVE--> rollup-size:3
> <00:00:00.000> I ma hnugry <00:00:05.000> <redoline> I am hungry
> 
> So, when you seek back to any time between 0 and 5 sec, you display the "I ma
> hnugry" text again, and for any time after 5 sec, you display the "I am hungry"
> text. This is what I mean by "replay everything exactly how it was received".

That seems like a net worse user experience than only showing the correctly-spelt text.
================================================================================
 #33  Silvia Pfeiffer                                 2012-02-01 11:14:34 +0000 
--------------------------------------------------------------------------------
I think it's unfaithful to have the browser change the presentation when seeking back to it. What if the viewer has just seen something funny being typed and wants to rewind to it to show his friend who missed it?

I think we should leave such changes up to a content editor who is going to re-publish the live streamed video & captions with improvements, be that fixes to the video, or fixes to the captions.
================================================================================
 #34  Philip J                                        2012-02-06 12:17:42 +0000 
--------------------------------------------------------------------------------
If so, what should the in-memory model be? What do you see via the DOM APIs?
================================================================================
 #35  Silvia Pfeiffer                                 2012-02-06 21:24:09 +0000 
--------------------------------------------------------------------------------
We haven't mapped <redoline> to anything in HTML5 yet. It should probably map to a "display: none" on the previous span and then create a new one at the given time instant.
================================================================================
 #36  Ian 'Hixie' Hickson                             2012-04-19 23:44:13 +0000 
--------------------------------------------------------------------------------
You really want to optimise for making fun of transcription errors rather than for caption quality?
================================================================================
 #37  Silvia Pfeiffer                                 2012-04-20 01:26:51 +0000 
--------------------------------------------------------------------------------
I'd actually prefer if we didn't get sidetracked in this bug with introducing markup for editing cues. I don't actually think there is a big requirement for editing on the Web, since for live captioning & streaming we can delay the transmission to make sure the captioner has finished fixing their transcript.

What I was trying to focus on in this bug was not the WebVTT markup, but the problem that WebVTT files may change while the video is playing back, that this is a legitimate use case for live streaming, and that in this case the browser needs to reload the WebVTT file frequently.

The problems that Philip lists can be addressed:

#1 would make use of the startDate together with the currentTime to know when to display cues

#2 every time the WebVTT file has been loaded, the browser returns to HAVE_METADATA - it doesn't need to wait

In fact, the server could be clever and provide to newly connected clients just the cues that relate to the part of the video that they are currently looking at. Though in this case it's almost identical to loading the captions via JS.

What the browser still doesn't know is when it has to reload the WebVTT file. That's why I suggested a @live attribute which would cause the browser to frequently reload the WebVTT file.

I can see three different ways to trigger a reload:
* whenever the browser runs out of cues for a video (in fact, this could likely be useful in general)
* when a startDate is given on the video and the browser runs out of cues
* when a @live attribute is given on the video and the browser runs out of cues

Also, we'd need to limit the number of reload tries if there are no changes to the WebVTT file.
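As a sketch of the "runs out of cues" trigger with the suggested cap on retries (all names are illustrative; reloadTrack() stands in for whichever reload mechanism is chosen):

--------------8<--------------
// Sketch only: reload when playback has passed the last known cue, and stop
// retrying after a few reloads that bring in nothing new.
const MAX_IDLE_RELOADS = 5;
let idleReloads = 0;

video.addEventListener('timeupdate', async () => {
  const cues = track.cues;
  const lastEnd = cues && cues.length ? cues[cues.length - 1].endTime : 0;
  const outOfCues = !video.ended && video.currentTime >= lastEnd;
  if (outOfCues && idleReloads < MAX_IDLE_RELOADS) {
    const gotNewCues = await reloadTrack();      // placeholder for the actual reload
    idleReloads = gotNewCues ? 0 : idleReloads + 1;
  }
});
-------------->8--------------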
================================================================================
 #38  Ian 'Hixie' Hickson                             2012-06-26 20:12:08 +0000 
--------------------------------------------------------------------------------
I had assumed that live subtitling would necessarily need to include support for live subtitling correction. If this is not the case, that changes the design space substantially. If anything, it makes the issue in comment 25 more critical, since we can no longer show the cues incrementally but need to know when we have the whole cue and when we're missing the next cue.

I don't understand what you mean about reloading the WebVTT file. Surely the only sane way to do live streaming is to stream.
================================================================================
 #39  Silvia Pfeiffer                                 2012-06-29 04:46:16 +0000 
--------------------------------------------------------------------------------
(In reply to comment #38)
> I had assumed that live subtitling would necessarily need to include support
> for live subtitling correction. If this is not the case, that changes the
> design space substantially.

It is not the case. It's a feature that we may introduce at a later stage, but not one I've seen used anywhere on the Web in live streamed captions.

> If anything, it makes the issue in comment 25 more
> critical, since we can no longer show the cues incrementally but need to know
> when we have the whole cue and when we're missing the next cue.

The video continues to provide the timeline, of course. The browser can only do best effort. If there is a cue that has to be displayed at a certain time, because the video's time has reached it, but the cue has not been received yet by the browser because the latency on the subtitle stream is higher than on the video stream, then it can't be displayed (the browser wouldn't even know it existed). However, video requires more bandwidth than text tracks in general, so I don't see this problem occurring frequently.
 
> I don't understand what you mean about reloading the WebVTT file. Surely the
> only sane way to do live streaming is to stream.

I don't understand what you mean about "to stream" it. Streaming is defined for audio and video as consecutively loading byte ranges. Are you implying that this also applies to text files? And that therefore this use case is already covered?
================================================================================
Comment 1 Ian 'Hixie' Hickson 2012-09-15 22:18:01 UTC
> Are you implying that this also applies to text files?

Yes. See as far back as point #2 above. I thought this was clear.


> And that therefore this use case is already covered?

No, currently if you tried to stream a <track> it would block the video from ever playing. We'd also want to have some sort of feature to detect when we had reached beyond the point at which the stream had sent us data, which is currently not possible (there's no way to distinguish lag from lack of tracks).

This problem is relatively easy to solve, e.g. we could add a block type to WebVTT that says "Ok, I have given you cues up to X" and when there's silence on the track, the WebVTT stream could just output these "null" cues regularly. Then we just say that as soon as you see one of those, the track is considered ready.
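To make that concrete, one possible shape for such a block, sketched from the client side (neither the syntax nor the handling exists in WebVTT; this is purely a strawman):

--------------8<--------------
// Sketch only: a hypothetical "cues delivered up to X" marker in the stream.
// A block such as
//
//   HEARTBEAT 00:05:00.000
//
// would promise that every cue starting before the 5 minute mark has already
// been delivered, so the track can be treated as ready up to that time even
// while nothing is being said.
function handleStreamedBlock(block, state) {
  const m = /^HEARTBEAT (\d+):(\d{2}):(\d{2})\.(\d{3})$/.exec(block.trim());
  if (m) {
    state.cuesCompleteUpTo =
      Number(m[1]) * 3600 + Number(m[2]) * 60 + Number(m[3]) + Number(m[4]) / 1000;
    return;                                      // nothing to display, just advance the horizon
  }
  // ...otherwise hand the block to the ordinary cue parser...
}
-------------->8--------------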

I still don't see an answer to issue 1 from point #3 above, though, which is the real blocker for me here. If every user gets a video stream with different timestamps, and the subtitle server doesn't have a way to know which timestamp the video server is using for any particular frame, I just don't see how to solve this.
Comment 2 Silvia Pfeiffer 2012-10-18 02:43:12 UTC
I've checked how live captioning was used at Google I/O. The code for some of it is actually open source:
http://code.google.com/p/io-captions-gadget/

Google used http://streamtext.net/ to do the streaming of the text for them. StreamText have a special server and deliver their captions into an iFrame that the site embeds: https://streamtext.zendesk.com/entries/21705252-embedding-streaming-text-with-streamtext-into-your-web-pages

Basically, what happens is that the captions server provides for query parameters to adjust the display and position:

https://streamtext.zendesk.com/entries/21721966-controlling-the-streaming-text-page-display

In particular, there is a "last" parameter that tells the server what position of the streaming text file to serve from:
http://code.google.com/p/io-captions-gadget/source/browse/streamtext.py

I assume we want to support streaming text with a plain HTTP server instead of providing a custom server.

The concept of "last" can be replicated using @startDate and calculating the difference between @startDate and "now". That gives us the time synchronization between the video and the WebVTT file, assuming both started at @startDate. From here, the browser can either download the full existing WebVTT file and drop all past cues, or it could do a sequence of byte range requests to do a bisection search to the correct position in the file.
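A sketch of that calculation on the client, assuming the user agent exposes the stream's start date (getStartDate() in later drafts; the comment above calls it @startDate); the clock-skew and seek-back caveats raised in the next comment still apply:

--------------8<--------------
// Sketch only: how many seconds of the stream this client missed before joining.
const video = document.querySelector('video');
const startDate = video.getStartDate ? video.getStartDate() : new Date(NaN);
const missedSeconds = isNaN(startDate.getTime())
  ? 0
  : (Date.now() - startDate.getTime()) / 1000;   // computed once, when playback begins

// A cue timed against the start of the transmission would then be shown at
//   cue.startTime - missedSeconds
// on this client's media timeline, and cues entirely in the past simply dropped.
-------------->8--------------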
Comment 3 Ian 'Hixie' Hickson 2012-10-18 19:00:49 UTC
What's the difference between "a plain HTTP server" and "a custom server"?

It sounds like what they did is the old-fashioned way of doing EventSource.


> The concept of "last" can be replicated using @startDate and calculating the
> difference between @startDate and "now". That gives us the time
> synchronization between the video and the WebVTT file, assuming both started
> at @startDate. From here, the browser can either download the full existing
> WebVTT file and drop all past cues, or it could do a sequence of byte range
> requests to do a bisection search to the correct position in the file.

So you want each streaming user to have a different caption stream with different timestamps? That seems excessively complicated. Also, what if the user seeks back in the stream, to before the time at the start of the stream? Plus, how do you know what time "now" is, given that the user's clock is highly unreliable?

Anyway, if the video files have a reliable timeline offset, and start streaming relative to that timeline offset and don't just assume zero time is when the user connected, then the problem seems relatively easy to me — just have the captions be timed relative to the global zero time for the stream. My understanding, though, is that that isn't what happens. Each user connects and the first frame they get is time=0 for them (so by default each user has to have a caption stream with different time offsets, the very problem we're trying to prevent), and they don't have a reliable way of knowing what that time corresponds to (no official "initial time").

If we're ok with requiring that the video server report a correct "initial time" and if we're ok with requiring that the captions server offset the caption timestamps for each user, then the problem is indeed solved. I just assumed that neither of those were valid options.
Comment 4 Silvia Pfeiffer 2012-10-18 23:12:07 UTC
(In reply to comment #3)
> What's the difference between "a plain HTTP server" and "a custom server"?

A "custom server" is one that knows how to interpret the query parameters.


> It sounds like what they did is the old-fashioned way of doing EventSource.

I don't know what you mean by that. I couldn't find EventSource in the source code.


> So you want each streaming user to have a different caption stream with
> different timestamps?

I think we could do it without rewriting the timestamps. In fact, I think that's what the new HTTP Live Streaming spec does with WebVTT, where it provides a synchronization header: X-TIMESTAMP-MAP , see http://tools.ietf.org/html/draft-pantos-http-live-streaming-10 .


> If we're ok with requiring that the video server report a correct "initial
> time"

We already have that in the startDate, don't we?


> and if we're ok with requiring that the captions server offset the
> caption timestamps for each user,...

I'm not sure it's the best solution, but that's certainly what Apple's proposal does: rather than rewriting all cue start/end times, it offsets them with a single global timestamp header.


> then the problem is indeed solved.

Maybe. I would actually prefer if in live streaming we could have a video timeline that represents how long that stream has been going already, so the user can seek back. Then we would also not need to rewrite text track timestamps. RTP/RTSP have headers that provide that information, see NTP timestamps at http://tools.ietf.org/html/rfc3550#section-6.4.1 . HLS is trying really hard to have that information, too, using EXT-X-DISCONTINUITY fields
Comment 5 Silvia Pfeiffer 2012-10-18 23:21:11 UTC
[Hae? I don't know how that got posted prematurely. Ignore the last comment.]

(In reply to comment #3)
> What's the difference between "a plain HTTP server" and "a custom server"?

A "custom server" is one that knows how to interpret the query parameters.


> It sounds like what they did is the old-fashioned way of doing EventSource.

I don't know what you mean by that. I couldn't find EventSource in the source code.


> So you want each streaming user to have a different caption stream with
> different timestamps?

I think we could do it without rewriting the timestamps. In fact, I think that's what the new HTTP Live Streaming spec does with WebVTT, where it provides a synchronization header: X-TIMESTAMP-MAP , see http://tools.ietf.org/html/draft-pantos-http-live-streaming-10 .


> If we're ok with requiring that the video server report a correct "initial
> time"

We already have that in the startDate, don't we?


> and if we're ok with requiring that the captions server offset the
> caption timestamps for each user,...

I'm not sure it's the best solution, but that's certainly what Apple's proposal does: rather than rewriting all cue start/end times, it offsets them with a single global timestamp header.


> then the problem is indeed solved.

Maybe. The biggest problem I see is in your comment #1: "currently if you tried to stream a <track> it would block the video from ever playing." I did not think that was the case because there is the note about "dynamically updating the list of cues" in http://www.whatwg.org/specs/web-apps/current-work/multipage/the-video-element.html#rules-for-updating-the-text-track-rendering .


Also, I would actually prefer if in live streaming we could have a video timeline that represents how long that stream has been going already, so the user can seek back. Then we would also not need to rewrite text track timestamps. RTP/RTSP have headers that provide that information, see NTP timestamps at http://tools.ietf.org/html/rfc3550#section-6.4.1 . HLS is trying really hard to have that information, too, using EXT-X-DISCONTINUITY fields etc, see http://tools.ietf.org/html/draft-pantos-http-live-streaming-10 .
Comment 6 Silvia Pfeiffer 2013-02-12 00:50:23 UTC
I've just looked into XEP-0301 [1], which is a protocol that extends Jabber with real-time text. It contains support for all the features that a live captioner's machine provides and could fairly easily be mapped into WebVTT:

* action elements: http://xmpp.org/extensions/xep-0301.html#list_of_action_elements
  - <t> insert text
       @p - insertion position inside cue (num of characters)
  - <e> erase text
       @p - deletion position inside cue (num of characters)
       @n - number of characters to remove using backspace
  - <w> wait (probably not that useful to us)

[1] http://xmpp.org/extensions/xep-0301.html

To use WebVTT for live captioning, we could package first the cue header and then successively the cue text in XEP-0301 <rtt> elements.

This would call for the introduction of the "action elements" <t> and <e> into WebVTT cue text.

The <rtt> element would take care of constructing the cue successively from the individual packets using the events [2]:
  - new: Begin a new real-time message (begin a cue)
  - reset: Reset the current real-time message (end previous cue and begin a cue)
  - init: Signals the start of real-time text (start of WebVTT transmission)
  - cancel: Signals the end of real-time text (end of WebVTT transmission)

[2]  http://xmpp.org/extensions/xep-0301.html#event

Incidentally, if we extend WebVTT with these <t> and <e> elements, the XEP-0301 <rtt> elements could also be used with WebRTC data channels [3] to transmit real-time text.

[3] http://dev.w3.org/2011/webrtc/editor/webrtc.html#idl-def-RTCDataChannel
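To illustrate how those action elements might behave if they were ever carried in WebVTT cue text, a rough sketch of applying them to a cue's current text (entirely hypothetical; neither <t> nor <e> exists in WebVTT):

--------------8<--------------
// Sketch only: apply XEP-0301-style edit operations to a cue's text.
//   <t> inserts text at position @p (default: end of text)
//   <e> erases @n characters backwards from position @p (defaults: 1 character, end of text)
function applyAction(text, action) {
  if (action.type === 't') {
    const p = action.p ?? text.length;
    return text.slice(0, p) + action.text + text.slice(p);
  }
  if (action.type === 'e') {
    const p = action.p ?? text.length;
    const n = action.n ?? 1;
    return text.slice(0, p - n) + text.slice(p);
  }
  return text;                                   // <w> (wait) has no textual effect
}

// e.g. erase the whole line "I ma hnugry" and retype it:
//   applyAction(applyAction('I ma hnugry', { type: 'e', n: 11 }),
//               { type: 't', text: 'I am hungry' })   // -> 'I am hungry'
-------------->8--------------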
Comment 7 Silvia Pfeiffer 2013-02-12 03:07:32 UTC
To be clear on my last comment: I am wondering what you think about introducing the <t> and <e> elements into WebVTT.
Comment 8 Ian 'Hixie' Hickson 2013-04-22 17:04:53 UTC
I thought we'd established that wasn't needed (see point #37 above).

As far as I can tell, we established that it's reasonable to expect streams to have defined _timeline offset_s (see comment 2), and the only thing that needs to be defined here is the streamed-download feature.

Basically this needs two things, as far as I can tell:

 - Incremental downloading of caption files, rather than waiting for the whole 
   thing to be downloaded before enabling playback.

 - A way to determine if we've received enough of a caption file to play back up
   to a certain time.
Comment 9 Silvia Pfeiffer 2013-06-04 12:43:42 UTC
(In reply to comment #8)
> I thought we'd established that wasn't needed (see point #37 above).

Right, ignore the stuff about RTT.


> As far as I can tell, we established that it's reasonable to expect streams
> to have defined _timeline offset_s (see comment 2), and the only thing that
> needs to be defined here is the streamed-download feature.
> 
> Basically this needs two things, as far as I can tell:
> 
>  - Incremental downloading of caption files, rather than waiting for the
> whole 
>    thing to be downloaded before enabling playback.
> 
>  - A way to determine if we've received enough of a caption file to play
> back up
>    to a certain time.

Yes, let's do these.
Comment 10 Ian 'Hixie' Hickson 2013-06-12 19:24:46 UTC
Ok, so how should we do this?:

 - A way to determine if we've received enough of a caption file to play back up
   to a certain time.

Once we have that, the previous step is reasonably straight-forward.
Comment 11 Silvia Pfeiffer 2013-06-15 12:36:53 UTC
(In reply to comment #10)
> Ok, so how should we do this?:
> 
>  - A way to determine if we've received enough of a caption file to play
> back up
>    to a certain time.
> 
> Once we have that, the previous step is reasonably straight-forward.


My take:

The <track @src> file should be tried frequently for new byte ranges as time marches on.

I'm for best effort: video playback should not wait for cues to arrive, so if they arrive too late, they won't get shown.
Comment 12 Simon Pieters 2013-06-17 08:53:04 UTC
(In reply to comment #11)
> My take:
> 
> The <track @src> file should be tried frequently for new byte ranges as time
> marches on.

This doesn't make sense to me. <track>s don't use byte range requests.

What we're looking for is to know when to start playback. Without streaming, we wait for the whole track to have loaded. With streaming, that would mean to wait forever, so that doesn't work, but we shouldn't not wait at all since that could mean cues are missed in the beginning.

How about:

We have received enough to play up to X if there is a cue with end time >= X, or the file is completely loaded or failed to load.
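Read literally, that rule is something like the following (names are illustrative only):

--------------8<--------------
// Sketch only: "we have received enough to play up to time x".
function haveEnoughCaptionsFor(x, cues, trackState) {
  if (trackState === 'loaded' || trackState === 'failed') return true;
  for (let i = 0; i < cues.length; i++) {
    if (cues[i].endTime >= x) return true;
  }
  return false;
}
-------------->8--------------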
Comment 13 Silvia Pfeiffer 2013-06-17 10:08:33 UTC
(In reply to comment #12)
> (In reply to comment #11)
> > My take:
> > 
> > The <track @src> file should be tried frequently for new byte ranges as time
> > marches on.
> 
> This doesn't make sense to me. <track>s don't use byte range requests.

In the live situation, the WebVTT file is a growing file, just like the video. Correct me if I'm wrong, but AFAIK the only way in which you can continue retrieving new cues is through HTTP byte range requests (after the first request).


> What we're looking for is to know when to start playback. Without streaming,
> we wait for the whole track to have loaded. With streaming, that would mean
> to wait forever, so that doesn't work, but we shouldn't not wait at all
> since that could mean cues are missed in the beginning.
> 
> How about:
> 
> We have received enough to play up to X if there is a cue with end time >=
> X, or the file is completely loaded or failed to load.

You don't actually know if there will ever be a cue with end time >= X. This is why I suggested a best effort.

Basically what I meant was:

* do a request on the WebVTT file
* do a request on the video file
* when the video file is ready to start playing, do a request on the WebVTT file to update your cues (if any)
* start playing the video (no matter if you have cues for that time)
* repeat the last two steps until end of video is reached

Since normally the WebVTT file is small and since the server would have some buffering strategy on the video to avoid putting video data out before the caption data is available, this should normally get the WebVTT cues to the client before the video data reaches that point.
Comment 14 Simon Pieters 2013-06-17 10:28:36 UTC
(In reply to comment #13)
> In the live situation, the WebVTT file is a growing file, just like the
> video. Correct me if I'm wrong, but AFAIK the only way in which you can
> continue retrieving new cues is through HTTP byte range requests (after the
> first request).

No, you can just have a normal HTTP request and keep it open. Byte range requests are only necessary to support seeking to ranges that aren't buffered. That's not necessary to support streaming.


> You don't actually know if there will ever be a cue with end time >= X.

True. For instance, there could be silence the first minute of the stream, and then some spoken words and cues. It would be bad to wait to start playback in such a situation. So maybe instead we should start playback if the WebVTT header has been parsed or the file failed to load.
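A minimal sketch of that model from script, reading one long-lived response incrementally with fetch(); parseCompleteCues() is a placeholder for a WebVTT parser that pulls finished cue blocks off the front of a buffer:

--------------8<--------------
// Sketch only: the server keeps the connection open (chunked transfer, no final
// Content-Length) and the client consumes cue blocks as they arrive.
async function streamCaptions(url, textTrack) {
  const response = await fetch(url);
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';
  for (;;) {
    const { value, done } = await reader.read();
    if (done) break;                             // the live stream has ended
    buffer += decoder.decode(value, { stream: true });
    const { cues, rest } = parseCompleteCues(buffer);  // placeholder parser
    buffer = rest;
    for (const cue of cues) {
      textTrack.addCue(new VTTCue(cue.start, cue.end, cue.text));
    }
  }
}
-------------->8--------------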
Comment 15 Silvia Pfeiffer 2013-06-17 10:41:49 UTC
(In reply to comment #14)
> (In reply to comment #13)
> > In the live situation, the WebVTT file is a growing file, just like the
> > video. Correct me if I'm wrong, but AFAIK the only way in which you can
> > continue retrieving new cues is through HTTP byte range requests (after the
> > first request).
> 
> No, you can just have a normal HTTP request and keep it open. Byte range
> requests are only necessary to support seeking to ranges that aren't
> buffered. That's not necessary to support streaming.

If you have special server support, then the server can decide not to give you a content length and use a chunked encoding, and thus continue pushing the file to the client as it grows.

But on a standard Apache server, the client will be told a content length on the request and once the client has received that much data, it will need to make a second request to receive more, which means asking for a byte range starting at the previously received end of file.


> > You don't actually know if there will ever be a cue with end time >= X.
> 
> True. For instance, there could be silence the first minute of the stream,
> and then some spoken words and cues. It would be bad to wait to start
> playback in such a situation. So maybe instead we should start playback if
> the WebVTT header has been parsed or the file failed to load.

Right, that would be the minimum received data to wait for.
Comment 16 Simon Pieters 2013-06-17 11:36:06 UTC
(In reply to comment #15)
> If you have special server support, then the server can decide not to give
> you a content length, use chunked encoding, and thus continue pushing the
> file to the client as it grows.

Right.

> But on a standard Apache server, the client will be told a content length on
> the request and once the client has received that much data, it will need to
> make a second request to receive more, which means asking for a byte range
> starting at the previously received end of file.

That doesn't match my understanding of how byte range requests work.

If the server gave a content-length and everything has been received, there's no reason for the client to do further requests, since it has it all. It doesn't make sense to give a content-length of a streaming resource that is shorter than the stream.
Comment 17 Silvia Pfeiffer 2013-06-17 12:07:03 UTC
(In reply to comment #16)
> > But on a standard Apache server, the client will be told a content length on
> > the request and once the client has received that much data, it will need to
> > make a second request to receive more, which means asking for a byte range
> > starting at the previously received end of file.
> 
> That doesn't match my understanding of how byte range requests work.
> 
> If the server gave a content-length and everything has been received,
> there's no reason for the client to do further requests, since it has it
> all. It doesn't make sense to give a content-length of a streaming resource
> that is shorter than the stream.

Assume that the WebVTT file is being written on the server while Apache serves it. A new cue is only added every few seconds. In the meantime Apache gets a GET request, reads the file, assesses its size, and returns the content length.

Here's an example of how that happens:
http://serverfault.com/questions/272841/apache-and-growing-files
Comment 18 Simon Pieters 2013-06-17 12:27:49 UTC
That doesn't seem like a good way to do streaming. I think we shouldn't support it for <track>.
Comment 19 Ralph Giles 2013-06-17 16:37:28 UTC
(In reply to comment #14)

> No, you can just have a normal HTTP request and keep it open. Byte range
> requests are only necessary to support seeking to ranges that aren't
> buffered. That's not necessary to support streaming.

I agree with Simon. The server keeping the connection open, with no Content-Length header, is how I envision live WebVTT streaming working. This is how Icecast-style streaming of live audio and video works.
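(A rough illustration of such a setup in Node.js; liveCaptionSource is a placeholder for whatever produces the cues. Because no Content-Length is set, the response falls back to chunked transfer encoding and stays open while cues are written.)

  // Sketch only: stream a growing WebVTT resource over one open connection.
  const http = require('http');

  http.createServer((req, res) => {
    res.writeHead(200, { 'Content-Type': 'text/vtt' });
    res.write('WEBVTT\n\n');                 // send the header immediately

    const onCue = (cue) =>
      res.write(cue.start + ' --> ' + cue.end + '\n' + cue.text + '\n\n');
    liveCaptionSource.on('cue', onCue);      // placeholder event emitter

    req.on('close', () => liveCaptionSource.off('cue', onCue));
  }).listen(8000);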

Periodically requesting byte ranges beyond the end of the file doesn't make sense to me. Either we do this in the general case and the user agent generates many unnecessary connections, or there's a special attribute enabling extra behaviour that hasn't been necessary for any other media type.

Further, periodic requests like this can be implemented in JS with either the addCue() method or the media source extensions, if an author wants the live implementation to happen client-side with a standard Apache-style server.

> True. For instance, there could be silence the first minute of the stream,
> and then some spoken words and cues. It would be bad to wait to start
> playback in such a situation. So maybe instead we should start playback if
> the WebVTT header has been parsed or the file failed to load.

I'm fine with best-effort for live cues, but it would still be nice if loadedmetadata waited until some of the active texttrack data was available. I don't want a race with the network for the first few captions in the static file case, so we should wait until the parser is getting data, but we shouldn't block canplaythrough on having captions all the way to the end of the file.
Comment 20 Silvia Pfeiffer 2013-06-18 02:30:48 UTC
(In reply to comment #19)
> (In reply to comment #14)
> 
> > No, you can just have a normal HTTP request and keep it open. Byte range
> > requests are only necessary to support seeking to ranges that aren't
> > buffered. That's not necessary to support streaming.
> 
> I agree with Simon. The server keeping the connection open, with no
> Content-Length header, is how I envision live WebVTT streaming working. This
> is how Icecast-style streaming of live audio and video works.

So you're expecting to run a custom server for streamed WebVTT content?


> Periodically requesting byte ranges beyond the end of the file doesn't make
> sense to me. Either we do this in the general case and the user agent
> generates many unnecessary connections, or there's a special attribute
> enabling extra behaviour that hasn't been necessary for any other media type.

Isn't that what browsers do for live video streaming through the video element?
Comment 21 Simon Pieters 2013-06-18 10:34:04 UTC
(In reply to comment #20)
> So you're expecting to run a custom server for streamed WebVTT content?

I'm not Ralph, but yes. It's necessary for streaming video anyway.

> Isn't that what browsers do for live video streaming through the video
> element?

No. If a video has a content-length and that gets downloaded, browsers assume the end has been reached (AFAIK). Which is perfectly reasonable since it matches HTTP semantics.
Comment 22 Simon Pieters 2013-06-18 10:38:31 UTC
(In reply to comment #19)
> I'm fine with best-effort for live cues, but it would still be nice if
> loadedmetadata waited until some of the active texttrack data was available.

That would take one minute in the scenario above. Delaying loadedmetadata also delays playback, so that's not good.

> I don't want a race with the network for the first few captions in the
> static file case, so we should wait until the parser is getting data, but we
> shouldn't block canplaythrough on having captions all the way to the end of
> the file.

I don't see much choice other than waiting until the WebVTT header has been parsed. We could say to wait a certain amount of time, but that seems to just add slowness and not solve the race. In practice, for the static file case, the first few cues will probably be in the same packet as the WebVTT header.
Comment 23 Silvia Pfeiffer 2013-06-18 10:47:57 UTC
(In reply to comment #21)
> (In reply to comment #20)
> > So you're expecting to run a custom server for streamed WebVTT content?
> 
> I'm not Ralph, but yes. It's necessary for streaming video anyway.
> 
> > Isn't that what browsers do for live video streaming through the video
> > element?
> 
> No. If a video has a content-length and that gets downloaded, browsers
> assume the end has been reached (AFAIK). Which is perfectly reasonable since
> it matches HTTP semantics.

OK, if we expect a special streaming server for video, we can similarly expect the same server to provide streaming text (e.g. Icecast, HTTP adaptive streaming, etc.).
Comment 24 Ian 'Hixie' Hickson 2013-07-13 02:01:59 UTC
Waiting for the header to be parsed doesn't work.

Imagine (time goes vertically, increasing downwards):

  VIDEO                  WEBVTT FILE
  header...              header...
  "Hello!"               cue saying "Hello"
  <silence>              <nothing>
  "Goodbye!"             cue saying "Goodbye"

Now suppose you're streaming this video and WebVTT file, and you're the client, and you've gotten:

  VIDEO                  WEBVTT FILE
  header...              header...
  "Hello!"               
  <silence>              

Should we play?

Say you've gotten:

  VIDEO                  WEBVTT FILE
  header...              header...
  "Hello!"               cue saying "Hello"

Should we play?

Say you've gotten:

  VIDEO                  WEBVTT FILE
  header...              header...
  "Hello!"               cue saying "Hello"
  <silence>

Should we play? How far should we play?

Say you've gotten:

  VIDEO                  WEBVTT FILE
  header...              header...
  "Hello!"               cue saying "Hello"
  <silence>
  "Goodbye!"

How far should we play?
Comment 25 Simon Pieters 2013-08-06 11:46:06 UTC
In theory the video can be downloaded faster than the track. The question is, does it happen often enough to be relevant in practice? Also, what other options are there?
Comment 26 Ian 'Hixie' Hickson 2013-08-06 22:09:07 UTC
(In reply to comment #25)
> In theory the video can be downloaded faster than the track. The question
> is, does it happen often enough to be relevant in practice?

Subtitles will often come from entirely different hosts, which may have very different latency characteristics. Subtitles are also far less likely to be cached (since they're used less). Also, if the video and subtitles come from different TCP streams, and only the subtitle one gets a lost packet, it's possible for that specific connection to be slowed down while the other is not. Also, if the stream is live, it's quite possible that the caption stream will be two seconds behind the video stream, simply because there's someone in the loop who has to actually write the captions. So even with a good connection, where you have virtually no latency, you'll always be two seconds behind; if we don't have a way to handle this, we'll never show a caption live.

How often will it happen? I don't know, but it doesn't seem purely hypothetical, and the cost of not handling it is huge for users who are affected.


> Also, what other options are there?

It would be trivial to introduce a marker line in WebVTT to say "ok, you've received everything up to this time". Not necessary when there's subtitles, but necessary when there's silence. Or we could provide regular indices listing the times that are known to have subtitles. Or we could have subtitles for silence that say how long the silence is. There's lots of options. See also comment 1.
Comment 27 Silvia Pfeiffer 2013-08-09 03:56:34 UTC
(In reply to comment #26)
> Also, if the stream is live, it's quite possible that the caption
> stream will be two seconds behind the video stream, simply because there's
> someone in the loop who has to actually write the captions.

That is usually taken into account when setting up the live captioning system and is not a problem we have to deal with.

This bug is not about solving the real-time captioning case where there is a video conference between peers and captioning happens at the same time. It is rather for the use case where a live video is broadcast with captions. In these situations, online broadcasters now typically delay the video by about 5s to give the captioner a chance to author captions that can then be distributed in sync with the video at the correct times.

This is, thus, not a problem that we have to solve, but one we can assume has been solved; we just have to display the captions with the video as they arrive.


> > Also, what other options are there?
> 
> It would be trivial to introduce a marker line in WebVTT to say "ok, you've
> received everything up to this time". Not necessary when there's subtitles,
> but necessary when there's silence. Or we could provide regular indices
> listing the times that are known to have subtitles. Or we could have
> subtitles for silence that say how long the silence is. There's lots of
> options. See also comment 1.

You are assuming that the caption track is driving the playback of the video and that it is not allowed to drop any caption cues on the floor.

I would rather suggest we go with a best effort model where the video timeline is driving the experience. Thus, once the setup data (i.e. header data) for video and captions has been received, we can go to loadedmetadata.

The way the "time marches on" algorithm is written will then make sure that captions that arrive late but are still active are displayed immediately. Also, if a caption arrives too late, at least its enter and exit events are raised.
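(For illustration, a page can observe this with the standard text track events; the sketch below assumes a script-created track.)

  // A cue added after its start time but before its end time still becomes
  // active on the next "time marches on" run, firing enter (and later exit).
  const video = document.querySelector('video');
  const track = video.addTextTrack('captions', 'Live captions', 'en');
  track.mode = 'showing';

  track.addEventListener('cuechange', () => {
    for (let i = 0; i < track.activeCues.length; i++) {
      console.log('active cue:', track.activeCues[i].text);
    }
  });

  const cue = new VTTCue(12, 15, 'Late-arriving caption');
  cue.onenter = () => console.log('cue entered');
  cue.onexit = () => console.log('cue exited');
  track.addCue(cue);   // may happen while currentTime is already past 12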

It will work in most cases, because the video requires more bandwidth than the captions. If your caption server is too slow or too far away to deliver captions on time, it's not sufficiently performant and you'd be better off using a different one.
Comment 28 Simon Pieters 2013-08-09 08:00:32 UTC
Hixie is right that the caption stream can suffer packet loss, which can delay data by several seconds (and packet loss isn't the server's fault).

I didn't understand Hixie's proposal before comment 26. I agree now that there are other options.

So let's consider the following scenario. The user plays a live video with live captions, and they're synchronised. The captions get a packet loss and data gets delayed by 5 seconds. What is the better user experience:

(1) The video continues to play and the cues during those 5 seconds are missed.
(2) The video stalls for 5 seconds (minus local buffer) and then continues to play with the cues.

If I understand correctly, Silvia prefers (1) while Hixie prefers (2).
Comment 29 Simon Pieters 2013-08-09 08:07:38 UTC
(In reply to comment #27)
> I would rather suggest we go with a best effort model where the video
> timeline is driving the experience. Thus, once the setup data (i.e. header
> data) for video and captions has been received, we can go to loadedmetadata.

FWIW, what you describe is more like "minimal effort", and Hixie's proposal requires more effort. :-)
Comment 30 Silvia Pfeiffer 2013-08-09 09:27:32 UTC
I'll ask some a11y folks who have done this to see what they prefer.
Comment 31 Ian 'Hixie' Hickson 2013-08-09 19:17:24 UTC
Suppose the sound track was separate from the video track, and the sound track got out of sync. Would we want to skip five seconds of audio, or wait til we had both?

I don't understand why this is even a question.
Comment 32 Silvia Pfeiffer 2013-08-10 08:49:46 UTC
Audio and video are continuous, while text tracks are discontinuous, so solutions that I've seen in the past have never relied on the next text track packet.

You are probably correct from a user's POV, though. If we make this an equally important track, we'd need to require servers to artificially make the text tracks continuous using one of the options you're suggesting.

I think the easiest would be to require that no gaps are left in the timeline and to fill breaks with empty cues (or "keep-alive" cues). These could be auto-created by the system that publishes the cues authored by the captioner, and they won't break the file format.
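(For illustration, a gap filled this way could look like the snippet below; this assumes the parser keeps cues with an empty payload, otherwise a single space in the cue body serves the same purpose.)

  WEBVTT

  00:00:10.000 --> 00:00:12.000
  Goodbye!

  00:00:12.000 --> 00:00:20.000

  00:00:20.000 --> 00:00:23.000
  Welcome back.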

Then, the browser can wait for the cues to arrive as you propose.
Comment 33 Ian 'Hixie' Hickson 2013-10-01 21:42:52 UTC
So one interesting thing is that waiting for there to be one cue covering the current playback position means that if there are two cues, we might not wait long enough for the second one to be received. This seems problematic. Should there be some way to indicate that this is not the last cue that covers a particular moment in time, or something? Or maybe use some other mechanism, like a block type in VTT that specifically says "we're good up to time T"?
Comment 34 Simon Pieters 2013-10-02 08:35:23 UTC
If we add a "we're good up to time T" marker that is required for streaming to work properly, do we need https://www.w3.org/Bugs/Public/show_bug.cgi?id=23414 ?
Comment 35 Ian 'Hixie' Hickson 2013-10-03 18:39:53 UTC
I guess we could treat the first such marker as the way to signal that it's ok to not wait til the end of the file.
Comment 36 Ian 'Hixie' Hickson 2013-11-13 21:30:46 UTC
So what model do we want here? (I'm trying to work out whether to add the mechanism in bug 23414 or whether to just make it a hook that e.g. WebVTT would use.)
Comment 37 Naresh Dhiman 2014-04-03 22:49:01 UTC
Created attachment 1462 [details]
Naresh Dhiman

Good
Comment 38 Silvia Pfeiffer 2014-05-05 08:34:13 UTC
(In reply to Ian 'Hixie' Hickson from comment #36)
> So what model do we want here? (I'm trying to work out whether to add the
> mechanism in bug 23414 or whether to just make it a hook that e.g. WebVTT
> would use.)

What would a hook for WebVTT look like?

I sort of like the solution proposed in bug 23414, but I can't compare because I don't fully understand the alternative.
Comment 39 Ian 'Hixie' Hickson 2014-05-07 18:57:56 UTC
The hook would be some algorithm that you invoke from the WebVTT parser that says "ok, consider this file ready up to t=12s!" or some such.
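(Purely hypothetical plumbing, not an existing API, just to make the shape of such a hook concrete:)

  // Hypothetical hook: the format parser tells the media element how far the
  // track is known to be complete, so playback only needs to wait for times
  // beyond that point.
  function markTrackReadyUpTo(track, time) {
    track.readyUpTo = time;   // hypothetical field consulted by the UA
  }

  // e.g. invoked by the WebVTT parser when it sees a cue or an in-band marker:
  markTrackReadyUpTo(track, 12 /* seconds */);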
Comment 40 Silvia Pfeiffer 2014-05-12 02:50:40 UTC
(In reply to Ian 'Hixie' Hickson from comment #39)
> The hook would be some algorithm that you invoke from the WebVTT parser that
> says "ok, consider this file ready up to t=12s!" or some such.

Would that be authored into a WebVTT stream? How do you envision this working from a production process POV?

Could the last received cue's end time provide this "ok, it's ready up to this time" information? If so, then the author could put empty cues into the stream when there is no text. That could cover a longer break, a bit like keep-alives.
Comment 41 Ian 'Hixie' Hickson 2014-05-15 21:51:01 UTC
That's up to the VTT spec. Could be implied from receiving a cue, could be some in-band metadata in the VTT file, I dunno.
Comment 42 Philip Jägenstedt 2014-05-16 11:07:03 UTC
I see that this is marked "Needs Impl Interest". Is there any immediate interest to implement something here? The original bug was filed in 2011...
Comment 43 Anne 2016-03-17 05:30:41 UTC
Closing as WONTFIX due to lack of implementer interest.