This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 14104 - <track> Live captioning
Summary: <track> Live captioning
Status: RESOLVED WONTFIX
Alias: None
Product: HTML WG
Classification: Unclassified
Component: HTML5 spec
Version: unspecified
Hardware: PC All
Importance: P2 enhancement
Target Milestone: ---
Assignee: Silvia Pfeiffer
QA Contact: HTML WG Bugzilla archive list
URL:
Whiteboard:
Keywords:
Duplicates: 24161
Depends on:
Blocks:
 
Reported: 2011-09-11 15:02 UTC by Silvia Pfeiffer
Modified: 2016-04-15 22:49 UTC
CC List: 14 users

See Also:



Description Silvia Pfeiffer 2011-09-11 15:02:29 UTC
Several service providers currently provide live captioning through something called "streaming text". This means that a text file containing the captions is provided, and the file grows over time as new cues are appended.

This use case cannot currently be supported by the track element, because the video element does not advance its ready state until the complete WebVTT file is loaded, and it never re-checks whether anything has been added to the WebVTT file. This approach makes sense for the canned case, but it won't work for live streaming text.

It would be good to support this use case. There is a fundamental difference between streaming text and canned text track files: the first delivers cues unreliably, while the second has a fixed, pre-defined list of cues. Thus, it is probably necessary to put some sort of flag onto streaming text tracks - maybe a @live attribute or something similar. This would remove the requirement that the ready state wait for the full WebVTT file to load, and it would require the browser to continue adding cues when the file size changes.
Comment 1 Ian 'Hixie' Hickson 2011-09-14 22:23:06 UTC
In the streaming case, how do you know what timestamps to use?
Comment 2 Ralph Giles 2011-09-15 01:07:31 UTC
Playback is generally happening inside the context of a media element, so we would compare the timestamps on the cues against the currentTime attribute of that media element to determine which are active.

The way I envisioned this working is that a <track> element points to a src URL handled by a server which holds the connection open, sending new WebVTT cues as they become available. The media element would then make a best effort to display those cues at the appropriate time.

There is certainly an issue with knowing whether you have the next cue or not. The current spec addresses this by requiring that the entire caption file be loaded before playback can begin. While one can certainly implement the live case with XHR and mutable text tracks, I think it is preferable to allow live streaming of captions with just the <track> element.

The user agent does know when it has the next cue in the stream, so one possible solution is to trigger ready state on that. Or we could leave it to best effort in the live case and see how it works.
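
For illustration, a minimal sketch of that comparison, assuming cue objects that carry startTime/endTime in seconds on the media timeline (the helper name and variables are made up):

// Sketch only: a cue is active while the media element's currentTime lies
// between its start and end time.
function activeCues(video, cues) {
  return cues.filter(function (cue) {
    return cue.startTime <= video.currentTime && video.currentTime < cue.endTime;
  });
}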
Comment 3 Philip Jägenstedt 2011-09-15 06:51:45 UTC
There are two problems here:

1. In the live case currentTime will start at 0 when you start watching, so a different WebVTT file has to be sent to every user.

2. There is currently no way to differentiate between a slow network and a live stream, so how would one know when to go to HAVE_METADATA?
Comment 4 Ian 'Hixie' Hickson 2011-09-15 23:02:28 UTC
Problem #1 is what I was referring to. I don't understand #2, can you elaborate?
Comment 5 Philip Jägenstedt 2011-09-16 08:16:37 UTC
(In reply to comment #3)

> 2. There is currently no way to differentiate between a slow network and a live
> stream, so how would one know when to go to HAVE_METADATA?

Currently, loading the tracks blocks the media element's readyState: it will go to HAVE_METADATA only when the track is loaded. If the track doesn't finish loading because the connection is intentionally kept open, the video will never be able to play. I can't see a way to make an exception for the streaming case, because there's no difference at the HTTP level between streaming and the network being slow.
Comment 6 Silvia Pfeiffer 2011-09-17 15:20:39 UTC
(In reply to comment #3)
> There are two problems here:
> 
> 1. In the live case currentTime will start at 0 when you start watching, so a
> different WebVTT file has to be sent to every user.
> 
> 2. There is currently no way to differentiate between a slow network and a live
> stream, so how would one know when to go to HAVE_METADATA?

#1 is solved by initialTime, IIUC.
http://www.whatwg.org/specs/web-apps/current-work/multipage/the-video-element.html#dom-media-initialtime

When the browser connects to the live video stream, the currentTime is 0, but the initialTime will be whatever time has passed since the stream started. The browser can then use that initialTime as an offset to subtract from the time stamps provided in the WebVTT file.
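
A rough sketch of that offsetting, assuming initialTime behaves as described here (i.e. it reports how far into the stream playback was joined) and that the cue times in the WebVTT file count from the original start of the stream:

// Sketch only: map the viewer's playback position back onto the stream's
// original timeline before comparing it with the cue's authored times.
function cueIsActive(video, cue) {
  var streamTime = video.currentTime + video.initialTime; // seconds since the stream began
  return cue.startTime <= streamTime && streamTime < cue.endTime;
}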
Comment 7 Philip Jägenstedt 2011-09-19 08:07:53 UTC
AFAICT, initialTime is for an initial offset that is actually on the timeline, e.g. given by a media fragment URI, not for a position that is before the stream began. There is also startOffsetTime, but for that to be usable the captions themselves would also need to have a start date.

In any case, do you mean that the browser will natively sync the captions of live streams to make up for the timeline difference, or that scripts will be able to do so?
Comment 8 Ian 'Hixie' Hickson 2011-09-19 22:44:09 UTC
Without a better understanding of how this is intended to work, I don't know how to fix this.

Starting to play a video before the cues have all been received seems like a bad idea in the general case, since the cues might be out of order, there might be an arbitrarily large number of cues at the very first frame of video, etc. In fact, WEBVTT simply isn't set up to handle streaming — given a situation where the UA has received the first 5 minutes of video and has received 5MB of cues including one at 4 minutes and 55 seconds, even if the cues were assumed to be ordered, there'd be no way to know whether all the cues had been received yet or not. If they had not, you'd want to pause or you'd miss cues (possibly _all_ the cues, if the cue stream is just a few seconds behind where the video stream is at, and the user never pauses to let it catch up).

The timing issue is also pretty serious. Since a streaming video can be joined at an arbitrary time, and that time is considered t=0 unless the stream has explicit timestamps (a pretty advanced feature as far as I can tell — does anyone support this and have a video that they can demonstrate it with?), there's simply no way that I can see for the system to know what times to use in the cues except for the server to somehow recognise which video stream was associated with which user — and then for the system to handle restarts. This is especially problematic as presumably you wouldn't want unique timings for each user anyway.

WebVTT was designed for static resources, where the users creating the subtitles are as likely as not to be independent of the users creating the videos. For dynamic streams with captions, it seems highly unlikely that you'd have anyone doing the captions other than someone directly affiliated with the original stream, and in that case you really should just have the stream itself contain the titles. As far as I can tell that would solve all these problems neatly.
Comment 9 Silvia Pfeiffer 2011-09-27 06:56:24 UTC
I'm pretty sure that if we don't solve this, people will work around it in JavaScript by e.g. continuously reloading a new @src into the active <track> element, pointing back at the same file, which has in the meantime changed size and has some additional cues after the end of the previously retrieved byte range.
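
A sketch of that workaround (the file name, polling interval, and cache-busting query parameter are all assumptions; simply re-assigning an unchanged URL may not force a refetch):

// Sketch only: periodically re-point the active <track> at the growing
// WebVTT file so that the browser fetches it again and picks up the cues
// appended since the last load.
var track = document.querySelector('track');
setInterval(function () {
  track.src = 'captions.vtt?t=' + Date.now(); // hypothetical cache-buster
}, 10000);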

We can leave this for now and solve it at a later stage if we prefer to encourage people to use the JavaScript API rather than the <track> element for the live use case.
Comment 10 Ian 'Hixie' Hickson 2011-10-01 00:15:06 UTC
The JS API was actually designed in part for this purpose, so you could stream cues and add them through the API.
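
For example, something along these lines (a sketch only; the cue delivery mechanism is hypothetical, and VTTCue is the cue constructor browsers eventually shipped):

// Sketch only: create a script-owned text track and push cues into it as
// they arrive from the captioning server.
var track = video.addTextTrack('captions', 'Live captions', 'en');
track.mode = 'showing';

function onCueReceived(start, end, text) { // e.g. called from an XHR/WebSocket handler
  track.addCue(new VTTCue(start, end, text));
}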

Note that doing it by constantly reloading the src="" wouldn't work, for the reasons given in comment 8 paragraph 3.
Comment 11 Silvia Pfeiffer 2011-10-01 08:42:49 UTC
(In reply to comment #10)
> The JS API was actually designed in part for this purpose, so you could stream
> cues to add to through the API.

I appreciate that. I expect, though, that there will be two ways of dealing with "streaming text" - one that will be fully JS based, and one that will be file-based.


> Note that doing it by constantly reloading the src="" wouldn't work, for the
> reasons given in comment 8 paragraph 3.

That can be overcome by always providing the cues relative to the video's start time and by giving the page information about how much time has passed since the video's original start time.


(In reply to comment #7)
> AFAICT, initialTime is for an initial offset that is actually on the timeline,
> e.g. given by a media fragment URI, not for a position that is before the
> stream begun.

So initialTime and a media fragment URI's offset time are identical - I would think that we don't need initialTime then, since we can get it out of the URI.


> There is also startOffsetTime, but for that to be usable the
> captions themselves would also need to have a start date.

Yeah, that maps the video's zero time to a date, which isn't quite what we need.

What we need is basically a secondsMissed, which is the number of seconds that had already passed since the start of the stream (and which the viewer has missed) when they joined the stream live. Given that the times in the WebVTT file would be relative to that original start time, you can calculate when the cues need to be presented.

> In any case, do you mean that the browser will natively sync the captions of
> live streams to make up for the timeline difference, or that scripts will be
> able to do so?

Being able to use the native display would be the goal.

For scripts to be able to do so, they need the secondsMissed information, too, which they would need to get from a data-* attribute from the server. Then scripts would be able to do a custom caption display.

So, I guess what we would need to change to support this use case are the following:
* introduce a secondsMissed attribute for live streams
* introduce a reload mechanism for <track> elements
* introduce a "next" end time keyword in WebVTT
Comment 12 Ian 'Hixie' Hickson 2011-10-25 04:40:54 UTC
Could you elaborate on those bullet points?
Comment 13 Silvia Pfeiffer 2011-10-29 08:56:28 UTC
(In reply to comment #12)
> Could you elaborate on those bullet points?

Sure.

The @secondsMissed attribute would be an attribute on the media element that says how many seconds ago the stream started, i.e. what time offset currentTime=0 maps to. Given that the times in the WebVTT file would be relative to that original start time, you can calculate when the cues need to be presented by using currentTime + secondsMissed as the video's playback time.
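
In code, the mapping would be something like this (a sketch; @secondsMissed is the attribute proposed here, not an existing one):

// Sketch only: cue times count from the stream's original start, so offset
// the viewer's currentTime by the seconds missed before comparing.
var secondsMissed = parseFloat(video.getAttribute('secondsMissed')); // proposed attribute
function cueIsDue(cue) {
  var streamTime = video.currentTime + secondsMissed;
  return cue.startTime <= streamTime && streamTime < cue.endTime;
}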

The reload mechanism on the <track> element would mean that when the currentSrc resource's last cue has been read and is before the end of the video, while the video continues to load, the browser does an HTTP byte range request for bytes of the currentSrc resource beyond the previously retrieved end of the file. E.g. if the currentSrc resource retrieved so far is 600,000 bytes long, then do a GET request with Range: bytes=600000- . This would be repeated while more video is being downloaded and stopped otherwise. The repetition frequency would likely be tied to the request rate of the video.
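
A sketch of that reload loop, assuming the server honours byte-range requests on the growing file (the resource name, polling interval, and parsing helper are made up):

// Sketch only: keep requesting the bytes appended to the WebVTT resource
// beyond what has already been retrieved, and parse any new cues from them.
var bytesLoaded = 600000; // bytes of captions.vtt retrieved so far
setInterval(function () {
  var xhr = new XMLHttpRequest();
  xhr.open('GET', 'captions.vtt');
  xhr.responseType = 'arraybuffer';
  xhr.setRequestHeader('Range', 'bytes=' + bytesLoaded + '-');
  xhr.onload = function () {
    if (xhr.status === 206 && xhr.response.byteLength > 0) {
      bytesLoaded += xhr.response.byteLength;
      appendNewCues(new TextDecoder().decode(xhr.response)); // hypothetical parser
    }
  };
  xhr.send();
}, 10000);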


As for the change in WebVTT, the idea is that the following would be legal:

WEBVTT

00:00:00.000 --> next
This cue lasts until the next cue replaces it, i.e. 5.6 sec.

00:00:05.600 --> next
Same here, i.e. 4.4 sec.

00:00:10.000 --> next
If none follows, the cue lasts until the end of the video.


These three together would allow for live captioning through streaming text.
Comment 14 Ian 'Hixie' Hickson 2011-10-30 18:02:16 UTC
I don't understand how anyone would be able to fill in secondsMissed="".
Comment 15 Silvia Pfeiffer 2011-10-31 02:02:07 UTC
(In reply to comment #14)
> I don't understand how anyone would be able to fill in secondsMissed="".

The server knows how long the video has already been running, so it can fill it in. It does require a server component to keep track of how long the video has been streaming, but that information is typically available to the server, in particular when the server is recording the stream at the same time, as ustream and livestream do. For example, this ustream video starts at 15:45 rather than 0: http://www.ustream.tv/recorded/18146941/highlight/212855
Comment 16 Ian 'Hixie' Hickson 2011-11-02 19:47:57 UTC
That doesn't work, because it doesn't take into account the delay between the server writing the HTML page and the client parsing it.

Also, if you can update the media server such that it has such close control over the HTTP server, then why not just put the text tracks in the media directly?
Comment 17 Silvia Pfeiffer 2011-11-02 21:04:37 UTC
OK, I guess we can leave the attribute for now.

The idea of introducing a "next" or similar special value for end times into WebVTT is still a useful one. Could we still explore this, then?
Comment 18 Ian 'Hixie' Hickson 2011-11-03 16:25:57 UTC
"next" seems like merely syntactic sugar, so I don't know how helpful it would be (writing subtitles is, I hope, mostly done using editing tools). But please file a separate bug for that if it has good use cases.

As far as this bug goes, I find myself back at comment 8. I don't see how to do this in a sane way.
Comment 19 Ian 'Hixie' Hickson 2011-11-11 20:02:58 UTC
EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Did Not Understand Request
Change Description: no spec change
Rationale: Not sure how to resolve this with out-of-band text tracks.
Comment 20 Silvia Pfeiffer 2011-11-12 21:58:35 UTC
If you go to any of the videos on http://www.youtube.com/live , they all show, as the time at the beginning of the video, the actual time that has passed since the video started streaming. These times are easily matched to the time stamps of a streaming text service such as http://streamtext.net/.

In fact, YouTube has used a streaming text service before to caption live video: http://gigaom.com/video/youtube-launches-live-captions-at-google-io/. This worked exactly as I described: when a viewer started watching the video, they also received the streaming text with time stamps counted from the beginning of the transmission, which the video player was then able to synchronize.

It is possible to implement this fully in JavaScript with the current specifications so I am not going to reopen this. But I expect that with the progress that we make for WebRTC we may also need to revisit this issue and wanted to leave more details on this bug.
Comment 21 Ian 'Hixie' Hickson 2011-11-28 23:56:52 UTC
For those live streams, the video seems to include an internal time, which the captions presumably use as well. So that's rather different than what you were proposing.

For that kind of case, what we'd really want is not a static file to download, it would be a stream. You'd want to tell the server around when to start (presumably automatically), and you'd want to update the cues in real time, presumably throwing cues away that are before the earliest start time.

That doesn't seem too unreasonable.

To support things like inline live corrections, though, we'd probably want a different format than WebVTT, or at least some variant on it. e.g.:

--------------8<--------------
WEBVTT

00:00.000 --> 00:05.000
captions that were available before the user connected

01:00:00.000 --> 01:02:00.000
bla bla bla

LIVE--> align:middle rollup-size:3
<01:03:11.004> Bla bla <01:03:11.353> bla <rollup> <01:03:11.653> bal <01:03:11.710> <redoline> bla <01:03:12.004> bla bla...
-------------->8--------------

...where in a LIVE block, timestamps indicate when the following text should appear, <rollup> indicates when to go to the next line, <redoline> indicates that the current line should be deleted in favour of new text... This is just a strawman, I don't know what the right solution is here.

In particular, what should happen if you seek backwards to a point between when a line was rolled up and a correction was made? Should we show the incorrect text, or should the incorrect text be dropped in favour of the new text? When should the new text appear, should it appear at the time of the incorrect text? How should corrections be made? Should anything be allowed after a LIVE block, or is a LIVE block forcibly the last block of a file?
Comment 22 Silvia Pfeiffer 2011-12-04 01:53:49 UTC
(In reply to comment #21)
> For those live streams, the video seems to include an internal time, which the
> captions presumably use as well. So that's rather different than what you were
> proposing.

It's the time since the stream was started, and that's exactly what I was referring to. I don't understand how that makes a difference.


> For that kind of case, what we'd really want is not a static file to download,
> it would be a stream.

Agreed. That's what I meant by a "streaming text" file.


> You'd want to tell the server around when to start
> (presumably automatically), and you'd want to update the cues in real time,
> presumably throwing cues away that are before the earliest start time.

The streaming text file is a separate resource from the video and it contains cues with times synchronized with the beginning of the video. New cues are added at the end of the file. It can be either the server throwing away captions that are from before the earliest start time, or it can be the browser which knows the start time of the video and can tell which cues are in the past.


> That doesn't seem too unreasonable.

Cool.

> To support things like inline live corrections
>, though, we'd probably want a
> different format than WebVTT, or at least some variant on it. e.g.:
> 
> --------------8<--------------
> WEBVTT
> 
> 00:00.000 --> 00:05.000
> captions that were available before the user connected
> 
> 01:00:00.000 --> 01:02:00.000
> bla bla bla
> 
> LIVE--> align:middle rollup-size:3
> <01:03:11.004> Bla bla <01:03:11.353> bla <rollup> <01:03:11.653> bal
> <01:03:11.710> <redoline> bla <01:03:12.004> bla bla...
> -------------->8--------------
> 
> ...where in a LIVE block, timestamps indicate when the following text should
> appear, <rollup> indicates when to go to the next line, <redoline> indicates
> that the current line should be deleted in favour of new text... 

I'd like to keep the rollup and redo-line problems separate. The rollup problem is not specific to live captioning; it is a general problem. We have a discussion about it in the Text Tracks Community Group right now with different options, so I'd like to defer the problem there. Also, the redo-line problem is a new one that, again, should be solved independently from live captioning.

So, I just want to focus on the timing part of this problem, which is also a <track>-related problem, not just a WebVTT problem.

Your suggestion of introducing a "LIVE" cue without timing has one big problem: all captions for a video end up being in a single cue. That's not readable, not easy to edit, and hardly easy to re-stream: it would be difficult to determine what is still active when seeking to a specific offset.

My approach was to allow cues to be active until the next cue appears. (Incidentally, for rollup captions this could be adapted to being active until the next three cues appear.)

For example instead of this (endless) cue:

--
> LIVE--> align:middle rollup-size:3
> <01:03:11.004> Bla bla <01:03:11.353> bla <rollup> <01:03:11.653> bal
> <01:03:11.710> <redoline> bla <01:03:12.004> bla bla...
--

you would have something like:

--
01:03:11.004 --> NEXT(3) align:middle
<01:03:11.004> Bla bla <01:03:11.353> bla

01:03:11.653 --> NEXT(3)
bal <01:03:11.710> <redoline> bla <01:03:12.004> bla bla...
--

The start time of the third cue after the current cue should be easy to determine in code.
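
For example, a sketch of that lookup (assuming the cue list is kept sorted by start time; if fewer than n cues follow, the end falls back to the video's duration):

// Sketch only: resolve a NEXT(n) end time to the start time of the n-th cue
// after the current one, or to the end of the video if there is none.
function resolveNextEndTime(cues, currentIndex, n, videoDuration) {
  var target = currentIndex + n;
  return target < cues.length ? cues[target].startTime : videoDuration;
}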


> This is just a
> strawman, I don't know what the right solution is here.

Yeah, I am not 100% sure what is best either, but finding advantages/disadvantages with some markup is certainly good.


> In particular, what should happen if you seek backwards to a point between when
> a line was rolled up and a correction was made? Should we show the incorrect
> text, or should the incorrect text be dropped in favour of the new text?


During live streaming, no seeking should be possible. So, that problem would not occur. Usually for captions that were done live, there is some post-production. The post-production would typically remove all redone characters.

Also, events are handled as they are reached, so if the redone stuff is still there, then playback would exactly replicate the original changes again, which it should.
Comment 23 Philip Jägenstedt 2011-12-05 10:12:00 UTC
(In reply to comment #22)

> During live streaming, no seeking should be possible.

A live stream is basically just a resource which is not byte range seekable, but that doesn't mean that the client can't seek in the data they have buffered. I think Firefox already does this and we (Opera) want to do it.
Comment 24 Ian 'Hixie' Hickson 2011-12-05 22:20:30 UTC
Roll-up captions are, as far as I can tell, just one cue, that happens to have internal state (much like a karaoke cue, actually).

The problem with using the current cue concept for live cues is that you don't know when the cue will end (which it may well do before the end of the stream, e.g. if the live stream has prerecorded segments spliced in, e.g. ads), yet our cue format puts the end time before the cue text. Hence the desire for a different format for live cues.

Streaming definitely doesn't preclude seeking. Live cues definitely don't preclude streaming. The question of what to do with edited cues when seeking back seems quite valid to me.
Comment 25 Ian 'Hixie' Hickson 2011-12-05 22:38:15 UTC
Another issue we have to consider is what to do when the latency on the subtitle stream is such that it is several seconds behind the video stream. If we don't have a mechanism where the current time is transmitted continuously even when no captions are to be shown, there's not really any way I can see for the UA to know whether it is missing captions (and should stall the video) or not.
Comment 26 Ian 'Hixie' Hickson 2011-12-05 23:29:50 UTC
Interestingly, the thread to which I replied here:
   http://lists.w3.org/Archives/Public/public-texttracks/2011Dec/0033.html
...includes several anecdotal data points (and one reference to some research) suggesting that even for live captioning, we might want to focus on pop-up captions and not support roll-up captions.
Comment 27 Silvia Pfeiffer 2011-12-07 23:30:42 UTC
(In reply to comment #23)
> (In reply to comment #22)
> 
> > During live streaming, no seeking should be possible.
> 
> A live stream is basically just a resource which is not byte range seekable,
> but that doesn't mean that the client can't seek in the data they have
> buffered. I think Firefox already does this and we (Opera) want to do it.

That's ok. It's client-side seeking only. In this case, I would simply replay everything exactly how it was received before, including the edits.
Comment 28 Silvia Pfeiffer 2011-12-08 00:31:08 UTC
(In reply to comment #24)
> Roll-up captions are, as far as I can tell, just one cue, that happens to have
> internal state (much like a karaoke cue, actually).

Roll-up is a means of displaying lines of text. It doesn't matter whether they are in one cue or in many cues. Text from several cues should be able to be added to a previous cue and make it roll up.

 
> The problem with using the current cue concept for live cues is that you don't
> know when the cue will end (which it may well do before the end of the stream,
> e.g. if the live stream has prerecorded segments spliced in, e.g. ads), yet our
> cue format puts the end time before the cue text. Hence the desire for a
> different format for live cues.

Again: let me decouple rollup from this problem, since rollup is a display mechanism, not a timing mechanism. I want to focus on the timing issues.

Even for cues whose end time is unknown at the time of their creation, there is a point at which the end time becomes known. This is typically the appearance of another cue. Thus, a cue's end time can be set in relation to the start time of a future cue.

This is the problem that I tried to solve. It is independent of rollup, because this may happen with pop-on captions, too.
Comment 29 Philip Jägenstedt 2011-12-09 10:03:04 UTC
(In reply to comment #27)
> (In reply to comment #23)
> > (In reply to comment #22)
> > 
> > > During live streaming, no seeking should be possible.
> > 
> > A live stream is basically just a resource which is not byte range seekable,
> > but that doesn't mean that the client can't seek in the data they have
> > buffered. I think Firefox already does this and we (Opera) want to do it.
> 
> That's ok. It's client-side seeking only. In this case, I would simply replay
> everything exactly how it was received before, including the edits.

I don't understand; when we seek, we invalidate which cues are visible and start fresh. It has to be defined what happens with the changes you propose, since if the cues are mutated by later cues' arrival, one can't "simply replay everything exactly how it was received".
Comment 30 Silvia Pfeiffer 2011-12-11 09:40:54 UTC
(In reply to comment #29)
> (In reply to comment #27)
> > (In reply to comment #23)
> > > (In reply to comment #22)
> > > 
> > > > During live streaming, no seeking should be possible.
> > > 
> > > A live stream is basically just a resource which is not byte range seekable,
> > > but that doesn't mean that the client can't seek in the data they have
> > > buffered. I think Firefox already does this and we (Opera) want to do it.
> > 
> > That's ok. It's client-side seeking only. In this case, I would simply replay
> > everything exactly how it was received before, including the edits.
> 
> I don't understand, when we seek we invalidate what cues are visible and start
> fresh. It has to be defined what happens with the changes you propose, since if
> the cues are mutated by later cues' arrival one can't "simply replay everything
> exactly how it was received".

OK, this is way off topic from what the bug was originally registered for. But I'll run with it.

Ian suggested introducing an "editing" command into cues called <redoline>. This is a command that, along the timeline, changes something that has been displayed before. For example, a line that starts with the text "I ma hnugry" would be erased a few seconds later with the <redoline> command and replaced with the text "I am hungry". As I understand it, the markup would look something like this:

LIVE--> rollup-size:3
<00:00:00.000> I ma hnugry <00:00:05.000> <redoline> I am hungry

This specifies a clear display order, even with the changes.

So, when you seek back to any time between 0 and 5 sec, you display the "I ma hnugry" text again, and for any time after 5 sec, you display the "I am hungry" text. This is what I mean by "replay everything exactly how it was received".
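
Put as code, the rule for this example would be (a sketch only, with the times hard-coded from the snippet above):

// Sketch only: on seeking, show whatever text was on screen at the target
// time, replaying the correction exactly as it originally happened.
function textAtTime(seconds) {
  return seconds < 5.0 ? 'I ma hnugry' : 'I am hungry';
}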

Hope that clarifies what I meant.
Comment 31 Philip Jägenstedt 2011-12-11 11:02:49 UTC
(In reply to comment #30)

> Hope that clarifies what I meant.

It does, although what makes sense here strongly depends on what the in-memory model for the cues is supposed to be. If <redoline> mutates the existing cues, then it'd be (much) simpler to just show the correct version. The alternative is that we have special rendering rules for live cues that collapse lines.
Comment 32 Ian 'Hixie' Hickson 2012-01-31 22:26:26 UTC
(In reply to comment #30)
> 
> LIVE--> rollup-size:3
> <00:00:00.000> I ma hnugry <00:00:05.000> <redoline> I am hungry
> 
> So, when you seek back to any time between 0 and 5 sec, you display the "I ma
> hnugry" text again, and for any time after 5 sec, you display the "I am hungry"
> text. This is what I mean by "replay everything exactly how it was received".

That seems like a net worse user experience than only showing the correctly-spelt text.
Comment 33 Silvia Pfeiffer 2012-02-01 11:14:34 UTC
(In reply to comment #32)
> (In reply to comment #30)
> > 
> > LIVE--> rollup-size:3
> > <00:00:00.000> I ma hnugry <00:00:05.000> <redoline> I am hungry
> > 
> > So, when you seek back to any time between 0 and 5 sec, you display the "I ma
> > hnugry" text again, and for any time after 5 sec, you display the "I am hungry"
> > text. This is what I mean by "replay everything exactly how it was received".
> 
> That seems like a net worse user experience than only showing the
> correctly-spelt text.

I think it's unfaithful to have the browser make changes to the presentation when seeking back to it. What if the viewer has just seen something funny being typed and wants to rewind to it to show a friend who missed it?

I think we should leave such changes up to a content editor who is going to re-publish the live streamed video & captions with improvements, be that fixes to the video, or fixes to the captions.
Comment 34 Philip Jägenstedt 2012-02-06 12:17:42 UTC
(In reply to comment #33)
> (In reply to comment #32)
> > (In reply to comment #30)
> > > 
> > > LIVE--> rollup-size:3
> > > <00:00:00.000> I ma hnugry <00:00:05.000> <redoline> I am hungry
> > > 
> > > So, when you seek back to any time between 0 and 5 sec, you display the "I ma
> > > hnugry" text again, and for any time after 5 sec, you display the "I am hungry"
> > > text. This is what I mean by "replay everything exactly how it was received".
> > 
> > That seems like a net worse user experience than only showing the
> > correctly-spelt text.
> 
> I think it's unfaithful to have the browser make changes to the presentation
> when seeking back to it. What if the viewer has just seen something funny being
> typed and wants to rewind to it to show his friend who missed? 
> 
> I think we should leave such changes up to a content editor who is going to
> re-publish the live streamed video & captions with improvements, be that fixes
> to the video, or fixes to the captions.

If so, what should the in-memory model be? What do you see via the DOM APIs?
Comment 35 Silvia Pfeiffer 2012-02-06 21:24:09 UTC
(In reply to comment #34)
> > > > 
> > > > LIVE--> rollup-size:3
> > > > <00:00:00.000> I ma hnugry <00:00:05.000> <redoline> I am hungry
> 
> If so, what should the in-memory model be? What do you see via the DOM APIs?

We haven't mapped <redoline> to anything in HTML5 yet. It should probably map to a "display: none" on the previous span and then create a new one at the given time instant.
Comment 36 Ian 'Hixie' Hickson 2012-04-19 23:44:13 UTC
You really want to optimise for making fun of transcription errors rather than for caption quality?
Comment 37 Silvia Pfeiffer 2012-04-20 01:26:51 UTC
I'd actually prefer that we didn't get sidetracked in this bug with introducing markup for editing cues. I don't think there is a big requirement for editing on the Web, since for live captioning & streaming we can delay the transmission to make sure the captioner has finished fixing their transcript.

What I was trying to focus on in this bug was not the WebVTT markup, but the fact that WebVTT files may change while the video is playing back, that this is a legitimate use case for live streaming, and that in this case the browser needs to reload the WebVTT file frequently.

The problems that Philip lists can be addressed:

#1 would make use of the startDate together with the currentTime to know when to display cues

#2 every time the WebVTT file has been loaded, the browser returns to HAVE_METADATA - it doesn't need to wait

In fact, the server could be clever and provide newly connected clients with just the cues that relate to the part of the video they are currently looking at. Though in this case it's almost identical to loading the captions via JS.

What the browser still doesn't know is when it has to reload the WebVTT file. That's why I suggested a @live attribute which would cause the browser to frequently reload the WebVTT file.

I can see three different ways to trigger a reload:
* whenever the browser runs out of cues for a video (in fact, this could likely be useful in general)
* when a startDate is given on the video and the browser runs out of cues
* when a @live attribute is given on the video and the browser runs out of cues

Also, we'd need to limit the number of reload tries if there are no changes to the WebVTT file.
Comment 38 Ian 'Hixie' Hickson 2012-06-26 20:12:08 UTC
I had assumed that live subtitling would necessarily need to include support for live subtitling correction. If this is not the case, that changes the design space substantially. If anything, it makes the issue in comment 25 more critical, since we can no longer show the cues incrementally but need to know when we have the whole cue and when we're missing the next cue.

I don't understand what you mean about reloading the WebVTT file. Surely the only sane way to do live streaming is to stream.
Comment 39 Silvia Pfeiffer 2012-06-29 04:46:16 UTC
(In reply to comment #38)
> I had assumed that live subtitling would necessarily need to include support
> for live subtitling correction. If this is not the case, that changes the
> design space substantially.

It is not the case. It's a feature that we may introduce at a later stage, but not one I've seen used anywhere on the Web in live streamed captions.

> If anything, it makes the issue in comment 25 more
> critical, since we can no longer show the cues incrementally but need to know
> when we have the whole cue and when we're missing the next cue.

The video continues to provide the timeline, of course. The browser can only do best effort. If a cue has to be displayed at a certain time because the video's time has reached it, but the browser has not yet received it because the latency on the subtitle stream is higher than on the video stream, then it can't be displayed (the browser wouldn't even know it existed). However, video generally requires more bandwidth than text tracks, so I don't see this problem occurring frequently.
 
> I don't understand what you mean about reloading the WebVTT file. Surely the
> only sane way to do live streaming is to stream.

I don't understand what you mean about "to stream" it. Streaming is defined for audio and video as consecutively loading byte ranges. Are you implying that this also applies to text files? And that therefore this use case is already covered?
Comment 40 contributor 2012-07-18 15:59:12 UTC
This bug was cloned to create bug 18029 as part of operation convergence.
Comment 41 Silvia Pfeiffer 2014-06-06 06:03:21 UTC
*** Bug 24161 has been marked as a duplicate of this bug. ***
Comment 42 Travis Leithead [MSFT] 2016-04-15 22:49:47 UTC
HTML5.1 Bugzilla Bug Triage: It looks like, after the discussion, several ideas were put forth, but nothing materialized as a concrete proposal. I suspect this idea may have legs, but it needs more refinement. Rather than continuing that discussion in this bug, this should grow out of an incubation effort.


This bug constitutes a request for a new feature of HTML. Our current guideline, rather than tracking such requests as bugs or issues, is to create a proposal for the desired behavior, or at least a sketch of what is wanted (much of which is probably contained in this bug), and start the discussion/proposal in the WICG (https://www.w3.org/community/wicg/). As your idea gains interest and momentum, it may be brought back into HTML through the Intent to Migrate process (https://wicg.github.io/admin/intent-to-migrate.html).