This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 25910 - [WebVTT] A way to correct cues
Summary: [WebVTT] A way to correct cues
Status: NEW
Alias: None
Product: TextTracks CG
Classification: Unclassified
Component: WebVTT
Version: unspecified
Hardware: All All
Importance: P2 normal
Target Milestone: ---
Assignee: This bug has no owner yet - up for the taking
QA Contact: Web Media Text Tracks CG
URL:
Whiteboard: v2
Keywords:
Depends on:
Blocks:
 
Reported: 2014-05-28 13:59 UTC by Brendan Long
Modified: 2014-10-10 09:07 UTC
CC List: 3 users

Description Brendan Long 2014-05-28 13:59:27 UTC
*Why*

In streaming text tracks, we need a way to fix incorrect cues. Some 
examples:

  * In live TV, people type the captions in by hand
    <http://en.wikipedia.org/wiki/Closed_captioning#Television_and_video> shortly
    before you see them. If they make a mistake, we need a way to fix it.
  * CEA-608 and CEA-708 captions don't start with a convenient startTime
    --> endTime block like WebVTT does. A caption ends when we get a
    command that makes it stop displaying. If we want to transcode to
    WebVTT in real-time, we have to either wait until the caption is
    over to translate it (delaying the stream by some arbitrary time in
    the hope that it will be long enough), or we need to start a caption
    immediately with a guess of the end time and then rewrite it once we
    know the correct end time (or rewrite it to extend the end time
    until we find the correct one).

*How*

The solution I'm proposing is that if we see two cues with the same id, 
the earlier cue will be removed.

    some-id
    00:00:00 --> 00:00:30
    This is an xeample

    some-id
    00:00:00 --> 00:00:10
    This is an example

In this example, the text "This is an example" will be displayed for 10 
seconds starting at time 0.
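
A minimal sketch of the replace-by-id rule proposed above, written in Python for illustration only. The `Cue` class and `apply_cues` function are hypothetical names, not part of WebVTT or any parser, and the choice to never replace cues with an empty id is an assumption the proposal does not spell out.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    id: str
    start: float  # seconds
    end: float    # seconds
    text: str


def apply_cues(cues):
    """Return the cue list after applying the replace-by-id rule.

    A later cue with a non-empty id removes any earlier cue with the same id.
    Assumption: cues with an empty id are never replaced.
    """
    active = []
    for cue in cues:
        if cue.id:
            # Drop the earlier cue(s) that share this id.
            active = [c for c in active if c.id != cue.id]
        active.append(cue)
    return active


cues = [
    Cue("some-id", 0.0, 30.0, "This is an xeample"),
    Cue("some-id", 0.0, 10.0, "This is an example"),
]
result = apply_cues(cues)
```

With the two cues from the example, only the corrected cue should remain, with its 10-second end time.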

*Why This Solution*

This solution is nice because the syntax is simple and easy to 
understand, and it's powerful enough to rewrite any cue in any way you 
could possibly want, because the new cue completely replaces the old one.

*Arguments against*

This isn't particularly efficient. If you just want to change the time, 
you need to send the entire updated cue, instead of just the change.

I don't think this is a big deal, because even the most heavily edited 
subtitle file will be orders of magnitude smaller than the accompanying 
video.

See:

http://lists.w3.org/Archives/Public/public-html/2014May/0020.html

I originally proposed doing this in HTML, but Philip convinced me that that's not a good idea. Having the WebVTT parser do it should be easy, though.
Comment 1 Philip Jägenstedt 2014-05-28 14:59:19 UTC
If done as part of the parser, I think an explicit syntax for this would be safer, since if you're copying and pasting some cues with numerical ids (from an .srt file) it's not very hard to get duplicate ids.

I'm also wondering if this could be solved in the same way as the "unknown end time" problem. Something like this has been proposed in some bug or mail thread:

    00:00:00 --> auto
    This is an xeample

    00:00:00 --> 00:00:10
    This is an example

Basically, when a new cue is added, any existing cue with the auto end time would be given the same end time as the start time of the new cue, in this case resulting in the "xeample" cue getting startTime==endTime==0, but it could be something different.
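
As a rough sketch of that rule (Python, illustration only): `AutoCue` and `add_cue` are hypothetical names, and an end time of `None` is assumed here to stand in for the proposed "auto" keyword.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AutoCue:
    start: float
    end: Optional[float]  # None models the proposed "auto" end time
    text: str


def add_cue(track, new_cue):
    """Append new_cue, first closing every cue whose end time is still unknown."""
    for cue in track:
        if cue.end is None:
            # The pending cue ends where the new cue starts.
            cue.end = new_cue.start
    track.append(new_cue)


track = []
add_cue(track, AutoCue(0.0, None, "This is an xeample"))
add_cue(track, AutoCue(0.0, 10.0, "This is an example"))
```

Here the "xeample" cue ends up with a start and end time of 0, as described, while the corrected cue keeps its explicit end time.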

I'm not sure how to end a cue without actually having a new one, though.

Are we looking for a solution that works for in-band, out-of-band, or both? For in-band it's probably hard to represent a non-numeric (auto) end time. For out-of-band, it would be blocked on supporting live streaming at all:
https://www.w3.org/Bugs/Public/show_bug.cgi?id=14104
Comment 2 Brendan Long 2014-05-28 15:04:56 UTC
(In reply to Philip Jägenstedt from comment #1)
> If done as part of the parser, I think an explicit syntax for this would be
> safer, since if you're copying and pasting some cues with numerical ids
> (from an .srt file) it's not very hard to get duplicate ids.

Aren't SRT cues supposed to have unique IDs though?

> I'm also wondering if this could be solved in the same way as the "unknown
> end time" problem. Something like this has been proposed in some bug or mail
> thread:
> 
>     00:00:00 --> auto
>     This is an xeample
> 
>     00:00:00 --> 00:00:10
>     This is an example
> 
> Basically, when a new cue is added, any existing cue with the auto end time
> would be given the same end time as the start time of the new cue, in this
> case resulting in the "xeample" cue getting startTime==endTime==0, but it
> could be something different.

That would work.

> I'm not sure how to end a cue without actually having a new one, though.

We could allow empty cues, which cause any "auto" cues to end, but don't generate JS TextTrackCues:

    00:00:00 --> auto
    Example cue

    00:00:10 --> 00:00:10


This might also be a solution for the live cue thing, since we could throw empty heartbeat cues in the file every couple of seconds so the UA knows where the cue file ends.
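
A sketch of how the empty-cue idea might behave (Python, illustration only). `StreamCue` and `feed` are hypothetical names, `None` is assumed to model "auto", and treating whitespace-only text as empty is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StreamCue:
    start: float
    end: Optional[float]  # None models the proposed "auto" end time
    text: str


def feed(track: List[StreamCue], cue: StreamCue) -> None:
    # Any cue, empty or not, ends the pending "auto" cues at its start time.
    for pending in track:
        if pending.end is None:
            pending.end = cue.start
    # Empty cues only act as terminators; they are not exposed as cues.
    if cue.text.strip():
        track.append(cue)


stream_track: List[StreamCue] = []
feed(stream_track, StreamCue(0.0, None, "Example cue"))
feed(stream_track, StreamCue(10.0, 10.0, ""))
```

After both cues are fed in, the track holds only "Example cue", ending at 10 seconds. Note that in this sketch a heartbeat cue would also end any pending "auto" cue.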

> Are we looking for a solution that works for in-band, out-of-band, or both?
> For in-band it's probably hard to represent a non-numeric (auto) end time.
> For out-of-band, it would be blocked on supporting live streaming at all:
> https://www.w3.org/Bugs/Public/show_bug.cgi?id=14104

At the moment, this would only apply to in-band, but I think it's safe to assume that we'll eventually have live out-of-band tracks (hopefully?).
Comment 3 Silvia Pfeiffer 2014-06-01 10:33:47 UTC
I think proposals in #c1 and #c2 can work.

We'd have to add this functionality to the HTML TextTrack spec, though, since text track cues have a specified end time there and raise enter, exit and cuechange events.
Comment 4 Brendan Long 2014-06-02 14:43:35 UTC
(In reply to Silvia Pfeiffer from comment #3)
> I think proposals in #c1 and #c2 can work.
> 
> We'd have to add this functionality to the HTML TextTrack spec, though,
> since text track cues have a specified end time there and raise enter, exit
> and cuechange events.

It seems like the only change we would need to make is allow endTime = null (meaning "we don't know yet").
Comment 5 Silvia Pfeiffer 2014-06-02 22:26:22 UTC
(In reply to Brendan Long from comment #4)
> (In reply to Silvia Pfeiffer from comment #3)
> > I think proposals in #c1 and #c2 can work.
> > 
> > We'd have to add this functionality to the HTML TextTrack spec, though,
> > since text track cues have a specified end time there and raise enter, exit
> > and cuechange events.
> 
> It seems like the only change we would need to make is allow endTime = null
> (meaning "we don't know yet").

Not only. A subsequent cue may also need to change the endTime field of the previous cue with endTime="null" (or "auto" or whichever we choose)? Also a question is whether we want to raise events for such a change of cue content on both cues, the one that is being changed and the one that introduces the change?
Comment 6 Brendan Long 2014-06-03 15:20:54 UTC
(In reply to Silvia Pfeiffer from comment #5)
> Not only. A subsequent cue may also need to change the endTime field of the
> previous cue with endTime="null" (or "auto" or whichever we choose)?

This reminded me of another potential issue. What if we have multiple cues going at the same time? For example, we could have two TV characters talking at the same time, with captions to be displayed on different parts of the screen. We wouldn't want one of them to just go away because there was another cue that started right after.

It's possible we could handle this by making everything on the screen into one cue, but I'm not sure if that would work well with roll-up captions.

> Also a
> question is whether we want to raise events for such a change of cue content
> on both cues, the one that is being changed and the one that introduces the
> change?

It seems reasonable to have an event to show that we know what the end time is. I don't think we need to fire an event on the new cue though.
Comment 7 Philip Jägenstedt 2014-06-04 14:30:59 UTC
(In reply to Brendan Long from comment #2)
> (In reply to Philip Jägenstedt from comment #1)
> > If done as part of the parser, I think an explicit syntax for this would be
> > safer, since if you're copying and pasting some cues with numerical ids
> > (from an .srt file) it's not very hard to get duplicate ids.
> 
> Aren't SRT cues supposed to have unique IDs though?

Sure, but it seems easy enough to mess up with some manual editing.

> > Are we looking for a solution that works for in-band, out-of-band, or both?
> > For in-band it's probably hard to represent a non-numeric (auto) end time.
> > For out-of-band, it would be blocked on supporting live streaming at all:
> > https://www.w3.org/Bugs/Public/show_bug.cgi?id=14104
> 
> At the moment, this would only apply to in-band, but I think it's safe to
> assume that we'll eventually have live out-of-band tracks (hopefully?).

I guess, eventually.

For in-band, would it be a problem to represent non-numeric end times like "auto"? I haven't really thought about this. If that's a problem, then the id, settings and cue text itself are the remaining places to throw the magic at.
Comment 8 Philip Jägenstedt 2014-06-04 14:33:55 UTC
(In reply to Brendan Long from comment #6)
> (In reply to Silvia Pfeiffer from comment #5)
> > Not only. A subsequent cue may also need to change the endTime field of the
> > previous cue with endTime="null" (or "auto" or whichever we choose)?
> 
> This reminded me of another potential issue. What if we have multiple cues
> going at the same time? For example, we could have two TV characters talking
> at the same time, with captions to be displayed on different parts of the
> screen. We wouldn't want one of them to just go away because there was
> another cue that started right after.

Do you ever have multiple cues in live captioning? If so then that does complicate matters.
Comment 9 Brendan Long 2014-06-05 17:14:39 UTC
(In reply to Philip Jägenstedt from comment #8)
> (In reply to Brendan Long from comment #6)
> > (In reply to Silvia Pfeiffer from comment #5)
> > > Not only. A subsequent cue may also need to change the endTime field of the
> > > previous cue with endTime="null" (or "auto" or whichever we choose)?
> > 
> > This reminded me of another potential issue. What if we have multiple cues
> > going at the same time? For example, we could have two TV characters talking
> > at the same time, with captions to be displayed on different parts of the
> > screen. We wouldn't want one of them to just go away because there was
> > another cue that started right after.
> 
> Do you ever have multiple cues in live captioning? If so then that does
> complicate matters.

I think it happens fairly often. For live captions I guess we usually would do roll-up, but it would significantly complicate things if we needed to do "the entire screen is one cue" for simple captions, and "each line is a cue and the end times don't really matter" for roll-up captions.
Comment 10 Philip Jägenstedt 2014-06-05 22:03:33 UTC
For in-band, would it work if the (HTML) spec simply allowed removing cues and adding new ones at any time, so that when a cue is updated it is removed and then added back with the new text and settings?

The alternative is to allow updating existing cues. That seems slightly worse to me, because generic scripts doing things with cues are less likely to handle that situation (if events were added for it) than new cues being added.

I'm thinking that if that works for in-band, then we can maybe add syntax for out-of-band to do the same thing if the need arises, and after live out-of-band WebVTT works at all.
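
To illustrate the remove-then-add idea, here is a toy model in Python. `TrackModel` and `update_cue` are hypothetical names and only stand in for a text track's cue list; this is not the HTML API. The point is that a script watching the track would only ever see removals and additions, never an in-place change.

```python
class TrackModel:
    """Toy stand-in for a text track's cue list; not the real HTML API."""

    def __init__(self):
        self.cues = []
        self.events = []  # (event name, cue text) pairs, in order

    def add_cue(self, cue):
        self.cues.append(cue)
        self.events.append(("add", cue["text"]))

    def remove_cue(self, cue):
        self.cues.remove(cue)
        self.events.append(("remove", cue["text"]))


def update_cue(track, old_cue, new_cue):
    # An "update" is expressed as removing the old cue and adding the new one.
    track.remove_cue(old_cue)
    track.add_cue(new_cue)


model = TrackModel()
wrong = {"start": 0.0, "end": 30.0, "text": "This is an xeample"}
fixed = {"start": 0.0, "end": 10.0, "text": "This is an example"}
model.add_cue(wrong)
update_cue(model, wrong, fixed)
```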
Comment 11 Brendan Long 2014-06-05 22:10:02 UTC
(In reply to Philip Jägenstedt from comment #10)
> For in-band, would it work if the (HTML) spec simply allowed removing cues
> and adding new ones at any time, so that when a cue is updated it is removed
> and then added back with the new text and settings?

I'm not aware of any reason we can't do this right now, it just makes it much harder for us to implement. If WebVTT supported everything we need, then our media engine could just transcode to WebVTT and we can dump that directly into the browser. If WebVTT doesn't support this, then we need to layer some extra information into our media pipeline and have a special code path for detecting these cues and replacing them based on rules for the original format (and we need a way to detect what the original format was..).

Basically, it makes it much harder to keep decoding in the media playback library.

> The alternative is to allow updating existing cues. That seems slightly
> worse to me, because generic scripts doing things with cues are less likely
> to handle that situation (if events were added for it) than new cues being
> added.

I can't decide which is better. I don't think it matters much to us though.

> I'm thinking that if that works for in-band, then we can maybe add syntax
> for out-of-band to do the same thing if the need arises, and after live
> out-of-band WebVTT works at all.

See above for why it would be useful if we can just fully translate to WebVTT. I also can't imagine how W3C could get away with exclusively pushing a format that can't handle live playback at all. It seems like at some point this *has* to work (or we need to allow another cue format).
Comment 12 Silvia Pfeiffer 2014-06-05 23:59:03 UTC
(In reply to Philip Jägenstedt from comment #8)
> (In reply to Brendan Long from comment #6)
> > (In reply to Silvia Pfeiffer from comment #5)
> > > Not only. A subsequent cue may also need to change the endTime field of the
> > > previous cue with endTime="null" (or "auto" or whichever we choose)?
> > 
> > This reminded me of another potential issue. What if we have multiple cues
> > going at the same time? For example, we could have two TV characters talking
> > at the same time, with captions to be displayed on different parts of the
> > screen. We wouldn't want one of them to just go away because there was
> > another cue that started right after.
> 
> Do you ever have multiple cues in live captioning? If so then that does
> complicate matters.

It could be done with regions, where you only replace/end the cues in the same region.
Comment 13 Philip Jägenstedt 2014-06-06 21:00:38 UTC
(In reply to Brendan Long from comment #11)
> (In reply to Philip Jägenstedt from comment #10)
> > For in-band, would it work if the (HTML) spec simply allowed removing cues
> > and adding new ones at any time, so that when a cue is updated it is removed
> > and then added back with the new text and settings?
> 
> I'm not aware of any reason we can't do this right now, it just makes it
> much harder for us to implement. If WebVTT supported everything we need,
> then our media engine could just transcode to WebVTT and we can dump that
> directly into the browser. If WebVTT doesn't support this, then we need to
> layer some extra information into our media pipeline and have a special code
> path for detecting these cues and replacing them based on rules for the
> original format (and we need a way to detect what the original format was..).
> 
> Basically, it makes it much harder to keep decoding in the media playback
> library.

I don't follow, are you saying that "removing cues and adding new ones at any time" would work, or that it wouldn't? It sounds a bit like you want to transcode anything your media engine supports into an out-of-band WebVTT file, can you clarify exactly what you're trying to do?
Comment 14 Philip Jägenstedt 2014-06-06 21:21:48 UTC
(In reply to Silvia Pfeiffer from comment #12)
> (In reply to Philip Jägenstedt from comment #8)
> > (In reply to Brendan Long from comment #6)
> > > (In reply to Silvia Pfeiffer from comment #5)
> > > > Not only. A subsequent cue may also need to change the endTime field of the
> > > > previous cue with endTime="null" (or "auto" or whichever we choose)?
> > > 
> > > This reminded me of another potential issue. What if we have multiple cues
> > > going at the same time? For example, we could have two TV characters talking
> > > at the same time, with captions to be displayed on different parts of the
> > > screen. We wouldn't want one of them to just go away because there was
> > > another cue that started right after.
> > 
> > Do you ever have multiple cues in live captioning? If so then that does
> > complicate matters.
> 
> It could be done with regions, where you only replace/end the cues in the
> same region.

Unless you're already using regions for other reasons, it seems like you'd need a new region per cue, since you can't know if the currently showing cues are going to end together or not. Using the id like in Brendan's original suggestion would be more direct in that case.
Comment 15 Silvia Pfeiffer 2014-06-07 04:32:14 UTC
(In reply to Philip Jägenstedt from comment #14)
> > > > This reminded me of another potential issue. What if we have multiple cues
> > > > going at the same time? For example, we could have two TV characters talking
> > > > at the same time, with captions to be displayed on different parts of the
> > > > screen. We wouldn't want one of them to just go away because there was
> > > > another cue that started right after.
> > > 
> > > Do you ever have multiple cues in live captioning? If so then that does
> > > complicate matters.
> > 
> > It could be done with regions, where you only replace/end the cues in the
> > same region.
> 
> Unless you're already using regions for other reasons, it seems like you'd
> need a new region per cue, since you can't know if the currently showing
> cues are going to end together or not. Using the id like in Brendan's
> original suggestion would be more direct in that case.

I was referring to the use case Brendan brought up: two cues on screen at the same time in different screen areas that need replacement independently. That is exactly the same use case that regions satisfy, not a different one. And you would not need a region per cue to satisfy this, since the cues are already grouped into regions because of the same reason why they would replace each other.

For example, with two people speaking on screen and talking over the top of each other, you would have a region on the left (possibly scrolling) and one on the right (also possibly scrolling). A cue with an "auto" end time in the left region would only be terminated by another cue in the left region, while one in the right region is only terminated by another cue in the right region.
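
A sketch of region-scoped termination (Python, illustration only). `RegionCue` and `add_region_cue` are hypothetical names, regions are reduced to plain string labels, and `None` is assumed to model the "auto" end time.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RegionCue:
    region: str
    start: float
    end: Optional[float]  # None models the proposed "auto" end time
    text: str


def add_region_cue(track: List[RegionCue], new_cue: RegionCue) -> None:
    # Only pending cues in the same region are ended by the new cue.
    for cue in track:
        if cue.end is None and cue.region == new_cue.region:
            cue.end = new_cue.start
    track.append(new_cue)


region_track: List[RegionCue] = []
add_region_cue(region_track, RegionCue("left", 0.0, None, "Speaker A, line 1"))
add_region_cue(region_track, RegionCue("right", 1.0, None, "Speaker B, line 1"))
add_region_cue(region_track, RegionCue("left", 4.0, None, "Speaker A, line 2"))
```

In this run the first left-region cue is ended at 4 seconds by the second left-region cue, while the right-region cue stays open.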
Comment 16 Brendan Long 2014-06-08 19:32:48 UTC
(In reply to Silvia Pfeiffer from comment #12)
> (In reply to Philip Jägenstedt from comment #8)
> > (In reply to Brendan Long from comment #6)
> > > (In reply to Silvia Pfeiffer from comment #5)
> > > > Not only. A subsequent cue may also need to change the endTime field of the
> > > > previous cue with endTime="null" (or "auto" or whichever we choose)?
> > > 
> > > This reminded me of another potential issue. What if we have multiple cues
> > > going at the same time? For example, we could have two TV characters talking
> > > at the same time, with captions to be displayed on different parts of the
> > > screen. We wouldn't want one of them to just go away because there was
> > > another cue that started right after.
> > 
> > Do you ever have multiple cues in live captioning? If so then that does
> > complicate matters.
> 
> It could be done with regions, where you only replace/end the cues in the
> same region.

I'll have to check, but I think this would work for our case.
Comment 17 Brendan Long 2014-06-08 19:35:48 UTC
(In reply to Philip Jägenstedt from comment #13)
> (In reply to Brendan Long from comment #11)
> > (In reply to Philip Jägenstedt from comment #10)
> > > For in-band, would it work if the (HTML) spec simply allowed removing cues
> > > and adding new ones at any time, so that when a cue is updated it is removed
> > > and then added back with the new text and settings?
> > 
> > I'm not aware of any reason we can't do this right now, it just makes it
> > much harder for us to implement. If WebVTT supported everything we need,
> > then our media engine could just transcode to WebVTT and we can dump that
> > directly into the browser. If WebVTT doesn't support this, then we need to
> > layer some extra information into our media pipeline and have a special code
> > path for detecting these cues and replacing them based on rules for the
> > original format (and we need a way to detect what the original format was..).
> > 
> > Basically, it makes it much harder to keep decoding in the media playback
> > library.
> 
> I don't follow, are you saying that "removing cues and adding new ones at
> any time" would work, or that it wouldn't? It sounds a bit like you want to
> transcode anything your media engine supports into an out-of-band WebVTT
> file, can you clarify exactly what you're trying to do?

We have two use-cases:

  - We want to translate CEA-708 into WebVTT in-band so we can re-use the WebVTT rendering code, and we want to translate into WebVTT without anything extra so we can re-use existing media pipelines (like GStreamer's plugins).

  - We want a path forward where we can translate an MPEG-TS stream into WebM + WebVTT on the server-side in a live stream.

The problem with having special code to remove cues and add new ones is that it wouldn't work in the second case, and it would work poorly in the first (because we'd have to implement CEA-708 to WebVTT translation in the UA instead of the media engine).
Comment 18 Philip Jägenstedt 2014-10-10 09:07:44 UTC
I'm marking this as v2. Since we don't have a spec yet, there's no chance that this will be interoperably implemented and shipping soon.