This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 10320 - [WebSRT voice] Allow an arbitrary string as the voice for forwards compatibility
Summary: [WebSRT voice] Allow an arbitrary string as the voice for forwards compatibility
Status: RESOLVED FIXED
Alias: None
Product: WHATWG
Classification: Unclassified
Component: HTML
Version: unspecified
Hardware: Other other
Importance: P3 normal
Target Milestone: Unsorted
Assignee: Ian 'Hixie' Hickson
QA Contact: contributor
URL: http://www.whatwg.org/specs/web-apps/...
Whiteboard:
Keywords: NotInW3CSpecYet
Depends on:
Blocks: 10306 10619 10746 10750
Reported: 2010-08-09 12:09 UTC by contributor
Modified: 2012-07-19 07:22 UTC

Description contributor 2010-08-09 12:09:49 UTC
Section: http://www.whatwg.org/specs/web-apps/current-work/complete.html#parsing-0

Comment:
Allow an arbitrary string as the voice for forwards compatibility

Posted from: 83.218.67.122
Comment 1 Philip Jägenstedt 2010-08-09 12:12:17 UTC
The WebSRT parser has this step: "Leave the cue's timed track cue voice identifier set to the empty string."

Instead, simplify the parser by allowing an arbitrary string as voice. This means that future revisions of WebSRT with new voices will be styleable in browsers implementing this revision of the spec.
Comment 2 Philip Jägenstedt 2010-08-09 12:13:54 UTC
I don't mean to suggest that it should be valid syntax, just that the parser and CSS glue handle it.
Comment 3 Ian 'Hixie' Hickson 2010-09-10 22:53:24 UTC
This would mean that if we introduced a new element, it would end up being considered the voice in older UAs... I guess that's not too bad...

Might be better just to go straight to have a generic voice syntax though. <$voice> maybe? With some predefined ones? (Predefined as in having default styles, like $narrator could be italics?) I don't know what a good syntax would be though. <:voice> is bad because it clashes with pseudo-class syntax in CSS, which might box us in later. *voice, !voice and +voice clash with other parts of CSS syntax. [voice] wouldn't be hidden in older SRT UAs. Any other suggestions? Maybe a V:voice setting on the timings line?
Comment 4 Philip Jägenstedt 2010-09-13 08:11:14 UTC
Is there a reason to differentiate between predefined voices and predefined tags? Given that they use the same syntax, I think it would be simplest for both implementors and authors to unify the handling of them, simply calling them tags. (Any tag can have a predefined style, and perhaps <narrator> and <i> are both italic by default.)

As for "extensibility", how about saying that any tag name with a - (hyphen) in it may be used by authors for styling purposes. If they use this to style specific speakers, that's fine. <-hixie>, <ian-says> or <i-h> would all be ok to style parts spoken by you.

I think it's OK that CSS glue also works for new tag names without hyphens, but that validators should warn about it. That makes it easier to introduce new tags and style them in older UAs. That would be just like HTML, and hopefully people won't go overboard in abusing it.
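
The hyphen rule proposed in this comment could be checked roughly as follows. This is only a sketch in Python: the function name classify_tag and the KNOWN_TAGS set are hypothetical, and the set merely lists tag names mentioned elsewhere in this thread, not an authoritative list from the spec.

```python
# Assumption: a small set of predefined tag names taken from this thread.
KNOWN_TAGS = {"b", "i", "rt", "narrator", "sound"}


def classify_tag(name):
    """Classify a cue tag name under the proposed hyphen convention."""
    if name in KNOWN_TAGS:
        return "known"
    if "-" in name:
        # Author-defined names contain a hyphen and are fine for styling.
        return "author-defined"
    # Still styleable via the CSS glue, but a validator would warn here.
    return "unknown"
```

Under this sketch, <-hixie>, <ian-says> and <i-h> are all treated as author-defined, while a new hyphen-less name would only trigger a validator warning.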
Comment 5 Ian 'Hixie' Hickson 2010-09-28 00:47:26 UTC
If we make voices be just regular WebSRT cue spans (like <b> or <i>), we also introduce a number of new problems:

 - what does it mean to nest voices?
 - what does it mean to have voices nested inside <i> or <b> or worse still, <rt>?
 - what does it mean to have a cue with multiple voices, if we ever have an API that lets you filter on voice?

Still, there are definitely use cases for having a cue with different voices... British subtitles for example often have different colours per speaker and will mash them all into one paragraph. And I agree that it would be more usable to have speakers be named rather than numbered.

We could force them to be only at the top level of a cue, with, as you suggest, a special syntax for custom ones, and the default ones being "just tags". The reason they weren't before is that voice was a higher-level construct than part of the cue text; in retrospect, I should have just made it a V:foo setting. But that doesn't work if there are multiple voices per cue.

   <v mary> ... </v> ?
   <.mary> ... </.mary> ?

The end tag could be optional, meaning it goes to the end of the cue. Maybe do this only if the voice is given at the start of the cue?
Comment 6 Henri Sivonen 2010-09-29 08:54:33 UTC
I think we should address forward compat by using the HTML fragment parsing algorithm and using the class attribute for "voices".
Comment 7 Philip Jägenstedt 2010-09-30 12:54:35 UTC
(In reply to comment #6)
> I think we should address forward compat by using the HTML fragment parsing
> algorithm and using the class attribute for "voices".

What about these issues with using an HTML parser:

* Non-browser user agents are likely not going to be happy to include an HTML parser and ECMAScript engine for something as simple as subtitles. (Assuming the fragment parser executes <script>, something I assume but haven't checked.)

* 1 document per cue is likely significant memory overhead. Is 1 document per cue actually necessary, or just something that was suggested at some point?

* Things like <img> in cues will either behave erratically, or require complex logic to load them ahead of time.

* class="philip" and class="henri" don't carry the semantics that there are two different speakers, which would be necessary to allow the user (not author) to style different speakers differently.
Comment 8 Philip Jägenstedt 2010-09-30 13:06:55 UTC
(In reply to comment #5)
> If we make voices be just regular WebSRT cue spans (like <b> or <i>), we also
> introduce a number of new problems:
> 
>  - what does it mean to nest voices?

It means that one speaker begins speaking and another joins in half-way, much like the chipmunks. I doubt people would go to this effort in their markup, but the semantics of it is clear.

>  - what does it mean to have voices nested inside <i> or <b> or worse still,
> <rt>?

This indeed makes no sense, so voices ought to be top-level only.

>  - what does it mean to have a cue with multiple voices, if we ever have an API
> that lets you filter on voice?

Return all cues where the voice appears. If the filter takes multiple voices, return the cues where both voices appear, matching how getElementsByClassName works.
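
A rough sketch of that filtering behaviour in Python. No such API exists in the spec; the Cue class and filter_by_voices function are hypothetical names used only to illustrate the "all requested voices must appear" semantics.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    # Hypothetical cue record: its text and the set of voices used in it.
    text: str
    voices: frozenset = frozenset()


def filter_by_voices(cues, wanted):
    """Return the cues in which every requested voice appears."""
    wanted = frozenset(wanted)
    # Subset test: a cue matches only if it contains all wanted voices.
    return [cue for cue in cues if wanted <= cue.voices]
```

With a single voice this returns every cue where that voice appears; with several voices it returns only the cues where all of them appear.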

> Still, there are definitely use cases for having a cue with different voices...
> British subtitles for example often have different colours per speaker and will
> mash them all into one paragraph. And I agree that it would be more usable to
> have speakers be named rather than numbered.
> 
> We could force them to be only at the top level of a cue, with, as you suggest,
> a special syntax for custom ones, and the default ones being "just tags". The
> reason they weren't before is that voice was a higher-level construct than part
> of the cue text; in retrospect, I should have just made it a V:foo setting. But
> that doesn't work if there are multiple voices per cue.
> 
>    <v mary> ... </v> ?
>    <.mary> ... </.mary> ?
> 
> The end tag could be optional, meaning it goes to the end of the cue. Maybe do
> this only if the voice is given at the start of the cue?

I still think we should try as far as possible to allow authoring of subtitles with and without HoH information in the same file, so it would probably help if the voice was a string that could, when the user so wishes, be prepended to the cue, as "Mary: I had a little lamb". Maybe:

<v Mary>
<v Peter Pan>
<v who="Peter Pan">

I'm not sure what the best syntax is, but you get the idea.
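
To illustrate the rendering idea (not the syntax question), here is a minimal Python sketch. The function render_cue and its show_voice flag are hypothetical; how the user preference would actually be exposed is left open.

```python
def render_cue(voice, text, show_voice):
    """Return the cue text, optionally prefixed with the voice name."""
    if show_voice and voice:
        # Prefix form discussed above: "Mary: I had a little lamb".
        return f"{voice}: {text}"
    return text
```

The same source cue then serves both presentations: with the flag off the plain text is shown, with it on the speaker name is added.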

I'm at <http://universalsubtitles.org/opensubtitles2010> today, I'll have a chat with some of the HoH participants to see what they think makes sense. In particular, I'm not sure how critical the semantics of different speakers is, as opposed to just being able to convey it presentationally.
Comment 9 Ian 'Hixie' Hickson 2010-09-30 18:36:44 UTC
I very rarely, if ever, see captions (i.e. tracks for the hard of hearing) with names in the tracks, so I'm not sure it's a critical issue. But I don't mind making it possible in the future to use the voice name in the rendering (or to make the rendering be based on the voice name).

I guess I'll go with <v Name o'Character>, and make them only allowed at the top of the tree and ignored elsewhere.
Comment 10 Philip Jägenstedt 2010-10-01 01:53:41 UTC
(In reply to comment #9)
> I very rarely, if ever, see captions (i.e. tracks for the hard of hearing) with
> names in the tracks, so I'm not sure it's a critical issue. But I don't mind
> making it possible in the future to use the voice name in the rendering (or to
> make the rendering be based on the voice name).

I discussed this with two of the HoH participants at the aforementioned summit. Specifically, I asked how multiple speakers are typically represented. As far as I understand from the discussion, using different colors or positioning close to the speaker isn't very common at all, and also something they weren't big fans of. Both said that prefixing with the speaker name is quite fine, but I don't have any insight into how common it is. Perhaps it's only done when it's needed for off-screen speakers, and would be annoying if done for every single cue. I could follow up on this if it matters to the design of some feature.

> I guess I'll go with <v Name o'Character>, and make them only allowed at the
> top of the tree and ignored elsewhere.

Sounds good to me, if the CSS glue for it is sane enough, i.e. you can target a specific voice (is having spaces in the names a problem?) and also use the voice name in e.g. :before { content: ??? }
Comment 11 Philip Jägenstedt 2010-10-03 02:21:19 UTC
As part of my presentation for OVC I implemented most of the WebSRT parser and the interfaces, but none of the rendering rules. I also added the <v bla> syntax and tried it out a bit:

http://people.opera.com/philipj/2010/10/02/ovc/demos/captions.html
http://people.opera.com/philipj/2010/10/02/ovc/demos/transcript.html
http://people.opera.com/philipj/2010/10/02/ovc/demos/metadata.html (a bit stupid)

There are some issues that we may or may not want to address:

* In the movie I was translating, no characters have any names, and there are 3 people I would have called "Advisor", but I ended up calling the two least important of them "Advisor (brown)" and "Advisor (black)" based on their color of clothing. If it wasn't needed to disambiguate in the same scene, I would have just called them all "Advisor". I think it's safe to assume that if we use <v THIS STRING HERE> for styling, some characters will appear under several names and some names be used for several characters. Not sure if it's a real-world problem.

* For the HoH, you need to have cues for laughter, "hmpf" and such. I expressed that as <v Character><sound>Hmpf!, which isn't really 2 voices as the syntax implies. Essentially, what I want is a hook that means "show this when sound is not available", which is how I'm using <sound>.

* Not sure how to mark up on-screen text that needs to be translated, like the title of the film and "The End". I just made them italics. Have no strong opinion.
Comment 12 Ian 'Hixie' Hickson 2010-10-04 23:32:03 UTC
"show this when sound is not available" is the difference between subtitles and captions. The usual solution is to use two tracks (possibly generated from the same source file on the server); this is essentially localisation.

I don't see anything wrong with having the same voice name used for multiple characters.

For the ":before" thing: there's no way currently to use generated content with WebSRT. If we add that somehow, then it would be trivial to add a keyword to refer to the name of the nearest ancestor voice, though.

Re "<v Character><sound>Hmpf!", why do you need the <sound> there?
Comment 13 Philip Jägenstedt 2010-10-05 02:12:24 UTC
(In reply to comment #12)
> "show this when sound is not available" is the difference between subtitles and
> captions. The usual solution is to use two tracks (possibly generated from the
> same source file on the server); this is essentially localisation.

Apart from the DRY principle, it just seems more likely that people will bother with HoH if they don't have to either (a) maintain two files that are largely the same or (b) implement server-side filtering. That's certainly true of myself, at least.

> I don't see anything wrong with having the same voice name used for multiple
> characters.
> 
> For the ":before" thing: there's no way currently to use generated content with
> WebSRT. If we add that somehow, then it would be trivial to add a keyword to
> refer to the name of the nearest ancestor voice, though.

Right. I do think we should have this possibility. Having the semantics of speakers in the file is kind of pointless if you can't use it for styling somehow.

> Re "<v Character><sound>Hmpf!", why do you need the <sound> there?

I need it if using the same source file for captions with and without HoH cues, hiding <sound> when it's used as "subtitles" (or whatever mechanism, browser pref, etc is used to decide this).
Comment 14 Silvia Pfeiffer 2010-10-07 05:01:58 UTC
(In reply to comment #10)
> (In reply to comment #9)
> > I very rarely, if ever, see captions (i.e. tracks for the hard of hearing) with
> > names in the tracks, so I'm not sure it's a critical issue. But I don't mind
> > making it possible in the future to use the voice name in the rendering (or to
> > make the rendering be based on the voice name).
> 
> I discussed this with two of the HoH participants at the aforementioned summit.
> Specifically, I asked how multiple speakers are typically represented. As far
> as I understand from the discussion, using different colors or positioning
> close to the speaker isn't very common at all, and also something they weren't
> big fans of. Both said that prefixing with the speaker name is quite fine, but
> I don't have any insight into how common it is. Perhaps it's only done when
> it's needed for off-screen speakers, and would be annoying if done for every
> single cue. I could follow up on this if it matters to the design of some
> feature.
>

Note that this is a US-specific and a HoH-specific viewpoint.
Comment 15 Silvia Pfeiffer 2010-10-07 05:14:27 UTC
(In reply to comment #7)
> (In reply to comment #6)
> > I think we should address forward compat by using the HTML fragment parsing
> > algorithm and using the class attribute for "voices".
> 
> What about these issues with using an HTML parser:
> 
> * Non-browser user agents are likely not going to be happy to include an HTML
> parser and ECMAScript engine for something as simple as subtitles. (Assuming
> the fragment parser executes <script>, something I assume but haven't checked.)

Video players increasingly use a rudimentary or full HTML parser. We probably want to exclude <script> support though.


> * 1 document per cue is likely significant memory overhead. Is 1 document per
> cue actually necessary, or just something that was suggested at some point?

It was a suggestion.


> * Things like <img> in cues will either behave erratically, or require complex
> logic to load them ahead of time.

I expect the browser to do what it always does: when it comes across an <img> element, it loads the resource on a best-effort basis. If it doesn't load in time, it stops loading it (just like when you navigate away from a page that is mid-loading).
Comment 16 Philip Jägenstedt 2010-10-13 08:50:29 UTC
(In reply to comment #14)
> (In reply to comment #10)
> > (In reply to comment #9)
> > > I very rarely, if ever, see captions (i.e. tracks for the hard of hearing) with
> > > names in the tracks, so I'm not sure it's a critical issue. But I don't mind
> > > making it possible in the future to use the voice name in the rendering (or to
> > > make the rendering be based on the voice name).
> > 
> > I discussed this with two of the HoH participants at the aforementioned summit.
> > Specifically, I asked how multiple speakers are typically represented. As far
> > as I understand from the discussion, using different colors or positioning
> > close to the speaker isn't very common at all, and also something they weren't
> > big fans of. Both said that prefixing with the speaker name is quite fine, but
> > I don't have any insight into how common it is. Perhaps it's only done when
> > it's needed for off-screen speakers, and would be annoying if done for every
> > single cue. I could follow up on this if it matters to the design of some
> > feature.
> >
> 
> Note that this is a US-specific and a HoH-specific viewpoint.

Also note that in the time I spent in the US after this, I watched TV with captions in the hotel a bit, and using positioning was actually very common, mostly with the text being aligned left, center or right, depending on speaker.

So, I shouldn't draw any big conclusions about anything.
Comment 17 Ian 'Hixie' Hickson 2010-10-14 07:00:06 UTC
At this point I'm leaning towards replacing the voice mechanism with <v foo>, where "foo" is any character name or other appropriate voice identifier. I think it might make sense to drop the hard-coded ones like <sound> and <narrator> and just have them be voices too, as in <v sfx> or <v narrator>. The <v foo> construct would only be allowed at the "top level" of a cue, and the </v> would be optional if the <v> was the first thing in the cue.
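
A rough Python sketch of how cue text using the <v foo> construct might be split into voice spans. This is not the spec's parsing algorithm: split_voices is a hypothetical helper, other tags such as <i> are left untouched in the text, and as a simplification a voice span here runs until </v>, the next <v ...>, or the end of the cue, which is more lenient than only allowing the end tag to be omitted for a leading <v>.

```python
import re

# Matches a start tag with a voice name (<v Some Name>) or an end tag (</v>).
VOICE_TAG = re.compile(r"<v\s+([^>]+)>|</v>")


def split_voices(cue_text):
    """Split cue text into (voice, text) pairs; voice is None outside <v>."""
    spans = []
    current = None  # voice in effect for the text being collected
    pos = 0
    for match in VOICE_TAG.finditer(cue_text):
        chunk = cue_text[pos:match.start()].strip()
        if chunk:
            spans.append((current, chunk))
        name = match.group(1)
        # A start tag switches voice; an end tag resets to "no voice".
        current = name.strip() if name is not None else None
        pos = match.end()
    tail = cue_text[pos:].strip()
    if tail:
        spans.append((current, tail))
    return spans
```

For a cue such as "<v Mary>I had a little lamb", the whole text is attributed to the voice "Mary" even though the end tag is missing.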

I am not sure it makes sense to have the same timed track for captions and subtitles, that seems like one of those things that sounds clever to software engineers like us but in practice is not that great (case in point, as far as I know it's never been done before). We could, though, in the future, introduce a new kind of cue that the browser could then offer either as a subtitle track or a caption track, with the subtitle version having certain things hidden and the caption track having character names shown in some cases, or some such.

For the specific example of "<v Mary><sound>cough", in practice most captions I've seen would instead use "<v desc>Mary coughs</v>" or some such, in italics, not the word "cough" in a special colour or anything. But my experience here is admittedly limited.
Comment 18 Silvia Pfeiffer 2010-10-14 07:49:41 UTC
(In reply to comment #17)
> At this point I'm leaning towards replacing the voice mechanism with <v foo>,
> where "foo" is any character name or other appropriate voice identifier. I
> think it might make sense to drop the hard-coded ones like <sound> and
> <narrator> and just have them be voices too, as in <v sfx> or <v narrator>. The
> <v foo> construct would only be allowed at the "top level" of a cue, and the
> </v> would be optional if the <v> was the first thing in the cue.


That seems to be more akin to the role of <span class="blah"></span> in HTML. I think I like it. I also wonder if calling it "voice" may be too specific and it should just be a region marker, maybe <m foo></m> ?

 
> I am not sure it makes sense to have the same timed track for captions and
> subtitles, that seems like one of those things that sounds clever to software
> engineers like us but in practice is not that great (case in point, as far as I
> know it's never been done before). We could, though, in the future, introduce a
> new kind of cue that the browser could then offer either as a subtitle track or
> a caption track, with the subtitle version having certain things hidden and the
> caption track having character names shown in some cases, or some such.
> 
> For the specific example of "<v Mary><sound>cough", in practice most captions
> I've seen would instead use "<v desc>Mary coughs</v>" or some such, in italics,
> not the word "cough" in a special colour or anything. But my experience here is
> admittedly limited.

I tend to agree - I think we may be trying to over-engineer this if we shove captions and subtitles in the same file. If somebody wanted to keep only one copy in a database with some extra fields for captions that then help dynamically create the subtitles and caption files, that would equally deal with this potential duplication of data.

Another potential approach towards solving this is to allow the @kind attribute to have multiple values such that we can use e.g. an English caption file both for caption and subtitle purposes.
Comment 19 Philip Jägenstedt 2010-10-14 11:11:30 UTC
(In reply to comment #18)
> (In reply to comment #17)
> > I am not sure it makes sense to have the same timed track for captions and
> > subtitles, that seems like one of those things that sounds clever to software
> > engineers like us but in practice is not that great (case in point, as far as I
> > know it's never been done before). We could, though, in the future, introduce a
> > new kind of cue that the browser could then offer either as a subtitle track or
> > a caption track, with the subtitle version having certain things hidden and the
> > caption track having character names shown in some cases, or some such.
> > 
> > For the specific example of "<v Mary><sound>cough", in practice most captions
> > I've seen would instead use "<v desc>Mary coughs</v>" or some such, in italics,
> > not the word "cough" in a special colour or anything. But my experience here is
> > admittedly limited.
> 
> I tend to agree - I think we may be trying to over-engineer this if we shove
> captions and subtitles in the same file. If somebody wanted to keep only one
> > copy in a database with some extra fields for captions that then help
> dynamically create the subtitles and caption files, that would equally deal
> with this potential duplication of data.
> 
> Another potential approach towards solving this is to allow the @kind attribute
> to have multiple values such that we can use e.g. an English caption file both
> for caption and subtitle purposes.

After actually trying to author such a file with stuff like "<v Mary><sound>cough" I'd be inclined to agree as well, this should at least wait a bit until people have had more time to experiment with what works and what doesn't. To that end, we should provide enough styling hooks to make this possible using only CSS though, which I think we'll do anyway.
Comment 20 Philip Jägenstedt 2010-10-14 11:40:56 UTC
(In reply to comment #18)
> (In reply to comment #17)
> > At this point I'm leaning towards replacing the voice mechanism with <v foo>,
> > where "foo" is any character name or other appropriate voice identifier. I
> > think it might make sense to drop the hard-coded ones like <sound> and
> > <narrator> and just have them be voices too, as in <v sfx> or <v narrator>. The
> > <v foo> construct would only be allowed at the "top level" of a cue, and the
> > </v> would be optional if the <v> was the first thing in the cue.
> 
> 
> That seems to be more akin to the role of <span class="blah"></span> in HTML. I
> think I like it. I also wonder if calling it "voice" may be too specific and it
> should just be a region marker, maybe <m foo></m> ?

I'm quite sympathetic towards removing the hard-coded voices. I'm not sure what the point of only allowing them at the top level is, that seems to make the parser slightly more complicated and disallows marking up chipmunk-style cues where part of the cue is spoken by more than one speaker. What's the benefit?

As for merging (the currently non-existent) styling hooks and voices syntax, I'm not so sure. At least with the <v Name O'Character> syntax and some CSS, it's possible to use that to show the speaker name somewhere for the benefit of HoH. Without that, you'd have to write specific CSS for each captions file.

As stated elsewhere, I actually think TTML does the speaker syntax quite well. The downside is that it complicates things and we'd need to put some stuff in a header.

Admittedly, this is a bit theoretical, since not many (any except TTML?) formats actually have semantic markup for the speaker.
Comment 21 Ian 'Hixie' Hickson 2010-12-14 02:11:51 UTC
Can you point me to the relevant part of TTML? If they have a good solution I'm quite happy to use it. I couldn't find it.
Comment 22 Silvia Pfeiffer 2010-12-14 03:42:41 UTC
(In reply to comment #21)
> Can you point me to the relevant part of TTML? If they have a good solution I'm
> quite happy to use it. I couldn't find it.

I just checked what Sean used as an example at http://www.w3.org/WAI/PF/HTML/wiki/TextFormat_Mapping_to_Requirements (useful to find TTML examples for our use cases). Here's how he marked it up:

<tt xml:lang="en" xmlns="http://www.w3.org/ns/ttml" xmlns:ttm="http://www.w3.org/ns/ttml#metadata">
  <head>
    <ttm:agent xml:id="connery" type="person">
      <ttm:name type="family">Connery</ttm:name>
      <ttm:name type="given">Thomas Sean</ttm:name>
      <ttm:name type="alias">Sean</ttm:name>
      <ttm:name type="full">Sir Thomas Sean Connery</ttm:name>
    </ttm:agent>
    <ttm:agent xml:id="bond" type="character">
      <ttm:name type="family">Bond</ttm:name>
      <ttm:name type="given">James</ttm:name>
      <ttm:name type="alias">007</ttm:name>
      <ttm:actor agent="connery"/>
    </ttm:agent>
  </head>
  <body>
    <div>
      ...
      <p ttm:agent="bond">I travel, a sort of licensed troubleshooter.</p>
      ...
    </div>
  </body>
</tt>

Basically TTML defines the people at the top in some metadata and then references them during the cue.
Comment 23 Philip Jägenstedt 2010-12-14 09:12:26 UTC
Hixie, what I was referring to was the ttm:agent attribute, see the example in http://www.w3.org/TR/ttaf1-dfxp/#metadata-vocabulary-agent-example-1

The good part being that a short id is used and is linked to the full name of the character and actor. I think that e.g. <v bond> and *optional* extra metadata about the voices would be a sensible approach, eventually.
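
A minimal Python sketch of that approach, assuming the optional metadata has already been read into a dictionary. The dictionary layout, VOICE_METADATA and display_name are hypothetical; the names are taken from the TTML example in comment 22.

```python
# Hypothetical optional metadata, keyed by the short id used in <v id>.
VOICE_METADATA = {
    "bond": {"name": "James Bond", "actor": "Sir Thomas Sean Connery"},
}


def display_name(voice_id, metadata=VOICE_METADATA):
    """Map a short voice id to a full name, falling back to the id itself."""
    entry = metadata.get(voice_id)
    if entry is None:
        # Metadata is optional, so an unknown id is shown as written.
        return voice_id
    return entry["name"]
```

Because the lookup falls back to the id, files without any metadata keep working and <v bond> simply displays as "bond".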
Comment 24 Ian 'Hixie' Hickson 2010-12-26 19:53:19 UTC
EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Accepted
Change Description: http://html5.org/tools/web-apps-tracker?from=5720&to=5721
Rationale: http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2010-December/029512.html