This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 10944 - Add speech synthesis features to WebVTT descriptions
Summary: Add speech synthesis features to WebVTT descriptions
Status: NEW
Alias: None
Product: TextTracks CG
Classification: Unclassified
Component: WebVTT
Version: unspecified
Hardware: All
OS: All
Importance: P2 normal
Target Milestone: ---
Assignee: Silvia Pfeiffer
QA Contact: This bug has no owner yet - up for the taking
Whiteboard: v2
Keywords: media, NotInW3CSpecYet
Depends on:
Reported: 2010-10-01 07:27 UTC by Masatomo Kobayashi
Modified: 2015-10-05 00:56 UTC
CC List: 8 users

See Also:


Description Masatomo Kobayashi 2010-10-01 07:27:06 UTC
Section affected:

WebSRT, especially voice declaration and cue text, seems to have only caption-specific features. As WebSRT is supposed to be used to provide synthesized audio descriptions (and possibly synthesized sign languages in the near future), additional options for audio descriptions would be needed. Or these optional features could be moved to other places, e.g., CSS.
Comment 1 Tab Atkins Jr. 2010-10-01 14:55:46 UTC
What makes you think that WebSRT will provide synthesized audio or SL?  WebSRT is a captioning format.  Perhaps some other feature of <video> or UAs will provide those, but I don't think it's something that WebSRT is meant to handle.
Comment 2 John Foliot 2010-10-01 15:19:59 UTC
(In reply to comment #1)
> What makes you think that WebSRT will provide synthesized audio or SL?  WebSRT
> is a captioning format.  Perhaps some other feature of <video> or UAs will
> provide those, but I don't think it's something that WebSRT is meant to handle.

WebSRT is, as far as I can tell, a format for time-stamped text data synchronized to media. It can be used for captions, subtitling, and possibly other uses as well. It is not yet part of the W3C specification; it is in fact a draft spec produced by the WHATWG. (I am unaware of any production-ready examples in the wild.)

If WebSRT cannot be used to also deliver synthesized audio, then it might not be the right candidate as a baseline time-stamp format, as this need is both clear and real: descriptive text is an identified requirement that has both legal precedent and real-world examples. Using these synchronization time-stamps, we should be able to provide descriptive texts to non-sighted users, and the IBM team (of which Masatomo is a part) has already developed a proof of concept that uses time-stamped texts and synthesized voices to meet this requirement.

It is for these reasons that WebSRT has not yet been incorporated into the W3C spec - the assessment of that candidate format's suitability has not yet been completed.

Tab, if this is a topic of interest to you, I encourage you (and others) to review the requirements checklist - in fact, one of the next steps is to start mapping potential solutions against this checklist, looking for holes and defects. Help here would be greatly appreciated - feel free to ping the mailing list and preface the subject with [media] - we can use all the help we can get. (Come on in, grab a spot!)
Comment 3 Ian 'Hixie' Hickson 2010-10-05 22:00:24 UTC
What audio description features are you missing?
Comment 4 Masatomo Kobayashi 2010-10-06 11:37:20 UTC
As speech synthesis features, based on our study, "voice family" (specific names or gender) and "volume" will be required; "pause", "rate", "pitch", and "balance" will also be useful. The CSS Speech Module covers these features, just as traditional CSS covers some WebSRT cue settings such as size and position.

As an audio-description-specific feature, it should also be possible to specify the behavior when the speech synthesis hasn't finished by the cue's end time.
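The properties listed above map directly onto the CSS Speech Module. As a hedged sketch only, assuming, hypothetically, that those properties could be applied to cue voices via the ::cue() pseudo-element (the property names exist in the CSS Speech Module, but their use with ::cue() is not specified anywhere):

```
WEBVTT

00:00:05.000 --> 00:00:10.000
<v Narrator>A man enters a dimly lit room.
```

```css
/* Hypothetical only: CSS Speech Module properties on a WebVTT voice span. */
::cue(v[voice="Narrator"]) {
  voice-family: female;   /* "voice family" from comment 4 */
  voice-volume: soft;     /* "volume"  */
  voice-rate: fast;       /* "rate"    */
  voice-pitch: low;       /* "pitch"   */
  voice-balance: left;    /* "balance" */
  pause-after: 500ms;     /* "pause"   */
}
```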
Comment 5 Ian 'Hixie' Hickson 2010-10-12 08:49:09 UTC
The stuff covered by CSS should just be covered by CSS. Before I spec that, though, I'd like some implementation experience so that we can make sure that's sane. Please let me know when there's an implementation of this feature so that I can study it and ask the relevant implementors for their experience.

For example, the case you bring up of a description that's too long to play coherently in the cue's time span is a good one. What do implementors find their users want? Should the video slow down? Pause? Should the API expose this? These are all questions that it'd be good to get figured out before we specify it.
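One way a page script might reason about the overrun case raised here, sketched under stated assumptions: the word-count heuristic, the 150 words-per-minute default, and both function names are invented for illustration and come from no spec or implementation.

```javascript
// Hypothetical helper: estimate how long a synthesized description would
// take to speak, from a crude word count (assumed 150 words per minute).
function estimateSpeechSeconds(text, wordsPerMinute = 150) {
  const words = text.trim().split(/\s+/).filter(Boolean).length;
  return (words * 60) / wordsPerMinute;
}

// One possible policy for "speech hasn't finished by the cue's end time":
// pause the video if the estimate overruns the cue's time span.
function overrunPolicy(cueStartTime, cueEndTime, text) {
  const available = cueEndTime - cueStartTime;
  return estimateSpeechSeconds(text) > available
    ? "pause-video"
    : "play-through";
}
```

Whether the video should instead slow down, or whether the choice should be exposed through an API, is exactly the open question above; this sketch only shows that a page-level fallback policy is expressible in a few lines.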
Comment 6 Silvia Pfeiffer 2013-07-08 11:44:42 UTC
Re-opening and assigning to TextTracks CG, where this will need to be specified for WebVTT.