This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 20529 - SpeechSynthesisUtterance interface voiceURI attribute should be voice
Summary: SpeechSynthesisUtterance interface voiceURI attribute should be voice
Status: RESOLVED FIXED
Alias: None
Product: Speech API
Classification: Unclassified
Component: Speech API
Version: unspecified
Hardware: All All
Importance: P2 normal
Target Milestone: ---
Assignee: Glen Shires
Reported: 2012-12-28 14:57 UTC by Trevor Saunders
Modified: 2013-02-25 19:19 UTC (History)
5 users


Description Trevor Saunders 2012-12-28 14:57:55 UTC
The SpeechSynthesisUtterance interface currently has
attribute DOMString voiceURI
but that's bad because to do anything with it you must look up the voice with that URI in the SpeechSynthesisVoiceList.

Instead there should be
attribute SpeechSynthesisVoice voice
That way, no lookup is required.
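The difference can be sketched with plain objects standing in for getVoices() results (the voice names and URIs below are invented for illustration):

```javascript
// Invented voice list standing in for what getVoices() would return.
const voices = [
  { voiceURI: 'urn:example:en', name: 'English', lang: 'en-US', default: true },
  { voiceURI: 'urn:example:fr', name: 'French', lang: 'fr-FR', default: false },
];

// With only a voiceURI string, reading the voice back means a lookup:
const withURI = { text: 'bonjour', voiceURI: 'urn:example:fr' };
const resolved = voices.find(v => v.voiceURI === withURI.voiceURI);
console.log(resolved.lang); // "fr-FR"

// With a voice attribute, the object is available directly, no scan needed:
const withVoice = { text: 'bonjour', voice: voices[1] };
console.log(withVoice.voice.lang); // "fr-FR"
```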
Comment 1 Glen Shires 2013-01-01 19:55:56 UTC
I envision that most developers will either use the default voice, or will call getVoices() to get the list of available voices. The developer may then select one of these instances by setting SpeechSynthesisUtterance.voiceURI to the desired SpeechSynthesisVoiceList[i].voiceURI.

If I correctly understand the concern in this bug, the developer would be using SpeechSynthesisUtterance.voiceURI to query the characteristics of the default or current voice. This would require a call to getVoices to scan for either the "default" attribute or a matching voiceURI. I'm not sure why this is a bad thing, because it seems to me that most developers doing this would do so to select the most appropriate voice, in which case they'd need to call getVoices anyway.
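The scan described above can be sketched as a small helper, with invented voice objects standing in for getVoices() results:

```javascript
// Hypothetical helper: resolve a voiceURI (or the default) to a voice
// object by scanning the getVoices() result, as the voiceURI design requires.
function resolveVoice(voices, voiceURI) {
  if (voiceURI) {
    // Scan for the matching voiceURI.
    return voices.find(v => v.voiceURI === voiceURI) || null;
  }
  // No voiceURI set: scan for the "default" attribute instead.
  return voices.find(v => v.default) || null;
}

const voices = [
  { voiceURI: 'urn:example:en', lang: 'en-US', default: true },
  { voiceURI: 'urn:example:fr', lang: 'fr-FR', default: false },
];
console.log(resolveVoice(voices, null).lang);             // "en-US"
console.log(resolveVoice(voices, 'urn:example:fr').lang); // "fr-FR"
```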

Note that if SpeechSynthesisUtterance exposed a SpeechSynthesisVoice attribute as a short-cut to query the characteristics of the current voice, it probably should still be readonly; otherwise it would be hard to define how setting one SpeechSynthesisVoice attribute might affect the others.

Since getVoices is a synchronous call, it should return quickly, and thus I expect most UA implementations would not perform a round-trip to the server (for example, they might instead cache the values). If this is a problem, then perhaps that is the bug we should address here, and instead return the SpeechSynthesisVoiceList results via an EventHandler. However, this would add complexity for the developer, so we should think through the trade-off carefully.
Comment 2 Trevor Saunders 2013-01-02 20:16:15 UTC
(In reply to comment #1)
> I envision that most developers will either use the default voice, or will
> call getVoices() to get the list of available voices. The developer may then
> select one of these instances by setting SpeechSynthesisUtterance.voiceURI
> to the desired SpeechSynthesisVoiceList[i].voiceURI

I mostly agree with this, but I don't see why you prefer utterance.voiceURI = voices[i].voiceURI to utterance.voice = voices[i]. That is, I don't see why setting the voice for the utterance should be indirected through setting a URI.

> Note that if SpeechSynthesisUtterance.SpeechSynthesisVoice were exposed as a
> short-cut to inquire the characteristics of the current voice, it probably
> still should be readonly, otherwise it would be hard to define how setting
> one SpeechSynthesisVoice attribute might affect the others.

In comment 0 I was suggesting we get rid of .voiceURI and have only .voice, to handle this issue.
Comment 3 Glen Shires 2013-01-05 06:08:52 UTC
When calling getVoices() and setting utterance.voiceURI = voices[i].voiceURI, a URN can be used. (That is, whatever string voices[i].voiceURI returns can simply be treated as an opaque string, not a URL.)

Another way that developers might use voiceURI is to set it directly to a URL to access a third-party TTS service. For example, example.com/en-us/female might be used to access a particular voice on the example.com service. (Yes, this assumes the developer knows in advance the URL, or has some other means of determining it. It also assumes the browser knows how to connect to such a third-party service.)

In both cases, utterance.voiceURI is an attribute of type DOMString.

If I correctly understand, you propose instead that utterance.voice be an attribute of type SpeechSynthesisVoice. I agree this simplifies the former case (utterance.voice = voices[i]), but it eliminates the potential extensibility of the latter case.

Another possibility is to use utterance.voice of type SpeechSynthesisVoice as you propose, and add a serviceURI parameter to getVoices(serviceURI) -- and define getVoices(NULL) as returning the default voices. If we did this, we'd need to make getVoices asynchronous by adding a onvoicelist EventHandler - but perhaps we need to do this anyway to handle cases where the default voices are implemented on a remote server.

More specifically:

    interface SpeechSynthesis : EventTarget {
      ...
      // Async call to get available voices at serviceURI.
      // If serviceURI is NULL, will return default voices.
      void getVoices(DOMString serviceURI);

      // Uses SpeechSynthesisVoicesEvent interface.
      attribute EventHandler onvoicelist;
    };

    interface SpeechSynthesisVoicesEvent : Event {
      readonly attribute SpeechSynthesisVoiceList voices;
    };

    interface SpeechSynthesisUtterance : EventTarget {
      // If NULL, default voice is used.
      attribute SpeechSynthesisVoice voice;
      ...
    };
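A sketch of how a page might use this proposed getVoices(serviceURI)/onvoicelist pair. The mock object below stands in for a UA implementation of the IDL above; the voice data is invented, and the mock fires the event synchronously where a real UA would dispatch it asynchronously:

```javascript
// Mock UA implementing the proposed interface (illustrative only).
const speechSynthesisMock = {
  onvoicelist: null,
  getVoices(serviceURI) {
    // NULL serviceURI returns the default voices; otherwise the UA
    // would query the remote service at serviceURI.
    const voices = serviceURI === null
      ? [{ voiceURI: 'urn:default:en', lang: 'en-US', default: true }]
      : [{ voiceURI: serviceURI + '/voice1', lang: 'en-US', default: false }];
    // A real UA would fire this asynchronously; inline here for brevity.
    if (this.onvoicelist) this.onvoicelist({ voices });
  },
};

let chosenVoice = null;
speechSynthesisMock.onvoicelist = (event) => {
  // Equivalent of u.voice = event.voices[0] on a SpeechSynthesisUtterance.
  chosenVoice = event.voices[0];
};
speechSynthesisMock.getVoices(null);
console.log(chosenVoice.voiceURI); // "urn:default:en"
```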
Comment 4 Trevor Saunders 2013-01-05 06:44:06 UTC
> Another way that developers might use voiceURI is to set it directly to a
> URL to access a third-party TTS service. For example,
> example.com/en-us/female might be used to access a particular voice on the
> example.com service. (Yes, this assumes the developer knows in advance the
> URL, or has some other means of determining it. It also assumes the browser
> knows how to connect to such a third-party service.)

It's of course possible, but I don't think browsers really want to deal with connecting to arbitrary services. People could of course define a protocol for making TTS requests, but I just don't see there being enough demand for anyone to bother.

Note that if you make this possible, you then have to decide what to do about SpeechSynthesis.getVoices(), since it can't possibly enumerate all the available voices. So you have a choice: you can list the default available voices, and then either add or not add new voices that you learn about when someone sets SpeechSynthesisUtterance.voiceURI to something you haven't seen before.

> If I correctly understand, you propose instead that utterance.voice be an
> attribute of type SpeechSynthesisVoice. I agree this simplifies the former

Correct.

> case (utterance.voice = voices[i]), but it eliminates the potential
> extensibility of the latter case.

That's true, but I don't think that extensibility is very important, and I suspect will never be implemented.

> Another possibility is to use utterance.voice of type SpeechSynthesisVoice
> as you propose, and add a serviceURI parameter to getVoices(serviceURI) --
> and define getVoices(NULL) as returning the default voices. If we did this,
> we'd need to make getVoices asynchronous by adding a onvoicelist
> EventHandler - but perhaps we need to do this anyway to handle cases where
> the default voices are implemented on a remote server.

It seems like we could add all this in a backwards-compatible way, so I'd tend to hold off on implementing it until someone really sees the need for the extensibility it allows.
Comment 5 Dominic Mazzoni 2013-01-14 21:10:34 UTC
I think I agree with Trevor here; I prefer just "voice" over voiceURI.

At this point there's no open standard for a server TTS implementation, so it's pure speculation what one might look like; it may turn out that a URI is insufficient to specify the interface. It's also unclear whether it makes sense for a webpage to specify a particular TTS server, or for the browser to trust that resource.

I have no strong preference as to whether the attribute should be a string or whether it should be a reference to a SpeechSynthesisVoice. One advantage of a string is that it's more easily serializable (i.e. you could dump an utterance with JSON.stringify).
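The serializability point, sketched (JSON.stringify is the standard API; the object shapes here are invented). A string voice attribute serializes as-is, while an object voice needs a replacer that collapses it back to its URI:

```javascript
// String form: serializes directly.
const stringForm = { text: 'bonjour', voiceURI: 'urn:example:fr' };
JSON.stringify(stringForm); // works as-is

// Object form: the voice is a live object, so collapse it to its URI
// with a replacer when serializing.
const voice = { voiceURI: 'urn:example:fr', lang: 'fr-FR' };
const objectForm = { text: 'bonjour', voice };
const json = JSON.stringify(objectForm, (key, value) =>
  key === 'voice' ? value.voiceURI : value);
console.log(json); // {"text":"bonjour","voice":"urn:example:fr"}
```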
Comment 6 chris fleizach 2013-01-23 19:55:11 UTC
I agree; having the voiceURI was weird for me when using it. I would have preferred to put a voice object in there.
Comment 7 Glen Shires 2013-01-31 21:54:26 UTC
In trying to summarize this thread, I think there are two proposals here:


(1) Simply rename "voiceURI" to "voice" with no change in functionality:

    interface SpeechSynthesisUtterance : EventTarget {
      attribute DOMString voice;           // was voiceURI
    };
    interface SpeechSynthesisVoice {
      readonly attribute DOMString voice;  // was voiceURI
    };

The disadvantage I see is that the string must be unique, and the name "voice" does not convey this as well. Perhaps renaming these to voiceURN might be clearer.


(2) Change from DOMString to SpeechSynthesisVoice. Specifically:

    interface SpeechSynthesisUtterance : EventTarget {
      attribute SpeechSynthesisVoice voice;  // was DOMString voiceURI
    };
    interface SpeechSynthesisVoice {
      readonly attribute DOMString voiceURI; // NO CHANGE (or omit)
    };

To select a voice using option (2), the developer would either use getVoices:

    var voices = speechSynthesis.getVoices();
    // pick one, let's say the one at index i
    var u = new SpeechSynthesisUtterance();
    u.voice = voices[i];
    
or create a new voice:

    var voice = new SpeechSynthesisVoice();
    // select some attributes, perhaps leave others NULL.
    voice.lang = 'fr-FR';
    voice.localService = true;
    var u = new SpeechSynthesisUtterance();
    u.voice = voice;

At this point, it's up to the UA to select the most appropriate voice. (For example, if it only has a local-service English voice, and a remote-service French voice, which should it choose? What if it has neither, should it use the default voice?)  We could write priority rules / guidelines into the spec, or make it UA dependent.
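One way such priority rules might look, sketched as code. This ranking (exact language match, then language-prefix match, then localService, falling back to the default voice) is purely illustrative, not from the spec:

```javascript
// Illustrative scoring: exact lang match beats a language-prefix match,
// which beats a localService match. None of this is normative.
function score(voice, wanted) {
  let s = 0;
  if (wanted.lang) {
    if (voice.lang === wanted.lang) s += 4;
    else if (voice.lang.split('-')[0] === wanted.lang.split('-')[0]) s += 2;
  }
  if (wanted.localService !== undefined &&
      voice.localService === wanted.localService) s += 1;
  return s;
}

function selectVoice(voices, wanted) {
  const best = voices.reduce((a, b) =>
    score(b, wanted) > score(a, wanted) ? b : a);
  // Nothing matched at all: fall back to the default voice, if any.
  return score(best, wanted) > 0 ? best : voices.find(v => v.default) || null;
}

const available = [
  { voiceURI: 'urn:ua:en', lang: 'en-US', localService: true, default: true },
  { voiceURI: 'urn:ua:fr', lang: 'fr-FR', localService: false, default: false },
];
console.log(selectVoice(available, { lang: 'fr-FR' }).voiceURI); // "urn:ua:fr"
console.log(selectVoice(available, { lang: 'de-DE' }).voiceURI); // "urn:ua:en" (default)
```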

Option (2) could also be used to select a remote service as follows ...

    var voice = new SpeechSynthesisVoice();
    voice.voiceURI = 'example.com/myvoice/etc/params';
    var u = new SpeechSynthesisUtterance();
    u.voice = voice;


A variant of (2) would be to eliminate SpeechSynthesisVoice.voiceURI and perhaps add it in the future when there is a protocol defined to select remote services.


Both options (1) and (2) have their pros and cons.  Which do you prefer?  Are there other options?
Comment 8 Trevor Saunders 2013-01-31 22:13:52 UTC
> Both options (1) and (2) have their pros and cons.  Which do you prefer? 
> Are there other options?

I prefer #2, it seems more expressive and flexible than #1.
Comment 9 Eitan Isaacson 2013-01-31 22:20:09 UTC
(In reply to comment #8)
> > Both options (1) and (2) have their pros and cons.  Which do you prefer? 
> > Are there other options?
> 
> I prefer #2, it seems more expressive and flexible than #1.

I like #2 also as a way for content scripts to create new voices with remote services. So speechSynthesis.getVoices() returns the available UA voices, and the content is free to create its own voices based on known services.
Comment 10 Dominic Mazzoni 2013-01-31 23:01:11 UTC
Another vote for option #2. That gives us the flexibility to allow a developer to specify a remote URL once a spec has emerged, but keeps the API cleaner in the meantime.
Comment 11 Glen Shires 2013-02-03 21:39:50 UTC
Option #2 appears to be the consensus, so I propose the following text for the errata. If there's no disagreement I'll add this to the errata on February 18.



Section 5.2 IDL: SpeechSynthesisUtterance "attribute DOMString voiceURI" should be "attribute SpeechSynthesisVoice voice".


Section 5.2.3 SpeechSynthesisUtterance Attributes: 
"voiceURI attribute" should be "voice attribute", with the following definition:

The voice attribute specifies the speech synthesis voice that the web application wishes to use. If, at the time of the play method call, this attribute has been set to one of the SpeechSynthesisVoice objects returned by getVoices, then the user agent MUST use that voice. If this attribute is unset or null at the time of the play method call, then the user agent MUST use a user agent default voice. The user agent default voice SHOULD support the current language (see the "lang" attribute) and can be a local or remote speech service, and can incorporate end-user choices via interfaces provided by the user agent, such as browser configuration parameters.
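The two MUST clauses above can be restated as a sketch of what a UA does at play time (plain objects as mocks; the helper name and `defaultVoice` are invented for illustration):

```javascript
// Hypothetical helper mirroring the errata: at play() time, a voice set
// to one of the getVoices() objects MUST be used; unset or null means
// the UA default voice MUST be used.
function pickVoiceForPlay(utterance, uaVoices, defaultVoice) {
  if (utterance.voice && uaVoices.includes(utterance.voice)) {
    return utterance.voice;
  }
  return defaultVoice;
}

const uaVoices = [{ voiceURI: 'urn:ua:en' }, { voiceURI: 'urn:ua:fr' }];
const fallback = { voiceURI: 'urn:ua:default' };
console.log(pickVoiceForPlay({ voice: uaVoices[1] }, uaVoices, fallback).voiceURI); // "urn:ua:fr"
console.log(pickVoiceForPlay({ voice: null }, uaVoices, fallback).voiceURI);        // "urn:ua:default"
```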


Section 5.2.6 SpeechSynthesisVoice: voiceURI attribute
"as described in the SpeechSynthesisUtterance voiceURI attribute" should be "either through use of a URN with meaning to the user agent or by specifying a URL that the user agent recognizes as a local service."
Comment 12 Glen Shires 2013-02-25 19:19:12 UTC
I've updated the errata with the above change (E07 - Option #2):
https://dvcs.w3.org/hg/speech-api/rev/28f4291bfb17

As always, the current errata is at:
http://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi-errata.html