Re: [css3-speech] voice-volume from Daniel Weck on 2011-04-28 (www-style@w3.org from April 2011)

From: Daniel Weck <daniel.weck@gmail.com>
Date: Thu, 28 Apr 2011 23:50:35 +0100
To: W3C style mailing list <www-style@w3.org>, fantasai <fantasai.lists@inkedblade.net>
Message-Id: <33F52A73-5EA3-41DA-83F0-6E1AD671034B@gmail.com>

On 28 Apr 2011, at 08:00, fantasai wrote:
> voice-volume
>
> # silent, x-soft, soft, medium, loud, and x-loud
> #    A sequence of monotonically non-decreasing volume levels.
> #    The value of ‘silent’ is mapped to ‘0’ and ‘x-loud’ is
> #    mapped to ‘100’. The mapping of other values to numerical
> #    volume levels is implementation-dependent and may vary
> #    from one speech synthesizer to another.
>
> Because this definition doesn't map 'medium' to anything, it
> makes it near-impossible for an author to use the absolute
> values, assuming 'medium' (and not 'x-loud') is user's
> preferred volume and the author intends to use that as the
> baseline volume.

Well, the volume scale is linear amplitude, so (for the sake of  
argument) a simple fix would be to explicitly state the actual values  
corresponding to each keyword:

silent => 0
x-soft => 15
soft => 30
medium => 50
loud => 75
x-loud => 100 (max tolerable loudness, defined by user)
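
To make that concrete (a sketch only, assuming the hypothetical mapping
above were adopted, and using made-up class names), these two rules
would then yield the same computed volume:

p.aside-by-keyword
{
    voice-volume: soft; /* would resolve to 30 under the assumed mapping */
}

p.aside-by-number
{
    voice-volume: 30; /* plain value on the linear [0,100] scale */
}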

_However_, such a fixed mapping has limited usefulness, because the
keywords would then just be "shortcuts" to numerical values (i.e.
"named values"). As you rightly said, a more useful feature would be a
keyword enumeration that maps to "softest audible", "loudest
tolerable", and "preferred volume". My feeling is that the 5 values
(excluding silence) defined by SSML aim to express just that:

x-soft => "softest audible"
soft => ?
medium => "preferred volume"
loud => ?
x-loud => "loudest tolerable"

...but of course the "soft" and "loud" values remain slightly
underspecified (i.e. what should implementors do, and what should
authors expect when using these values?).

> Afaict, it's unlikely that the absolute
> scale can be used for anything other than fading from x-loud
> to silence.

Sure, a cursor can be moved along the linear volume scale to animate
the wave amplitude; that's a useful feature in itself.

I agree that without a deterministic mapping between the keywords
(which we assume represent "softest", "preferred" and "loudest", plus
two in-between steps) and absolute values, authors cannot use
numerical values to produce content that predictably meets concrete
user needs or a user agent's "reasonable" predefined settings. For
example, "medium" (or "preferred volume") does not necessarily
correspond to 50; it could be 90 for a reading system operating in a
loud environment.

However, this doesn't mean that numerical values are pointless; in
fact, there might also be use cases where the enumerated keywords are
not used at all.

> Percentages are tricky, because due to nesting, it's not
> possible to reference against 'medium', which I assume in
> most cases is what you'd want to do, right?

Well, the remark above about the usefulness of absolute numerical
values applies to percentages too, given that they are relative to the
inherited computed value, which sits on the somewhat abstract
linear [0,100] amplitude scale.
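
For instance (again just a sketch: the selectors and the starting value
of 80 are made up, and I'm assuming a percentage simply scales the
inherited computed value), nested percentages compound against
whatever the parent resolved to, never against 'medium':

body
{
    voice-volume: 80; /* arbitrary starting point on [0,100] */
}

div.quiet
{
    voice-volume: 50%; /* 50% of the inherited 80, i.e. 40 */
}

div.quiet span.quieter
{
    voice-volume: 50%; /* 50% of the inherited 40, i.e. 20 */
}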

We would need a different property-value syntax in order to provide
volume adjustment relative to a keyword. For example:

span.half-x-loud
{
    voice-volume: 50% x-loud; /* hypothetical syntax, not in the draft */
}

Are you requesting this feature, or merely pointing out that it is not
currently doable? In my opinion, this is still as non-deterministic
as the absolute-values case ("50% x-loud" may effectively resolve to
"medium"... but maybe not).

> It seems to me that what an author would really need is a
> scale that varies between "softest audible", "loudest
> tolerable", and "preferred volume", where each of these are
> set by the listener. The keywords give you that scale, but
> there are only 5 points on this scale, as opposed to infinite
> on the absolute scale, which strikes me as less useful in
> general...

Well, we either have a (short) enumeration with a tangible, easily
usable mapping to user values, or we have a scale with a large number
(technically, near-infinite) of abstract steps. Currently, we provide
both, and the only direct connection between the two is at the 0/min
and 100/max boundaries. It works (i.e. it can be implemented
unambiguously), but I agree that we lack a good understanding of how
authors benefit from the enormous number of absolute values.

> I'm having a hard time understanding how the capabilities
> of this property would be used, but I suspect it's not matching
> the authoring story very well. Perhaps you could explain how
> voice-volume values other than the keywords would be used?

I don't have a concrete usage in mind where absolute numerical values  
would be more useful to authors than 3 (or 5) pre-defined user-centric  
keyword-based volume levels.

I am not aware of SSML's rationale for this design choice, but I think
CSS-Speech should aim to remain compatible with SSML notation. It
doesn't really hurt anyone, right? Unless of course the specification
itself is ambiguous, which I don't think it is.

Regards, Daniel
Received on Thursday, 28 April 2011 22:51:06 UTC