RE: Consolidated comments on SSML

Dear Martin (and the Internationalization Working Group),

First, I would like to thank you for your very thorough review
of the SSML specification.  Given the number of specifications you
must review annually, we are pleased that you have given so much of
your time and effort to help us with our specification.

Our responses to your comments will be in two big blocks and then
a slow trickle for points still under discussion.  This email
contains the first big block of responses.  Points skipped below will
be addressed in one of the later emails.

If you believe we have not adequately addressed your issues with our
responses, please let us know as soon as possible.  If we do not hear
from you within 14 days, we will take this as tacit acceptance.  Given
the volume of responses in this email, we understand that a complete
review by you may take longer than this amount of time; if so, we
would appreciate an estimate as to when you might be able to complete
your review.
Also, since the Voice Browser Working Group's next face-to-face
meeting is next week, any concerns you have with our
responses that you could send or hint at by Wednesday, June 4 would be
especially helpful.

Once again, thank you for your thorough and considered input on
the specification.

-- Dan Burnett

Synthesis Team Leader, VBWG

[VBWG responses follow]

[3] Yes, the <voice> tag. In section 3.1.2 (xml:lang), we will note
that the <voice> element can be used to change just the language.
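A minimal sketch of this usage (the French phrase and language codes are illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  The French for cheese is
  <!-- xml:lang on <voice> changes only the language; other voice
       properties are inherited from the enclosing context -->
  <voice xml:lang="fr-FR">fromage</voice>.
</speak>
```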

[4] "this set" refers to "standards to enable access to the Web
using spoken interaction" from the previous sentence. If you
believe this to be unclear, can you suggest an appropriately compact
rewording (since this is text from the one-paragraph abstract)?

[5] Accepted.

[6] Rejected.  We had already planned to rearrange sections such that
section 2 now contains the Document Form (formerly section 3.1),
Conformance (formerly section 4), Integration (formerly 3.5), and
Fetching (formerly 3.6) sections straight off. If you believe this
to be insufficient, can you propose a specific text change for section 1?

[9] Accepted.

[10] We would welcome a specific text proposal from your group. Any
language example is fine with us.

[14] Accepted.  We will amend the text to indicate that only the
Schema reference is normative and not the references to RFC2396/2732.

[15] Accepted.  All that you say is correct. We will revise the
text to clarify as you suggest.

[16] Accepted.  Thank you. We will correct this.

[18] Accepted.

[19] Accepted with changes.  This is related to point 15. We will
reword this to correct the problems you mention in that point,
but the rewording may vary some from the text you suggest.

[21] Accepted.  We will add a reference, both here and in section
2.1.6, to section 1.2, step 3, where this is described.

[23] We would be happy to accept your offer to rewrite our example
using appropriate Japanese text.

[24] Accepted.

[30] How it would be spoken is processor-dependent. The <say-as>
element only provides information on how to interpret (or normalize)
a set of input tokens, not on how it is to be spoken.
Also, as you pointed out in point 27, "format='telephone'" is merely
an example and not a specified value, at least not at this time.

[31] Both are shown as examples to indicate two possible ways it
could be done. Neither is actually a specified way to use the
element, as you pointed out in point 27.

[33] In this example, without the detail attribute a processor might
leave out the colon or the dash, or it might not distinguish between
lower case and capital letters.
However, this is not actually a specified way to use the attribute,
as you pointed out in point 27.
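A sketch of the contrast, with the caveat that these attribute values are illustrative rather than specified:

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <!-- illustrative values; not a specified use of the attribute -->
  Part number
  <say-as interpret-as="characters" detail="strict">A-12:bB</say-as>
  <!-- with detail="strict", a processor might speak the dash and
       colon and distinguish "b" from "B"; without the attribute,
       it might drop the punctuation and the case distinction -->
</speak>
```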

[34] Rejected.  As you suggested in point 27, we will be removing all
of the tables of examples in this section. If and when we reintroduce
this table, we will correct any styling errors that remain.


[35] Accepted with changes.  The statement you refer to, which is present
in all of the element descriptions, will be modified to more fully
describe the content model for the element, although it may not be
worded exactly as you suggest.

[36] Rejected.  We have had considerable discussion on this point.
There are two parts to our response:

(1) It is assumed that the synthesis processor will use all contextual
information already at its disposal in order to render the text and
markup it is given.  For example, any relevant case or gender
information that can be determined from text surrounding the <say-as>
element is expected to be used.
(2) The ways and contexts in which information other than the specific number
value can be encoded via human language are many and varied. For example,
the way you count in Japanese varies based on the type of object that
you are counting. That level of complexity is well outside the intended
use of the <say-as> element. It is expected in such cases that either
the necessary contextual information is available, in normal surrounding
text, as described in part 1 above, or the text is normalized by the
application writer (e.g. "2" -> "zweiten").
We welcome any complete, multilingual proposals for consideration for a
future version of SSML.
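For the German ordinal case, author-side normalization can be expressed with the existing <sub> element; a minimal sketch (the sentence is illustrative, and "zweiten" is the dative form required by this particular context):

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="de-DE">
  <!-- the application writer normalizes "2." to the case- and
       gender-appropriate form for this sentence -->
  Wir treffen uns am <sub alias="zweiten">2.</sub> Mai.
</speak>
```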

[37] Rejected.  As you suggested in point 27, we will be removing these
examples altogether. If we should decide to reintroduce them at some
point, we would be happy to incorporate a revised or extended example
from you.

[41] Accepted.

[45] What would you suggest is the normal way?

[48] Rejected.  We have other elements such as <p> with the same
potential conflict. Also, we have not particularly crafted element
names to avoid conflicts with other markup vocabularies. We see no
direct need to change this element name.

[49] Accepted.  We will clarify within the text how application authors
should handle the cases presented in the referenced email.

[50] Accepted.

[52] Rejected.  This behavior is already permitted at processor
discretion for arbitrary-length strings of text. Specific words
or short phrases can be handled in a more predictable manner by
creating custom pronunciations in an external lexicon.
We do not believe this needs additional explanation in the document.
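A sketch of the external-lexicon approach (the lexicon URI and its contents are hypothetical):

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <!-- hypothetical lexicon document; entries there can pin down
       the pronunciation of specific words or short phrases -->
  <lexicon uri="http://www.example.com/french-loanwords.xml"/>
  The rendezvous is at noon.
</speak>
```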

[53] We have not had significant demand to standardize a value for
this, e.g. <voice name="kids">. Individual processors are of course
permitted to provide any voices they wish.

[54] Accepted.  If you provided us with example text in Japanese here
we would be more than happy to include it.

[55] Accepted.

[56] Accepted. The text and schema will be adjusted to clarify that
this attribute can only contain positive integers.

[57] Rejected.  This is an interesting suggestion that we will be
happy to consider for the next version of SSML (after 1.0).

[61] Accepted.  We will add such an explanation.

[62] Accepted.

[63] Accepted.  We will add this.

[64] Accepted.

[65] Accepted.  We will add this.

[67] Accepted.

[71] Accepted.  We will add an example.

[73] Accepted.  We will clarify this.

[75] Accepted.

[77] Accepted.  We will make this change.

[79] Accepted.  We will correct this.

[81] Accepted with changes.  This was accidentally left in
when originally copied from the VoiceXML specification. It
will be corrected.

[86] Accepted.  This is old text. We will clarify.

[88] Accepted.



> -----Original Message-----
> From: Martin Duerst [mailto:duerst@w3.org]
> Sent: Friday, January 31, 2003 7:50 PM
> To: www-voice@w3.org
> Cc: w3c-i18n-ig@w3.org
> Subject: Consolidated comments on SSML
> 
> 
> 
> Dear Voice Browser WG,
> 
> These are the Last Call comments on Speech Synthesis
> Markup Language (http://www.w3.org/TR/speech-synthesis/)
> from the Core Task Force of the Internationalization (I18N) WG.
> Please make sure that you send all emails regarding these
> comments to w3c-i18n-ig@w3.org, rather than to me personally
> or just to www-voice@w3.org (to which we are not subscribed).
> 
> These comments are based on review by Richard Ishida and myself and
> have been discussed and approved at the last I18N Core TF teleconference.
> They are ordered by section and numbered for easy reference.
> We have not classified these issues into editorial and substantial,
> but we think that it should be clear from their description.
> 
> General:
> [01]  For some languages, text-to-speech conversion is more difficult
>        than for others. In particular, Arabic and Hebrew are usually
>        written with none or only a few vowels indicated. Japanese
>        often needs separate indications for pronunciation.
>        It was not clear to us whether such cases were considered,
>        and if they had been considered, what the appropriate
>        solution was.
>        SSML should be clear about how it is expected to handle these
>        cases, and give examples. Potential solutions we came up with:
>        a) require/recommend that text in SSML is written in an
>        easily 'speakable' form (i.e. vowelized for Arabic/Hebrew,
>        or with Kana (phonetic alphabet(s)) for Japanese). (Problem:
>        displaying the text visually would not be satisfactory in this
>        case); b) using <sub>; c) using <phoneme> (Problem: only
>        having IPA available would be too tedious on authors);
>        d) reusing some otherwise defined markup for this purpose
>        (e.g. <ruby> from http://www.w3.org/TR/ruby/ for Japanese);
>        e) creating some additional markup in SSML.
> 
> General: Tagging for bidirectional rendering is not needed
> [02]  for text-to-speech conversion. But there is some provision
>        for SSML content to be displayed visually (to cover WAI
>        needs). This will not work without adequate support of bidi
>        needs, with appropriate markup and/or hooks for styling.
> 
> General: Is there a tag that allows changing the language in
> [03]  the middle of a sentence (such as <html:span>)? If not,
>        why not? This functionality needs to be provided.
> 
> 
> Abstract: 'is part of this set of new markup specifications': 
> Which set?
> [04]
> 
> Intro: 'The W3C Standard' -> 'This W3C Specification'
> [05]
> 
> Intro: Please shortly describe the intended uses of SSML here,
> [06]   rather than having the reader wait for Section 4.
> 
> 
> Section 1, para 2: Please shortly describe how SSML and Sable are
> [07]  related or different.
> 
> 
> 1.1, table: 'formatted text' -> 'marked-up text'
> [08]
> 
> 1.1, last bullet: add a comma before 'and' to make
> [09]  the sentence more readable
> 
> 
> 1.2, bullet 4, para 1: It might be nice to contrast the 45 phonemes
> [10] in English with some other language. This is just one case that
>       shows that there are many opportunities for more internationally
>       varied examples. Please take any such opportunities.
> 
> 1.2, bullet 4, para 3: "pronunciation dictionary" ->
> [11] "language-specific pronunciation dictionary"
> 
> 1.2:  How is "Tlalpachicatl" pronounced? Other examples may be
> [12]  St.John-Smyth (sinjen-smaithe) or Caius College
>        (keys college), or President Tito (sutto) [president of the
>        republic of Kiribati (kiribass)
> 
> 
> 1.1 and 1.5: Having a 'vocabulary' table in 1.1 and then a
> [13] terminology section is somewhat confusing.
>       Make 1.1 e.g. more text-only, with a reference to 1.5,
>       and have all terms listed in 1.5.
> 
> 1.5: The definition of anyURI in XML Schema is considerably wider
> [14] than RFC 2396/2732, in that anyURI allows non-ASCII characters.
>       For internationalization, this is very important. The text
>       must be changed to not give the wrong impression.
> 
> 1.5 (and 2.1.2): This (in particular 'following the
> [15]  XML specification') gives the wrong impression of where/how
>       xml:lang is defined. xml:lang is *defined* in the XML spec,
>       and *used* in SSML. Descriptions such as 'a language code is
>       required by RFC 3066' are confusing. What kind of language code?
>       Also, XML may be updated in the future to a new version of RFC
>       3066, SSML should not restrict itself to RFC 3066
>       (similar to the recent update from RFC 1766 to RFC 3066).
>       Please check the latest text in the XML errata for this.
> 
> 
> 2., intro: xml:lang is an attribute, not an element.
> [16]
> 
> 2.1.1, para 1: Given the importance of knowing the language for
> [17] speech synthesis, the xml:lang should be mandatory on the root
>       speak element. If not, there should be a strong 
> injunction to use it.
> 
> 2.1.1: 'The version number for this specification is 1.0.': please
> [18] say that this is what has to go into the value of the 'version'
>       attribute.
> 
> 
> 2.1.2., for the first paragraph, reword: 'To indicate the natural
> [19] language of an element and its attributes and subelements,
>       SSML uses xml:lang as defined in XML 1.0.'
> 
> The following elements also should allow xml:lang:
> [20] - <prosody> (language change may coincide with prosody change)
>       - <audio> (audio may be used for foreign-language pieces)
>       - <desc> (textual description may be different from audio,
>            e.g. <desc xml:lang='en'>Song in Japanese</desc>
>       - <say-as> (specific construct may be in different language)
>       - <sub>
>       - <phoneme>
> 
> 2.1.2: 'text normalization' (also in 2.1.6): What does this mean?
> [21] It needs to be clearly specified/explained, otherwise there may
>       be confusion with things such as NFC (see Character Model).
> 
> 2.1.2, example 1: Overall, it may be better to use utf-8 rather than
> [22] iso-8859-1 for the specification and the examples.
> 
> 2.1.2, example 1: To make the example more realistic, in the paragraph
> [23] that uses lang="ja" you should have Japanese text - not 
> an English
>       transcription, which may not be usable as such on a Japanese 
> text-to-speech
>       processor. In order to make sure the example can be viewed even
>       in situations where there are no Japanese fonts available, and
>       can be understood by everybody, some explanatory text 
> can provide
>       the romanized form. (We can help with Japanese if necessary.)
> 
> 2.1.2, 1st para after 1st example: Editorial.  We prefer "In the
> [24] case that a document requires speech output in a language not
>       supported by the processor, the speech processor 
> largely determines
>       the behavior."
> 
> 2.1.2, 2nd para after 1st example: "There may be variation..."
> [25] Is the 'may' a keyword as in RFC 2119? I.e., are you allowing
>       conformant processors to vary in the implementation of xml:lang?
>       If yes, what variations exactly would be allowed?
> 
> 
> 2.1.3: 'A paragraph element represents the paragraph structure'
> [26] -> 'A paragraph element represents a paragraph'. (same 
> for sentence)
>       Please decide to either use <p> or <paragraph>, but not both
>       (and same for sentence).
> 
> 
> 2.1.4: <say-as>: For interoperability, defining attributes
> [27] and giving (convincingly useful) values for these attributes
>       but saying that these will be specified in a separate document
>       is very dangerous. Either remove all the details (and then
>       maybe also the <say-as> element itself), or say that the
>       values given here are defined here, but that future versions
>       of this spec or separate specs may extend the list of values.
>       [Please note that this is only about the attribute values,
>        not the actual behavior, which is highly language-dependent
>        and probably does not need to be specified in every detail.]
> 
> 2.1.4, interpret-as and format, 6th paragraph: requirement that
> [28] text processor has to render text in addition to the indicated
>       content type is a recipe for bugwards compatibility (which
>       should be avoided).
> 
> 2.1.4, 'locale': change to 'language'.
> [29]
> 
> 2.1.4: How is format='telephone' spoken?
> [30]
> 2.1.4: Why are there 'ordinal' and 'cardinal' values for both
> [31]   interpret-as and format?
> 
> 2.1.4 'The detail attribute can be used for all say-as content types.'
> [32]   What's a content type in this context?
> 
> 2.1.4 detail 'strict': 'speak letters with all detail': As opposed
> [33]  to what (e.g. in that specific example)?
> 
> 2.1.4, last table: There seem to be some fixed-width aspects in the
> [34]   styling of this table. This should be corrected to 
> allow complete
>         viewing and printing at various overall widths.
> 
> 2.1.4, 4th para (and several similar in other sections):
> [35]  "The say-as element can only contain text." would be easier
>        to understand; we had to look around to find out whether the
>        current phrasing described an EMPTY element or not.
> 
> 2.1.4. For many languages, there is a need for additional information.
> [36]   For example, in German, ordinal numbers are denoted 
> with a number
>        followed by a period (e.g. '5.'). They are read 
> depending on case
>        and gender of the relevant noun (as well as depending 
> on the use
>        of definite or indefinite article).
> 
> 2.1.4, 4th row of 2nd table: I've seen some weird phone formats, but
> [37]  nothing quite like this! Maybe a more normal example would NOT
>        pronounce the separators. (Except in the Japanese 
> case, where the
>        spaces are (sometimes) pronounced (as 'no').)
> 
> 
> 2.1.5, <phoneme>:
> [38]  It is unclear to what extent this element is designed for
>        strictly phonemic and phonetic notations, or also (potentially)
>        for notations that are more phonetic-oriented than 
> usual writing
>        (e.g. Japanese kana-only, Arabic/Hebrew with full vowels,...)
>        and where the boundaries are to other elements such as <say-as>
>        and <sub>. This needs to be clarified.
> 
> 2.1.5 There may be different flavors and variants of IPA (see e.g.
> [39]  references in ISO 10646). Please make sure it is clear which
>        one is used.
> 
> 2.1.5 IPA is used both for phonetic and phonemic notations. Please
> [40]  clarify which one is to be used.
> 
> 2.1.5 This may need a note that not all characters used in IPA are
> [41]  in the IPA block.
> 
> 2.1.5 This seems to say that the only (currently) allowed value for
> [42]  alphabet is 'ipa'. If this is the case, this needs to be said
>        very clearly (and it may as well be defined as default, and
>        in that case the alphabet attribute to be optional). If there
>        are other values currently allowed, what are they? How are
>        they defined?
> 
> 2.1.5 'alphabet' may not be the best name. Alphabets are sets of
> [43]  characters, usually with an ordering. The same set of characters
>        could be used in totally different notations.
> 
> 2.1.5 What are the interactions of <phoneme> for foreign language
> [44]  segments? Do processors have to handle all of IPA, or only the
>        phonemes that are used in a particular language? 
> Please clarify.
> 
> 2.1.5, 1st example:  Please try to avoid character entities, as it
> [45] suggests strongly that this is the normal way to input 
> this stuff.
>       (see also issue about utf-8 vs. iso-8859-1)
> 
> 
> 2.1.5 and 2.1.6: The 'alias' and 'ph' attributes in some
> [46]  cases will need additional markup (e.g. for fine-grained
>        prosody, but also for additional emphasis, bidirectionality).
>        This would also help tools for translation,...
>        But markup is not possible for attributes. These attributes
>        should be changed to subelements, e.g. similar to the <desc>
>        element inside <audio>.
> 
> 2.1.5 and 2.1.6: Can you specify a null string for the ph and alias
> [47] attributes? This may be useful in mixed formats where the
>       pronunciation is given by another means, e.g. with ruby 
> annotation.
> 
> 
> 2.1.6 The <sub> element may easily clash or be confused with <sub>
> [48]  in HTML (in particular because the specification seems to be
>        designed to allow combinations with other markup vocabularies
>        without using different namespaces). <sub> should be renamed,
>        e.g. to <subst>.
> 
> 2.1.6 For abbreviations,... there are various cases. Please check
> [49]  that all the cases in
>        
> http://lists.w3.org/Archives/Member/w3c-i18n-ig/2002Mar/0064.html
>        are covered, and that the users of the spec know how to handle
>        them.
> 
> 2.1.6, 1st para: "the specified text" ->
> [50]   "text in the alias attribute value".
> 
> 
> 2.2.1, between the tables: "If there is no voice available for the
> [51]  requested language ... select a voice ... same language 
> but different
>        region..."  I'm not sure this makes sense.  I could 
> understand that
>        if there is no en-UK voice you'd maybe go for an en-US 
> voice - this
>        is a different DIALECT of English.  If there are no 
> Japanese voices
>        available for Japanese text, I'm not sure it makes 
> sense to use an
>        English voice. What happens in this situation?
> 
> 2.2.1 It should be mentioned that in some cases, it may make 
> sense to have
> [52]  a short piece of e.g. 'fr' text in an 'en' text been spoken by
>        an 'en' text-to-speech converter (the way it's often done by
>        human readers) rather than to throw an error. This is quite
>        different for longer texts, where it's useless to bother an
>        user.
> 
> 2.2.1: We wonder if there's a need for multiple voices (eg. A 
> group of kids)
> [53]
> 
> 2.2.1, 2nd example: You should include some text here.
> [54]
> 
> 2.2.1 The 'age' attribute should explicitly state that the integer
> [55]  is years, not something else.
> 
> 2.2.1 The variant attribute should say what its index origin is
> [56]  (e.g. either starting at 0 or at 1)
> 
> 2.2.1 attribute name: (in the long term,) it may be desirable to use
> [57]  an URI for voices, and to have some well-defined format(s)
>        for the necessary data.
> 
> 2.2.1, first example (and many other places): The line break between
> [58]  the <voice> start tag and the text "It's fleece was 
> white as snow."
>        will have negative effects on visual rendering.
>        (also, "It's" -> "Its")
> 
> 2.2.1, description of priorities of xml:lang, name, variant,...:
> [59]  It would be better to describe this clearly as priorities,
>        i.e. to say that for voice selection, xml:lang has highest
>        priority,...
> 
> 
> 2.2.3 What about <break> inside a word (e.g. for long words such as
> [60]  German)? What about <break> in cases where words cannot
>        clearly be identified (no spaces, such as in Chinese, Japanese,
>        Thai). <break> should be allowed in these cases.
> 
> 2.2.3 and 2.2.4: "x-high" and "x-low": the 'x-' prefix is part of
> [61]  colloquial English in many parts of the world, but may be
>        difficult to understand for non-native English speakers.
>        Please add an explanation.
> 
> 
> 2.2.4: Please add a note that customary pitch levels and
> [62]  pitch ranges may differ quite a bit with natural 
> language, and that
>        "high",... may refer to different absolute pitch 
> levels for different
>        languages. Example: Japanese has a generally much lower 
> pitch range than
>        Chinese.
> 
> 2.2.4, 'baseline pitch', 'pitch range': Please provide definition/
> [63]   short explanation.
> 
> 2.2.4 'as a percent' -> 'as a percentage'
> [64]
> 
> 2.2.4 What is a 'semitone'? Please provide a short explanation.
> [65]
> 
> 2.2.4 In pitch contour, are white spaces allowed? At what places
> [66]  exactly? In "(0%,+20)(10%,+30%)(40%,+10)", I would propose
>        to allow whitespace between ')' and '(', but not elsewhere.
>        This has the benefit of minimizing syntactic differences
>        while allowing long contours to be formatted with line breaks.
> 
> 2.2.4, bullets: Editorial nit.  It may help the first time reader to
> [67]   mention that 'relative change' is defined a little 
> further down.
> 
> 2.2.4, 4th bullet: the speaking rate is set in words per minute.
> [68]  In many languages what constitutes a word is often difficult to
>        determine, and varies considerably in average length.
>        So there have to be more details to make this work 
> interoperably
>        in different languages. Also, it seems that 'words per minute'
>        is a nominal rate, rather than exactly counting words, which
>        should be stated clearly. A much preferable 
> alternative is to use
>        another metric, such as syllables per minute, which has less
>        unclarity (not
> 
> 2.2.4, 5th bullet: If the default is 100.0, how do you make it
> [69]  louder given that the scale ranges from 0.0 to 100.0?
>        (or, in other words, is the default to always shout?)
> 
> 2.2.4, Please state whether units such as 'Hz' are case-sensitive
> [70] or case-insensitive. They should be case-sensitive, because
>       units in general are (e.g. mHz (milliHz) vs. MHz (MegaHz)).
> 
> 
> 2.3.3 Please provide some example of <desc>
> [71]
> 
> 3.1  Requiring an XML declaration for SSML when XML itself
> [72] doesn't require an XML declaration leads to unnecessary
>       discrepancies. It may be very difficult to check this
>       with an off-the-shelf XML parser, and it is not reasonable
>       to require SSML implementations to write their own XML
>       parsers or modify an XML parser. So this requirement
>       should be removed (e.g. by saying that SSML requires an XML
>       declaration when XML requires it).
> 
> 
> 3.3, last paragraph before 'The lexicon element' subtitle:
> [73] Please also say that the determination of
>       what is a word may be language-specific.
> 
> 3.3 'type' attribute on lexicon element: What's this attribute used
> [74] for? The media type will be determined from the document that
>       is found at the 'uri' URI, or not?
> 
> 
> 4.1 'synthesis document fragment' -> 'speech synthesis 
> document fragment'
> [75]
> 
> 4.1  Conversion to stand-alone document: xml:lang should not
> [76] be removed. It should also be clear whether content of
>       non-synthesis elements should be removed, or only the
>       markup.
> 
> 
> 4.4 'requirement for handling of languages': Maybe better to
> [77] say 'natural languages', to avoid confusion with markup
>       languages. Clarification is also needed in the following
>       bullet points.
> 
> 
> 4.5  This should say that a user agent has to support at least
> [78] one natural language.
> 
> 
> App A: 'http://www.w3c.org/music.wav': W3C's Web site is www.w3.org.
> [79]   But this example should use www.example.org or www.example.com.
> 
> App B: 'synthesis DTD' -> 'speech synthesis DTD'
> [80]
> 
> App D: Why does this mention 'recording'? Please remove or explain.
> [81]
> 
> App E: Please give a reference for the application to the 
> IETF/IESG/IANA
> [82]   for the content type 'application/ssml+xml'.
> 
> App F: 'Support for other phoneme alphabets.': What's a 
> 'phoneme alphabet'?
> [83]
> 
> App F, last paragraph: 'Unfortunately, ... no standard for designating
> [84]   regions...': This should be worded differently. RFC 
> 3066 provides
>         for the registration of arbitrary extensions, so that e.g.
>         en-gb-accent-scottish and en-gb-accent-welsh could be 
> registered.
> 
> App F, bullet 3: I guess you already know that intonation
> [85]   requirements can vary considerably across languages, so you'll
>         need to cast your net fairly wide here.
> 
> App G: What is meant by 'input' and 'output' languages? This is the
> [86]   first time this terminology is used. Please remove or clarify.
> 
> App G: 'overriding the SSML Processor default language': There should
> [87]   be no such default language. An SSML Processor may only
>         support a single language, but that's different from
>         assuming a default language.
> 
> 
> 
> Regards,   Martin.
> 
> 

Received on Thursday, 29 May 2003 01:19:09 UTC