Copyright ©1999 - 2003 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply.
This document details the responses made by the Voice Browser Working Group to issues raised during the Last Call (beginning 2 December 2002 and ending 15 January 2003) review of Speech Synthesis Markup Language (SSML) Version 1.0. Comments were provided by Voice Browser Working Group members, other W3C Working Groups, and the public via the www-voice-request@w3.org (archive) mailing list.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This document of the W3C's Voice Browser Working Group describes the disposition of comments as of December 4, 2003 on Speech Synthesis Markup Language (SSML) Version 1.0 Last Call. It may be updated, replaced or rendered obsolete by other W3C documents at any time.
Comments on this document and requests for further information should be sent to the Working Group's public mailing list www-voice@w3.org (archive). Note that, as a precaution against spam, you should first subscribe to this list by sending an email to <www-voice-request@w3.org> with the word subscribe in the subject line (use the word unsubscribe instead if you want to unsubscribe).
This document has been produced as part of the W3C Voice Browser Activity, following the procedures set out for the W3C Process. The authors of this document are members of the Voice Browser Working Group (W3C Members only).
This document describes the disposition of comments in relation to the Speech Synthesis Markup Language (SSML) Version 1.0 (http://www.w3.org/TR/2002/WD-speech-synthesis-20021202/). Each issue is described by the name of the commenter, a description of the issue, and either the resolution or the reason that the issue was not resolved.
Notation: Each original comment is tracked by a "(Change) Request" [SSCR] designator. Each point within that original comment is identified by a point number. For example, "SSCR5-1" is the first point in the fifth change request for the specification.
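As an illustration of the designator notation, the following minimal Python sketch splits a designator into its change-request number and point number. The function name `parse_designator` and the regular expression are assumptions made for this example, inferred from the designators listed in this document; they are not part of any Working Group tooling.

```python
import re
from typing import Tuple

# Hypothetical pattern, inferred from examples such as "SSCR5-1" and
# "SSCR145-89"; it is not taken from an official definition.
_DESIGNATOR = re.compile(r"SSCR(\d+)-(\d+)")


def parse_designator(designator: str) -> Tuple[int, int]:
    """Return (change_request_number, point_number) for a string like 'SSCR5-1'."""
    match = _DESIGNATOR.fullmatch(designator)
    if match is None:
        raise ValueError(f"not an SSCR designator: {designator!r}")
    return int(match.group(1)), int(match.group(2))
```

For example, `parse_designator("SSCR5-1")` yields the change request number 5 and the point number 1.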
| Item | Commenter | Proposed disposition | Status |
| --- | --- | --- | --- |
| SSCR122-1 | Alberto Ciaramella | Rejected | Implicitly accepted |
| SSCR122-2 | Alberto Ciaramella | Accepted | Implicitly accepted |
| SSCR122-3 | Alberto Ciaramella | Rejected | Implicitly accepted |
| SSCR123-1 | Bob Edgar | Rejected | Implicitly accepted |
| SSCR124-1 | Susan Lesch | Accepted | Implicitly accepted |
| SSCR124-2 | Susan Lesch | Accepted with changes | Implicitly accepted |
| SSCR124-3 | Susan Lesch | Accepted | Implicitly accepted |
| SSCR125-1 | Susan Lesch | Accepted with changes | Implicitly accepted |
| SSCR125-2 | Susan Lesch | Accepted | Implicitly accepted |
| SSCR125-3 | Susan Lesch | Accepted | Implicitly accepted |
| SSCR125-4 | Susan Lesch | Accepted with changes | Implicitly accepted |
| SSCR126-1 | Alex Monaghan | Rejected | Implicitly accepted |
| SSCR126-2 | Alex Monaghan | Accepted | Implicitly accepted |
| SSCR126-3 | Alex Monaghan | Rejected | Implicitly accepted |
| SSCR126-4 | Alex Monaghan | Accepted with changes | Implicitly accepted |
| SSCR126-5 | Alex Monaghan | Rejected | Implicitly accepted |
| SSCR126-6 | Alex Monaghan | Rejected | Implicitly accepted |
| SSCR126-7 | Alex Monaghan | Rejected | Implicitly accepted |
| SSCR126-8 | Alex Monaghan | Rejected | Implicitly accepted |
| SSCR126-9 | Alex Monaghan | Accepted with changes | Implicitly accepted |
| SSCR126-10 | Alex Monaghan | Rejected | Implicitly accepted |
| SSCR126-11 | Alex Monaghan | | Implicitly accepted |
| SSCR126-12 | Alex Monaghan | Rejected | Implicitly accepted |
| SSCR126-13 | Alex Monaghan | Rejected | Implicitly accepted |
| SSCR126-14 | Alex Monaghan | Rejected | Implicitly accepted |
| SSCR127-1 | Dave Pawson | Accepted/Rejected | Implicitly accepted |
| SSCR127-2 | Dave Pawson | Accepted | Implicitly accepted |
| SSCR127-3 | Dave Pawson | Rejected | Implicitly accepted |
| SSCR128-1 | Dave Pawson | Rejected | Implicitly accepted |
| SSCR128-2 | Dave Pawson | Rejected | Implicitly accepted |
| SSCR129-1 | Dave Pawson | Accepted with changes | Accepted |
| SSCR129-2 | Dave Pawson | Rejected | Implicitly accepted |
| SSCR129-3 | Dave Pawson | Rejected | Implicitly accepted |
| SSCR129-4 | Dave Pawson | Accepted | Implicitly accepted |
| SSCR129-5 | Dave Pawson | Accepted | Implicitly accepted |
| SSCR129-6 | Dave Pawson | Accepted | Implicitly accepted |
| SSCR130-1 | Alex Monaghan | Rejected | Implicitly accepted |
| SSCR131-1 | Adhemar Vandamme | Rejected | Implicitly accepted |
| SSCR132-1 | Sobia Mahmud | N/A | Implicitly accepted |
| SSCR132-2 | Sobia Mahmud | N/A | Implicitly accepted |
| SSCR133-1 | Max Froumentin | Accepted/Question | Accepted |
| SSCR133-2 | Max Froumentin | Rejected | Accepted |
| SSCR133-3 | Max Froumentin | N/A | Accepted |
| SSCR133-4 | Max Froumentin | Accepted with changes | Accepted |
| SSCR133-5 | Max Froumentin | Accepted | Accepted |
| SSCR133-6 | Max Froumentin | Accepted | Accepted |
| SSCR133-7 | Max Froumentin | Rejected | Accepted |
| SSCR133-8 | Max Froumentin | Accepted | Accepted |
| SSCR133-9 | Max Froumentin | Accepted | Accepted |
| SSCR133-10 | Max Froumentin | N/A | Accepted |
| SSCR134-1 | Dave Pawson | Rejected | Accepted |
| SSCR134-2 | Dave Pawson | Accepted/Rejected | Accepted |
| SSCR135-1 | Dan Brickley | Accepted | Implicitly accepted |
| SSCR135-2 | Dan Brickley | N/A | Implicitly withdrawn |
| SSCR136-1 | Susan Lesch | Accepted | Accepted |
| SSCR136-2 | Susan Lesch | N/A | Implicitly accepted |
| SSCR136-3 | Susan Lesch | Rejected | Accepted |
| SSCR136-4 | Susan Lesch | Accepted | Accepted |
| SSCR137-1 | Dave Pawson | N/A | Accepted |
| SSCR138-1 | Marc Schroeder | Accepted with changes | Implicitly accepted |
| SSCR139-1 | Alex Monaghan | Accepted with changes | Implicitly accepted |
| SSCR140-1 | Andrew Thompson | N/A | Implicitly accepted |
| SSCR140-2 | Andrew Thompson | Rejected | Implicitly accepted |
| SSCR140-3 | Andrew Thompson | N/A | Implicitly accepted |
| SSCR140-4 | Andrew Thompson | Rejected | Implicitly accepted |
| SSCR140-5 | Andrew Thompson | Accepted/Rejected | Implicitly accepted |
| SSCR140-6 | Andrew Thompson | Rejected | Implicitly accepted |
| SSCR140-7 | Andrew Thompson | N/A | Implicitly accepted |
| SSCR140-8 | Andrew Thompson | N/A | Implicitly accepted |
| SSCR141-1 | Al Gilman | Rejected | Accepted |
| SSCR141-2 | Al Gilman | Accepted | Accepted |
| SSCR141-3 | Al Gilman | Rejected | Accepted |
| SSCR141-4 | Al Gilman | Accepted with changes | Accepted |
| SSCR142-1 | WAI-PF | Accepted with changes | Implicitly accepted |
| SSCR143-1 | Alex Monaghan | Accepted with changes | Implicitly accepted |
| SSCR144-1 | Richard Schwerdtfeger | Rejected | Accepted |
| SSCR145-1 | I18N Interest Group | Rejected | Accepted |
| SSCR145-2 | I18N Interest Group | Rejected | Accepted |
| SSCR145-3 | I18N Interest Group | N/A | Accepted |
| SSCR145-4 | I18N Interest Group | N/A | Accepted |
| SSCR145-5 | I18N Interest Group | Accepted | Accepted |
| SSCR145-6 | I18N Interest Group | Accepted | Accepted |
| SSCR145-7 | I18N Interest Group | Accepted | Accepted |
| SSCR145-8 | I18N Interest Group | Accepted | Accepted |
| SSCR145-9 | I18N Interest Group | Accepted | Accepted |
| SSCR145-10 | I18N Interest Group | Accepted | Accepted |
| SSCR145-11 | I18N Interest Group | Accepted with changes | Accepted |
| SSCR145-12 | I18N Interest Group | Accepted | Accepted |
| SSCR145-13 | I18N Interest Group | Accepted | Accepted |
| SSCR145-14 | I18N Interest Group | Accepted | Accepted |
| SSCR145-15 | I18N Interest Group | Accepted | Accepted |
| SSCR145-16 | I18N Interest Group | Accepted | Accepted |
| SSCR145-17 | I18N Interest Group | Accepted | Accepted |
| SSCR145-18 | I18N Interest Group | Accepted | Accepted |
| SSCR145-19 | I18N Interest Group | Accepted with changes | Accepted |
| SSCR145-20 | I18N Interest Group | Rejected | Accepted |
| SSCR145-21 | I18N Interest Group | Accepted | Accepted |
| SSCR145-22 | I18N Interest Group | Accepted with changes | Accepted |
| SSCR145-23 | I18N Interest Group | Accepted | Accepted |
| SSCR145-24 | I18N Interest Group | Accepted | Accepted |
| SSCR145-25 | I18N Interest Group | N/A | Accepted |
| SSCR145-26 | I18N Interest Group | Accepted | Accepted |
| SSCR145-27 | I18N Interest Group | Accepted | Accepted |
| SSCR145-28 | I18N Interest Group | Rejected | Withdrawn |
| SSCR145-29 | I18N Interest Group | Accepted | Accepted |
| SSCR145-30 | I18N Interest Group | N/A | Accepted |
| SSCR145-31 | I18N Interest Group | N/A | Accepted |
| SSCR145-32 | I18N Interest Group | N/A | Accepted |
| SSCR145-33 | I18N Interest Group | N/A | Accepted |
| SSCR145-34 | I18N Interest Group | Rejected | Accepted |
| SSCR145-35 | I18N Interest Group | Accepted with changes | Accepted |
| SSCR145-36 | I18N Interest Group | Rejected | Accepted |
| SSCR145-37 | I18N Interest Group | Rejected | Accepted |
| SSCR145-38 | I18N Interest Group | Accepted | Accepted |
| SSCR145-39 | I18N Interest Group | Accepted | Accepted |
| SSCR145-40 | I18N Interest Group | Accepted | Accepted |
| SSCR145-41 | I18N Interest Group | Accepted | Accepted |
| SSCR145-42 | I18N Interest Group | Accepted | Accepted |
| SSCR145-43 | I18N Interest Group | Accepted | Accepted |
| SSCR145-44 | I18N Interest Group | Accepted | Accepted |
| SSCR145-45 | I18N Interest Group | Accepted with changes | Accepted |
| SSCR145-46 | I18N Interest Group | Accepted/Rejected | Accepted |
| SSCR145-47 | I18N Interest Group | Rejected | Accepted |
| SSCR145-48 | I18N Interest Group | Rejected | Accepted |
| SSCR145-49 | I18N Interest Group | Accepted | Accepted |
| SSCR145-50 | I18N Interest Group | Accepted | Accepted |
| SSCR145-51 | I18N Interest Group | N/A | Accepted |
| SSCR145-52 | I18N Interest Group | Accepted | Accepted |
| SSCR145-53 | I18N Interest Group | N/A | Accepted |
| SSCR145-54 | I18N Interest Group | Accepted | Withdrawn |
| SSCR145-55 | I18N Interest Group | Accepted | Accepted |
| SSCR145-56 | I18N Interest Group | Accepted | Accepted |
| SSCR145-57 | I18N Interest Group | Rejected | Accepted |
| SSCR145-58 | I18N Interest Group | N/A | Withdrawn |
| SSCR145-59 | I18N Interest Group | Accepted with changes | Accepted |
| SSCR145-60 | I18N Interest Group | Accepted | Accepted |
| SSCR145-61 | I18N Interest Group | Accepted | Accepted |
| SSCR145-62 | I18N Interest Group | Accepted | Accepted |
| SSCR145-63 | I18N Interest Group | Accepted | Accepted |
| SSCR145-64 | I18N Interest Group | Accepted | Accepted |
| SSCR145-65 | I18N Interest Group | Accepted | Accepted |
| SSCR145-66 | I18N Interest Group | Accepted | Accepted |
| SSCR145-67 | I18N Interest Group | Rejected | Accepted |
| SSCR145-68 | I18N Interest Group | Accepted with changes | Accepted |
| SSCR145-69 | I18N Interest Group | N/A | Accepted |
| SSCR145-70 | I18N Interest Group | Accepted | Accepted |
| SSCR145-71 | I18N Interest Group | Accepted | Accepted |
| SSCR145-72 | I18N Interest Group | Accepted | Accepted |
| SSCR145-73 | I18N Interest Group | Accepted | Accepted |
| SSCR145-74 | I18N Interest Group | N/A | Accepted |
| SSCR145-75 | I18N Interest Group | Accepted | Accepted |
| SSCR145-76 | I18N Interest Group | Accepted | Accepted |
| SSCR145-77 | I18N Interest Group | Accepted | Accepted |
| SSCR145-78 | I18N Interest Group | Accepted | Accepted |
| SSCR145-79 | I18N Interest Group | Accepted | Accepted |
| SSCR145-80 | I18N Interest Group | Accepted | Accepted |
| SSCR145-81 | I18N Interest Group | Accepted with changes | Accepted |
| SSCR145-82 | I18N Interest Group | Accepted | Accepted |
| SSCR145-83 | I18N Interest Group | N/A | Accepted |
| SSCR145-84 | I18N Interest Group | Accepted | Accepted |
| SSCR145-85 | I18N Interest Group | N/A | Accepted |
| SSCR145-86 | I18N Interest Group | Accepted | Accepted |
| SSCR145-87 | I18N Interest Group | Accepted | Accepted |
| SSCR145-88 | I18N Interest Group | Accepted | Accepted |
| SSCR145-89 | I18N Interest Group | Rejected | Accepted |
| SSCR146-1 | XML Schema WG | Accepted/Rejected | Implicitly accepted |
| SSCR151-1 | Dave Pawson | Rejected | Accepted |
From Alberto Ciaramella
Here follow my comments about the Speech Synthesis mark up language specification of the Speech Interface Framework, draft dated 3 January 2001.
- 1) paragraph 1.2. It shows that the processing, in different stages, is influenced both by the mark up support and by not mark-up behaviour. I suggest to add here as a general rule that "explicit mark up always takes the precedence over not-mark up behaviour". This kind of rule in the present version of the document is presented as the usage note 2 at the end of 2.4, but it is definitely more general than this.
Proposed disposition: Rejected
Behavior in the specification is determined on an element-by-element basis because the markup in some cases might try to do something which the engine knows to be inappropriate. As an example, a prosody contour with sequential pitch targets that vary wildly will not be observed very closely by any commercial engine because the audio would be exceedingly unnatural and likely unintelligible. Additionally, requiring the markup behavior to take precedence would be difficult to enforce without audio checks that measure not just conformance, but performance. We do not believe it is appropriate for the specification to render too fine an opinion on performance.
Email Trail:
From Alberto Ciaramella
Here follow my comments about the Speech Synthesis mark up language specification of the Speech Interface Framework, draft dated 3 January 2001.
- 2) paragraph 1.2 point 6: waveform production mark up support. I do not agree that "the TTS markup does not provide explicit controls over the generation of the waveforms". In fact with mark-ups already introduced in point 5 for controlling the prosody you can control both the volume and the speed.
Proposed disposition: Accepted
We will remove this sentence.
Email Trail:
From Alberto Ciaramella
Here follow my comments about the Speech Synthesis mark up language specification of the Speech Interface Framework, draft dated 3 January 2001.
- 3) Other than this [comment 122-2], always in paragraph 1.2 point 6, it is advisable to identify if a sentence can or can not be interrupted by a barge-in. This feature is present in section 4.1.5 of the document "Voice Extensible mark up language", version 2. Thus poses another more general issue: what is the relationship between the document "Speech Synthesis mark up language" and the chapter 4 (System Output) of the Voice XML version 2. It must be explicitated, taking care not to duplicate the definitions between these documents in order to simplify the document maintenance.
Proposed disposition: Rejected
We believe this comment has been addressed by changes to both SSML and VoiceXML. Although examples of SSML embedded in other languages are appropriate for this document, specific details are not. Barge-in behavior, for example, is outside the scope of this specification.
Email Trail:
From Bob Edgar
Comment on:
Speech Synthesis Markup Language Specification for the Speech Interface Framework
W3C Working Draft 3 January 2001
http://www.w3.org/TR/2001/WD-speech-synthesis-20010103
DTD gives "number:ordinal" and "number:digits" as valid say-as types, but not just "number", however the example in 2.4 has <say-as type="number">. According to my reading of the XML specification, and also according to Microsoft's validating parser, this is not allowed by the DTD, you would have to explicitly allow "number" -- there is no rule that says you can match a prefix of the attribute value. The same issue applies to date, time and duration types.
Bob.
Proposed disposition: Rejected
We agree there is an error. We rejected this request only because the <say-as> element no longer has this level of detail.
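The commenter's point about enumerated attribute types can be sketched as follows. In XML 1.0, an attribute declared with an enumerated type is valid only if its value matches one of the declared tokens exactly. The Python below models that rule with a plain set; it is not a DTD validator, and the set holds only the two values quoted in the comment (the January 2001 draft's full enumeration is not reproduced here). The function names are hypothetical.

```python
# Only the two values quoted in the comment above; the draft DTD's complete
# enumeration is not reproduced here, so this set is an illustrative assumption.
DECLARED_TYPES = frozenset({"number:ordinal", "number:digits"})


def is_valid_enumerated_value(value: str, declared=DECLARED_TYPES) -> bool:
    """Model of the XML rule: the value must equal a declared token exactly."""
    return value in declared


def is_prefix_of_declared(value: str, declared=DECLARED_TYPES) -> bool:
    """The looser prefix matching that the draft's example would have needed.

    This is NOT an XML validity rule; it is shown only for contrast.
    """
    return any(token.startswith(value) for token in declared)
```

Under this model, `is_valid_enumerated_value("number")` is false even though `is_prefix_of_declared("number")` is true, which is the mismatch the comment describes.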
Email Trail:
From Susan Lesch
Just a few comments about the speech grammar and synthesis Last Call Working Drafts [1,2].
[1] http://www.w3.org/TR/2001/WD-speech-grammar-20010103/
[2] http://www.w3.org/TR/2001/WD-speech-synthesis-20010103/
The markup, embedded CSS, and overall presentation is well done and easy to follow. One suggestion is to mark up elements and attributes as XHTML code, for example, <code>item</code>, rather than quote them as "item". No added color would be necessary.
Proposed disposition: Accepted
We will mark up SSML elements and attributes along the lines of the approach used in the XHTML2 Working Drafts.
Email Trail:
From Susan Lesch
Just a few comments about the speech grammar and synthesis Last Call Working Drafts [1,2].
[1] http://www.w3.org/TR/2001/WD-speech-grammar-20010103/
[2] http://www.w3.org/TR/2001/WD-speech-synthesis-20010103/
(2) Regarding the embedded style, I have been told that hex values are better supported in old browsers than RGB. In Netscape 4.x Mac, "list-style: none;" renders as a question mark for each bullet and should be omitted. In IE 3.x Mac, "background" color is supported; ("background-color" is not). Also, when declaring a background color, a text color is needed; #000 would be fine.
Proposed disposition: Accepted with changes
Your point about cross-browser styling is a good one. From the next draft of the specification onwards we will be following Dave Raggett's style guide as much as possible.
Email Trail:
From Susan Lesch
Just a few comments about the speech grammar and synthesis Last Call Working Drafts [1,2].
[1] http://www.w3.org/TR/2001/WD-speech-grammar-20010103/
[2] http://www.w3.org/TR/2001/WD-speech-synthesis-20010103/
(3) Finally, both Working Drafts have extensive "future study" sections. When and if the drafts move to higher maturity levels, I think these sections should be cut and moved to another location, possibly linked from the Voice Browser Activity home page. Somehow speculating would seem out of place in a Recommendation. What do you think?
Proposed disposition: Accepted
This section has already been significantly reduced in the last draft. We agree that it should be removed entirely as the document moves closer to Recommendation.
Email Trail:
From Susan Lesch
(1) The title "Speech Synthesis Markup Language Specification for the
Speech Interface Framework" is quite long. How does "Speech
Synthesis Markup Language (SSML) Specification Version 1.0" sound?
Proposed disposition: Accepted with changes
We have accepted this with a slight modification: Speech Synthesis Markup Language Version 1.0.
Email Trail:
From Susan Lesch
(2) The references need development and could work without hopping to
other specs. In 1., JSML can be a link [JSML] to your normative
references in section 7. In 1.3, 2.8, and 2.9 there can be a link
[CSS2]. In 2.2, I would use XML 1.0 section 2.12 as the authority
with a [XML] link to your normative references.
Proposed disposition: Accepted
All non-SSML links within the document now point to an item in the References section which itself provides the official external reference and link.
Email Trail:
From Susan Lesch
(3) These documents could all be normative references rather than links
in the running text; I may have missed some:
Speech Synthesis Markup Requirements for Voice Markup Languages
Cascading Style Sheets, level 2
RFC 1766
Extensible Markup Language (XML) 1.0 (Second Edition)
Namespaces in XML
International Phonetic Alphabet
ASCII Phonetic Symbols for the World's Languages: Worldbet
Computer-coding the IPA: a proposed extension of SAMPA
A good model is HTML at http://www.w3.org/TR/html401/references.html.
(The definition terms like [CSS2] can be black.)
Proposed disposition: Accepted
All non-SSML links within the document now point to an item in the References section which itself provides the official external reference and link.
Email Trail:
From Susan Lesch
(4) Below, a section number is followed by a quote and then a suggestion.
Abstract
working group
Working Group
[For W3C entities, you can copy capitalization conventions from the
W3C Process document at http://www.w3.org/Consortium/Process/.]

web [twice]
Web

and etc.
etc.

1.
Standard
standard

1. par. 2
web
Web

and etc.
etc.

1.1 list item 2
Audio Cascading Style Sheets
aural Cascading Style Sheets

In 1.2, the items with hyphens for bullets could be unordered lists.
Also, in list item 5, "- Markup support" is br-separated unlike the
others.

1.2 list item 4
names; e.g.
names, e.g.

1.3 list 1 item 3
as many details
[does this mean "many details" or "as many details as are necessary"?]

1.3 list 2 item 2
Interoperability with Aural CSS:
Interoperability with aural CSS (ACSS):

Aural CSS-enabled
aural CSS-enabled

1.3 list 2 item 3
style-sheet [twice]
style sheet

2.3 last par.
an enclosing paragraph or sentence elements
an enclosing paragraph or sentence element

2.4 Time, Date and Measure Types last par.
separated by single, non-alphanumeric character.
separated by a single, non-alphanumeric character.

2.4 Address, Name, Net Types list item 2
internet
Internet

2.4 third example
acme.com is a registered domain. W3C recommends using example.com,
example.org, or example.net which IANA has reserved for examples.
Please see RFC 2606 section 3 at http://www.ietf.org/rfc/rfc2606.txt.

2.5 list item 2
Postscript
PostScript

2.6 list item 2 needs an ending period.

2.8 list item 2 and 2.9 duration
It follows the "Times" attribute format from the
Cascading Style Sheet Specification. e.g. "250ms", "3s".
could read [five changes here]:
It follows the "time" attribute format from the
Cascading Style Sheet Level 2 Recommendation [CSS2],
e.g. "250ms", "3s".

2.9 Relative values
SSML
[The acronym would work fine throughout the spec. I would use it
in parentheses after the first occurrence of "Speech Synthesis
Markup Language" in the Abstract or section 1 and thereafter.]

2.9 Pitch contour
attribute (absolute,
attribute; (absolute,

2.9 last word
minute.)
minute).

2.10 par. 1
mime-type
MIME type

3.1
Lernout and Hauspie Speech Products
Lernout & Hauspie Speech Products

3.3
dialog markup language
[Is this "Dialog Markup Language"?]

3.6 par. 1
string (markup
string; (markup

4.
Informative.
informative.

4. second example
Lee
Berners-Lee

<audio src="http://www.w3c.org/music.wav">
<audio src="http://w3c.example.org/music.wav">

5.
Normative.
normative.

The second paragraph needs an ending period.

5.1 list item 1
(relative to XML) is well-formed.
is well-formed XML [XML section 2.1]

5.1 list item 2
is a valid XML document
is a valid XML document [XML section 2.8]

5.3 par. 2
XML Namespaces.
Namespaces in XML.

6.
Normative.
normative.

7. Informative
http://www.voicexml.com/ [twice]
http://www.voicexml.org/
Proposed disposition: Accepted with changes
Most of your changes were accepted verbatim. For the remaining cases the problem you implied was corrected via other text.
Email Trail:
From Alex Monaghan
It is not clear who the intended users of this markup language are. There are two obvious types of possible users: speech synthesis system developers, and application developers. The former may well be concerned with low-level details of timing, pitch and pronunciation, and be able to specify these details (F0 targets, phonetic transcriptions, pause durations, etc.). The latter group are much more likely to be concerned with specifying higher-level notions such as levels of boundary, degrees of emphasis, fast vs slow speech rate, and formal vs casual pronunciation. The proposal appears to be aimed at both groups, but no indication is given as to which aspects of the markup language are intended for which group.
Distinguish clearly between tags intended for speech synthesis developers and tags intended for application designers.
Proposed disposition: Rejected
We believe that all the tags are appropriate for and needed by application developers. Commercial deployments of SSML so far appear to have borne out this conclusion.
Email Trail:
From Alex Monaghan
It is clear that the proposal includes two, and in some cases three, different levels of markup. For F0, for instance, there is the <emphasis> tag (which would be realised as a pitch excursion in most systems), the <prosody contour> tag which allows finer control, and the low-level <pitch> tag which is proposed as a future extension. There is very little indication of best practice in the use of these different levels (e.g. which type of user should use which level), and no explanation of what should happen if the different levels are combined (e.g. a <pitch contour> specification inside an <emphasis> environment).
Clarify the intended resolution of conflicts between high-level and low-level markup, or explain the dangers of using both types in the same document. This would be simpler if there were two distinct levels of markup.
Proposed disposition: Accepted
This is an excellent point. We will note the dangers as you suggest. We will also note that although the behaviors of the individual elements are specified, details about how conflicts are resolved are implementation specific.
Email Trail:
From Alex Monaghan
We strongly suggest that some distinction between high-level markup (specifying the function or structure of the input) and low-level markup (specifying the form of the output) be introduced, ideally by providing two explicit markup sublanguages. The users of these sublanguages are unlikely to overlap. Moreover, while most synthesisers might support one level of markup or the other, there are currently very few synthesisers which could support both.
Perhaps two separate markup languages (high-level and low-level) should be specified. This would have the desirable side-effect of allowing a synthesiser to comply with only one level of markup, depending on the intended users.
Proposed disposition: Rejected
There are certainly complete implementations of SSML today that implement both high and low level tags. This separation is something we will consider for a later version of SSML (beyond 1.0). For this specification we will add a note that although the tags themselves may be supported, details of the interactions between the two levels are implementation specific. We will encourage developers to use caution in mixing them arbitrarily.
Email Trail:
From Alex Monaghan
The notion of "non-markup behavior" is confusing. On the one hand, there seems to be an assumption that markup will not affect the behaviour of the system outside the tags, and that the markup therefore complements the system's unmarked performance, but on the other hand there are references to "over-riding" the system's default behaviour. In general, it is unclear whether markup is intended to be superimposed on the default behaviour or to provide information which modifies that behaviour. The use of the <break> element, for instance, is apparently intended "to override the typical automatic behavior", but the insertion of a <break> tag may have non-local repercussions which are very hard to predict. Take a system which assigns prosodic boundaries stochastically, and attempts to balance the number and length of units at each prosodic level. The "non-markup behavior" of such a system might take the input "Big fat cigars, lots of money." and produce two balanced units: but will the input "Big fat <break/> cigars, lots of money." produce three unbalanced units (big fat, cigars, lots of money), or three more balanced units (big fat, cigars lots, of money), or four balanced units (big fat, cigars, lots of, money), or six single-word units, or something else? Which would be the correct interpretation of the markup?
Clarify the intended effect of tags on the default behaviour of synthesis systems. Should they be processed BEFORE the system performs its "non-markup behavior", or AFTER the default output has been calculated? Does this vary depending on the tag? Again, this may be resolved by introducing two distinct levels of markup.
Proposed disposition: Accepted with changes
This is a good point. As you surmised, the behavior does vary depending on the tag, largely because the processor has the ultimate authority to ensure that what it produces is pronounceable (and ideally intelligible). In general the markup provides a way for the author to make prosodic and other information available to the processor, typically information the processor would be unable to acquire on its own. It is up to the processor to determine whether and in what way to use the information.
We will provide some additional text to clarify this behavior.
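To make the example from the comment concrete, the sketch below parses the marked-up sentence with Python's standard `xml.etree.ElementTree` module and lists the text on either side of the `<break/>` element. It shows only what the markup itself conveys, namely one author-requested boundary; how the remaining words are grouped into prosodic units is left to the processor, as described in the response. Wrapping the sentence in a bare `<speak>` root without namespace or version attributes is a simplification, and `text_around_breaks` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# The sentence is taken from the comment above; the bare <speak> wrapper is a
# simplifying assumption for this sketch, not a complete SSML document.
SSML = "<speak>Big fat <break/> cigars, lots of money.</speak>"


def text_around_breaks(ssml: str) -> list:
    """Return the text segments delimited by top-level <break/> elements.

    Assumes <break/> is the only child element type; text following any
    other child element is ignored by this simplified sketch.
    """
    root = ET.fromstring(ssml)
    segments = [(root.text or "").strip()]
    for child in root:
        if child.tag == "break":
            segments.append((child.tail or "").strip())
    return segments
```

Calling `text_around_breaks(SSML)` returns two segments, "Big fat" and "cigars, lots of money.", so the markup specifies a single boundary and says nothing about the other groupings the commenter lists.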
Email Trail:
From Alex Monaghan
Many of the tags related to F0 presuppose that pitch is represented as a linear sequence of targets. This is the case for some synthesisers, particularly those using theories of intonation based on the work of Bruce, Ladd or Pierrehumbert. However, the equally well-known Fujisaki approach is also commonly used in synthesis systems, as are techniques involving the concatenation of natural or stylised F0 contours: in these approaches, notions such as pitch targets, baselines and ranges have very different meanings and in some cases no meaning at all. The current proposal is thus far from theory-neutral, and is not implementable in many current synthesisers.
Revise the F0 tags to allow for theory-neutral interpretation: if this is not done, the goal of interoperability across synthesis platforms cannot be achieved.
Proposed disposition: Rejected
It is outside the scope of this group to design a theory-neutral approach. We are not aware of the existence of such an approach, and so far in commercial systems we have seen considerable support for the current approach. There is also no requirement within the specification that any of the theories you mention be used in implementation. Rather, F0 variation is expressed in terms of pitch targets but can be mapped into any underlying model the processor wishes.
Email Trail:
From Alex Monaghan
There is no provision for local or language-specific additions, such as different classes of abbreviations (e.g. the distinction between a true acronym such as DEC and an abbreviation such as NEC), different types of numbers (animate versus inanimate in many languages), or the prosodic systems of tone languages. Some specific examples are discussed below, but provision for anything other than English is minimal in the current proposal. As compliant systems extend their language coverage, they should be able to add the required markup in a standard way, even if it has not been foreseen by the W3C.
Provide a mechanism for extending the standard to include unforeseen cases, particularly language-specific or multilingual requirements.
Proposed disposition: Rejected
It is difficult, if not impossible, to incorporate a generic mechanism that will work in a standard manner for all of the language features you describe, as well as for unforeseen features. It may be possible to extend the specification later on as we discover standardized ways to provide the information you suggest. We welcome your input for such future extensions.
Email Trail:
From Alex Monaghan
<say-as>: Several categories could be added to this tag, including credit card numbers (normally read in groups) and the distinction between acronyms (DEC, DARPA, NASA) and letter-by-letter abbreviations (USA, IBM, UK).
Add the categories mentioned above.
Proposed disposition: Rejected
These are good suggestions. However, we have removed all attribute values and their definitions from the <say-as> element. To avoid inappropriate assumptions about what is specified, we will also be removing the examples from the <say-as> section. We expect to begin work on specifying the details of the <say-as> element when SSML 1.0 reaches the Candidate Recommendation stage. We will consider your suggestions at that time.
Email Trail:
From Alex Monaghan
In languages with well-developed morphology, such as Finnish or Spanish, the pronunciation of numbers and abbreviations depends not only on whether they are ordinal or cardinal but also on their gender, case and even semantic properties. These are often not explicit, or even predictable, from the text. It would be advisable to extend the <say-as> tag to include an optional attribute to hold such information.
Proposed disposition: Rejected
We are aware of this issue and have considered it again in response to your input, but we are not prepared to address it at this time. As you point out, there is broad variability in the categories and structure of this information. The <say-as> element is only designed to indicate simple structure for cases where the synthesis processor is unable to determine it on its own. Where large amounts of context-dependent information would be required in order to adequately inform the processor, we would recommend not using the <say-as> element at all. Rather, we recommend that numbers and abbreviations be instead written out orthographically, as is possible with any text over which the application writer wishes absolute control.
Email Trail:
From Alex Monaghan
<voice> element: It seems unnecessary to reset all prosodic aspects to their defaults when the voice changes. This prevents the natural-sounding incorporation of direct speech using a different voice, and also makes the reading of bilingual texts (common in Switzerland, Eastern Europe, the Southern USA, and other exotic places) very awkward. Although absolute values cannot be carried over from voice to voice, it should be possible to transfer relative values (slow/fast, high/medium/low, etc.) quite easily.
Allow the option of retaining relative prosodic attributes (pitch, rate, etc.) when the voice is changed.
Proposed disposition: Accepted with changes
We agree in principle with your suggestion. We will remove the contentious paragraph and replace it with one explaining that
- relative changes in prosodic parameters are expected to be carried across voice changes, but
- different voices have different natural defaults for pitch, speaking rate, etc. because they represent different personalities, so
- absolute values of the prosodic parameters may vary across changes in the voice.
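The distinction drawn in the list above can be sketched in a few lines of Python. The voice names and default values below are invented for illustration only; real defaults are processor-specific and are not defined by the specification.

```python
# Hypothetical per-voice defaults (pitch in Hz, rate in arbitrary units).
VOICE_DEFAULTS = {
    "voice_a": {"pitch_hz": 110.0, "rate": 160.0},
    "voice_b": {"pitch_hz": 210.0, "rate": 180.0},
}


def effective_prosody(voice, pitch_factor=1.0, rate_factor=1.0):
    """Apply relative prosody settings on top of a voice's own defaults."""
    defaults = VOICE_DEFAULTS[voice]
    return {
        "pitch_hz": defaults["pitch_hz"] * pitch_factor,
        "rate": defaults["rate"] * rate_factor,
    }


if __name__ == "__main__":
    # The relative setting ("half the default rate") carries across the voice change,
    before = effective_prosody("voice_a", rate_factor=0.5)
    after = effective_prosody("voice_b", rate_factor=0.5)
    # but the absolute values differ because each voice has its own defaults.
    print(before["rate"], after["rate"])
```

In this sketch the same relative factor is kept when the voice changes, while the resulting absolute rate follows the new voice's default.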
Email Trail:
From Alex Monaghan
<break> element: Some languages have a need for more levels of prosodic boundary below a minor pause, and some applications may require boundaries above the paragraph level. It would be advisable to add an optional "special" value for these cases.
Add an optional "special" attribute to allow language-specific and application-specific extensions.
Proposed disposition: Rejected
This is a good suggestion, but it is too extensive to add to the specification at this time. This feature will be deferred to the next version of SSML.
Email Trail:
From Alex Monaghan
<prosody> element: There is currently no provision for languages with lexical tone. These include many commercially important languages (e.g. Chinese, Swedish, Norwegian), as well as most of the other languages of the world. Although tone can be specified in a full IPA transcription, the ability to specify tone alongside the orthography would be very useful.
Add an optional "tone" attribute.
Proposed disposition:
It is unclear how you would expect this to work. As you point out, this can be specified in full IPA, which is possible with the phoneme element today.
How would you envision specifying tone *alongside* the orthography?
Email Trail:
From Alex Monaghan
<rate> element: There is currently no unit of measurement for this tag. The "Words per minute" values suggested in the previous draft were at least a readily understandable measure of approximate speech rate. If their approximate nature were made explicit, these could function as indicative values and would be implementable in all synthesisers.
Proposed disposition: Rejected
Because of the difficulty in accurately defining the meaning of words per minute, syllables per minute, or phonemes per minute across all possible languages, we have decided to replace such specification with a number that acts as a multiplier of the default rate. For example, a value of 1 means a speaking rate equal to the default rate, a value of 2 means a speaking rate twice the default rate, and a value of 0.5 means a speaking rate of half the default rate. The default rate is processor-specific and will usually vary across both languages and voices. Percentage changes relative to the current rate are still permitted. Note that the effect of setting a specific words per minute rate (for languages for which that makes sense) can be achieved by explicitly setting the duration for the contained text via the duration attribute of the <prosody> element. The duration attribute can be used in this way for all languages and is therefore the preferred way of precisely controlling the rate of speech when that is desired.
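As a rough illustration of the two mechanisms described in the response above, the following Python sketch computes the duration needed to reach a target words-per-minute figure and emits a <prosody> element carrying it, alongside the multiplier form of the rate attribute. The helper names, the whitespace-based word count and the sample text are assumptions made for illustration; they are not part of the specification.

```python
from xml.sax.saxutils import escape


def duration_ms_for_wpm(text, words_per_minute):
    """Milliseconds needed to speak `text` at the requested words per minute.

    Assumption: words are whitespace-separated, which only makes sense for
    languages in which "word" is well defined.
    """
    if words_per_minute <= 0:
        raise ValueError("words_per_minute must be positive")
    word_count = len(text.split())
    return round(word_count / words_per_minute * 60_000)


def prosody_with_duration(text, words_per_minute):
    """Wrap `text` in <prosody duration="..."> with a value in milliseconds."""
    ms = duration_ms_for_wpm(text, words_per_minute)
    return f'<prosody duration="{ms}ms">{escape(text)}</prosody>'


def prosody_with_rate(text, multiplier):
    """Wrap `text` in <prosody rate="..."> where the number scales the default rate."""
    # Assumption: a zero or negative multiplier is treated as invalid here.
    if multiplier <= 0:
        raise ValueError("multiplier must be positive")
    return f'<prosody rate="{multiplier:g}">{escape(text)}</prosody>'


if __name__ == "__main__":
    print(prosody_with_duration("one two three", 180))
    print(prosody_with_rate("one two three", 0.5))
```

Three words at 180 words per minute correspond to one second of speech, so the first call requests a duration of 1000ms regardless of the processor's default rate.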
Email Trail:
From Alex Monaghan
<rate> element: It is equally important to be able to specify the dynamics of speech rate - accelerations, decelerations, constancies. These are not mentioned in the current proposal.
Proposed disposition: Rejected
These are good suggestions, but they are too extensive to add to the specification at this time. These features will be deferred to the next version of SSML.
Email Trail:
From Alex Monaghan
<audio> element: Multimodal systems (e.g. animations) are likely to require precise synchronisation of audio, images and other resources. This may be beyond the scope of the proposed standard, but could be included in the <lowlevel> tag.
Consider a <lowlevel> extension to allow synchronisation of speech with other resources.
Proposed disposition: Rejected
As you suggest, this class of additions is outside the scope of the specification. We think it likely that other specifications such as SMIL would be more appropriate for this functionality. To the best of our knowledge, there are no major technical problems with integration of SMIL and SSML functionality.
Email Trail:
From Dave Pawson
Firstly a little background.
We have been using Text to Speech for about 18 months,
to produce alternative media for visually impaired customers.
We have learned over that time just what type of material
is suitable.
Our needs are:
XML source.
Ability to insert external audio files into the audio stream
(audible navigation points, tone bursts at 55 hz which are
findable when tape is played fast forward).
Ability to add to a dictionary / word set those words which
the synth gets wrong.
Ability to id and have spoken correctly standard items such
as dates, acronyms etc.
Proposed disposition: Accepted/Rejected
SSML 1.0 is based on XML.
It is possible to insert external audio files into the audio stream using the <audio> element.
It is possible, via the <lexicon> element, to add to a lexicon those words which the synth gets wrong.
We have removed the specification for interpretation hints for dates, etc. (part of the <say-as> element) but intend to reactivate that work as a separate activity when SSML 1.0 reaches the Candidate Recommendation stage. We will consider your suggestion "Ability to id and have spoken correctly standard items such as dates, acronyms etc." at that time.
Email Trail:
From Dave Pawson
We use silences to good effect, as user research has shown.
I'd love to see <break time="2S"/>
Proposed disposition: Accepted
This capability is in the most recent draft of the specification.
Email Trail:
From Dave Pawson
[Ed: this is with regard to the <prosody> rate attribute]
Provide a rate of 1 to 100, let the synth people interpret that for their engines, and users select appropriately by experiment.
Proposed disposition: Rejected
Because of the difficulty in accurately defining the meaning of words per minute, syllables per minute, or phonemes per minute across all possible languages, we have decided to replace such specification with a number that acts as a multiplier of the default rate. For example, a value of 1 means a speaking rate equal to the default rate, a value of 2 means a speaking rate twice the default rate, and a value of 0.5 means a speaking rate of half the default rate. The default rate is processor-specific and will usually vary across both languages and voices. Percentage changes relative to the current rate are still permitted. Note that the effect of setting a specific words per minute rate (for languages for which that makes sense) can be achieved by explicitly setting the duration for the contained text via the duration attribute of the <prosody> element. The duration attribute can be used in this way for all languages and is therefore the preferred way of precisely controlling the rate of speech when that is desired.
This approach differs notably from your suggestion in that there is no maximum rate value. If this particular feature (maximum rate value) is important for you, could you provide some sample use cases?
Email Trail:
From Dave Pawson
A command line application.
application xml-file, output audio file.
(rather than being buried in some app I haven't a clue about).
? Off topic I suppose :-)
Proposed disposition: Rejected
This is out of scope for the language, although nothing prevents the use of an SSML processor to send its generated audio stream to a file.
Email Trail:
From Dave Pawson
2. That the tts system be able to analyse a document and tell me that it has never heard of word xxxx, rather than creating a 40 minute recording, only to find that after 35 minutes, a word has to be added to its dictionary and the whole job done again. That's a real bummer.
Even if it pulled them, pronounced each one so that we could check that it was a good guess/risible would help.
Or it pulled them, marked them up and exported them as XML for that level of checking!
Proposed disposition: Rejected
These suggestions are all for the processor and not the language. As such, they are out of scope.
Email Trail:
From Dave Pawson
Analysis: 1.2, list item 4, para 3.
"TTS systems are expert at performing text-to-phoneme conversions
so most words of most documents can be handled automatically".
Rather too sweeping for my liking. Certainly not the case for
the systems I've seen :-)
The VBWG asked for a specific suggestion, which Dave then provided.
Proposed disposition: Accepted with changes
We believe there is a misunderstanding that is simple to correct. There is already an ability in the specification to adjust pronunciation both internally via the phoneme element and externally via a lexicon. We agree there are times when one needs a lexicon. By placing better pronunciations for words in an external lexicon, the processor will automatically use the values in the lexicon over its own defaults without any additional markup (except for the single use of the <lexicon> element at the top of the document that points to the lexicon definition file). We also agree that the specification wording you quote unintentionally implies a claim about the quality of today's synthesis technology. To correct this, we will change "are expert at performing" to "are designed to perform".
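The single use of the <lexicon> element mentioned in the response above might look as follows when a document is generated programmatically. This is a minimal Python sketch; the function name, the sample text and the lexicon URI are hypothetical, and the content of the lexicon file itself is outside the scope of the sketch.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Serialize SSML elements with a default namespace instead of a prefix.
ET.register_namespace("", SSML_NS)


def ssml_with_lexicon(text, lexicon_uri, lang="en-US"):
    """Build a <speak> document whose first child points at an external lexicon."""
    speak = ET.Element(
        f"{{{SSML_NS}}}speak",
        {"version": "1.0", f"{{{XML_NS}}}lang": lang},
    )
    lexicon = ET.SubElement(speak, f"{{{SSML_NS}}}lexicon", {"uri": lexicon_uri})
    # The text to be spoken follows the lexicon reference; no per-word markup is added.
    lexicon.tail = text
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical lexicon location.
    print(ssml_with_lexicon("Hello from the W3C.", "http://www.example.com/lexicon.file"))
```

The text itself carries no pronunciation markup; any corrections are expected to come from the referenced lexicon.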
Email Trail:
From Dave Pawson
2.4 Sub attribute.
A nice feature for a user would be to permit these to be collated
externally, and passed in as a sort of configuration file.
It would save typing for regularly repeated occurrences.
<sub>
<el>W3C</el>
<use>World Wide Web Consortium</use>
</sub>
or something similar?
Proposed disposition: Rejected
This request is similar to some earlier work by the Voice Browser Working Group on a standardized lexicon format (containing pronunciations for tokens and phrases). Your request is one that might best be considered for that effort if and when it re-activates. We encourage you to resubmit this request to the Working Group at that time.
Email Trail:
From Dave Pawson
2.8 break element.
A refinement on this would be the ability to explicitly state
the required duration for various punctuation elements and other
break types (paragraph, sentence).
Again suggest this be externally configurable, for re-use optimisation.
Proposed disposition: Rejected
This concept has been considered but rejected as part of SSML 1.0. Rather, we encourage the use of style sheets or transformations to enable this macro-like behavior. It is possible that future versions of SSML beyond version 1.0 could permit default value setting for items such as paragraph and sentence prosody, but this kind of manipulation today is discouraged by most commercial synthesis engine developers on anything other than the occasional basis enabled by the <break> element.
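One way to obtain the externally configurable behavior through a transformation, as the response above encourages, is sketched below in Python. The configuration table, the function name and the chosen durations are assumptions for illustration; an XSLT style sheet could achieve the same effect.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"

# Serialize SSML elements with a default namespace instead of a prefix.
ET.register_namespace("", SSML_NS)

# Hypothetical external configuration: pause to add at the end of each element type.
BREAK_AFTER = {"s": "400ms", "p": "1s"}


def add_configured_breaks(ssml, config=BREAK_AFTER):
    """Return `ssml` with a <break time="..."/> appended as the last child of each configured element."""
    root = ET.fromstring(ssml)
    for local_name, time_value in config.items():
        # Materialize the matches first so the tree is not modified while iterating.
        for element in list(root.iter(f"{{{SSML_NS}}}{local_name}")):
            ET.SubElement(element, f"{{{SSML_NS}}}break", {"time": time_value})
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    document = (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xml:lang="en-US"><p><s>One.</s><s>Two.</s></p></speak>'
    )
    print(add_configured_breaks(document))
```

Keeping the durations in a separate table means the source documents stay free of pause markup, and the same table can be reused across documents.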
Email Trail:
From Dave Pawson
2.10 Usage note 1. Could be confusion between this and 3.2.
If the default is to pause conversion till the audio is complete,
then it should be explicitly stated here. I support that requirement btw.
Proposed disposition: Accepted
We have removed the Future Study text from the document. Playback of recorded audio occurs in sequence with preceding and following synthesis, matching what you prefer. To obtain background playing, mixing, etc. we would recommend using SMIL.
Email Trail:
From Dave Pawson
2.12 usage note. Why hasn't a namespace been explicitly called up?
This would then nullify the requirement stated in 5.1 (I can't see
any need for that requirement. Is it justifiable?)
Proposed disposition: Accepted
The most recent draft of the specification contains a namespace definition and more careful conformance language with respect to non-standard extensions.
Email Trail:
From Dave Pawson
3.6 Value.
I suspect that the overall impact of this may be achieved by a simple
XSLT transform anyway, which may make this redundant?
Proposed disposition: Accepted
We have removed this text from the specification. Such functionality is expected to be achieved through the use of style sheets (ACSS/XSLT), as you suggest.
Email Trail:
From Alex Monaghan
Subject: SSML specification: <say-as type="date:dml">
i don't think anyone has mentioned this before - apologies if i'm wrong! for some reason, "dm" has been omitted as a format value for the "date" value of the "type" attribute of the "say-as" element. still with me? this is probably the most common way of writing a date in most european languages: a meeting on 13/5, a conference from 30/11 to 3/12, etc. i assume there's no good reason not to include it - can it be added, please?
Proposed disposition: Rejected
This value has been discussed within the Working Group. It will be considered when the group resumes definition of the values for the <say-as> attributes, currently planned to begin when the SSML 1.0 specification reaches the Candidate Recommendation stage.
Email Trail:
From Adhemar Vandamme
Why does the SSML need a mark-tag with name-attribute to place a marker
into the text/tag sequence and contain text that is used to reference a
special sequence of tags and text, either for internal reference within
the SSML document, or externally by another document?
Can someone explain to me why this can't be done with an id-attribute in
an arbitrary tag, like in many other XML specifications (e.g. XHTML)?
If the text that should be referenced is not enclosed in a tag yet, I
suggest using a span-tag, for consistency with XHTML.
I give an example:
<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <p id="marks">
    <s>
      We would like
      <span id="congrats"> to extend our warmest congratulations </span>
      to the members of the Voice Browser Working Group!
    </s>
    <s id="really"> Really, we would. </s>
  </p>
  <p>
    <s> Go from <span id="here" /> here, to <span id="there" /> there! </s>
  </p>
</speak>
Herein, a full paragraph, a part of a sentence, a full sentence and two
"moments" are marked, using an id-attribute in a p-tag or an s-tag when
available, and in a span-tag otherwise.
Proposed disposition: Rejected
Upon consideration it has become even clearer to us that mark labels should not be of type xsd:ID at all. The uniqueness constraint on IDs is a hindrance rather than a benefit: it may well be desirable to use the same mark label repeatedly (equivalent to repeatedly sending back the same event). We also wish to permit integer labels, which the ID syntax does not allow. Because we want fewer restrictions than those introduced by IDs, we have decided to change the name attribute to be of type xsd:token.
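The practical effect of choosing xsd:token can be sketched in Python. The check below reflects our understanding of the xsd:token value space (no tab, line feed or carriage return, no leading or trailing spaces, no runs of two or more spaces); the function names are hypothetical and the sketch is not a substitute for schema validation.

```python
import re

# Characters an xsd:token value may not contain: tab, line feed, carriage return.
_FORBIDDEN = re.compile(r"[\t\n\r]")


def is_xsd_token(value):
    """True if `value` is already in the xsd:token value space."""
    if _FORBIDDEN.search(value):
        return False
    if value != value.strip(" "):
        return False
    return "  " not in value


def collect_mark_names(names):
    """Keep mark names that are valid tokens; duplicates are deliberately kept."""
    return [name for name in names if is_xsd_token(name)]


if __name__ == "__main__":
    # A repeated label and an integer label are both acceptable as tokens.
    print(collect_mark_names(["here", "here", "42", " padded"]))
```

Neither the repeated label nor the integer label would be acceptable as an ID, which is the restriction the change of type is meant to lift.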
Email Trail:
From Sobia Mahmud
1. Firstly in the Voice tag , what does the Name and the variant attributes do. Are they essential part of the voice tag?
Proposed disposition: N/A
The name and variant attributes are indeed essential. There are many criteria that can be used to select a specific voice, more than we wish to standardize. For application writers who have precise requirements as to the voices to use, the name and variant attributes provide an ability to be as precise as the underlying processor requires. As an example, one might use the name attribute to select a specific named voice, e.g. "BobSmith1".
Email Trail:
From Sobia Mahmud
2. In the prosody tag, it is explained that pitch contour comprises of the interval and the target. What input is user expected to provide in order to define the pitch contour?
Proposed disposition: N/A
The subsection "Pitch Contour" in Section 2.2.4 describes the format for contour specifications. Is there a particular part of this description that you find incomplete?
Email Trail:
From Max Froumentin
1. Why is the xml declaration mandatory? This goes against the XML
conformance rules, and it means that a standard XML parser could
not be used as it would accept the absence of a declaration. Since
this is mentioned twice, I imagine that the WG had a good reason
to do so, and it would be nice to find why in the spec.
1.5. Similarly, why is the SSML namespace declaration mandatory?
1.6 Section 3.1 seems to mandate the use of xsi as the prefix of
schemaLocation.
Proposed disposition: Accepted/Question
1. The xml declaration is not intended to be mandatory. We will correct the error.
1.5 We do not understand your concern with the SSML namespace declaration. Can you elaborate?
1.6 The section 3.1 text regarding the prefix for schemaLocation will be changed to permit any prefix to be defined for the Schema schema.
Email Trail:
From Max Froumentin
2. Why do all the examples link to the schema? It makes them
less easy to read, and gives the impression that schemaLocation
is mandatory.
Proposed disposition: Rejected
We have received comments from other reviewers that our examples should be complete stand-alone documents. As a result, the Voice Browser Working Group has taken the following position with respect to all of its specifications:
We recommend, but do not require, the use of schema. For that reason, our examples all contain references to the SSML schema.
We will clarify this in the specification.
Email Trail:
From Max Froumentin
Analysis: 3. I have trouble understanding this, in 2.1.5: "It is an error if a
value for alphabet is specified that is not known or cannot be
applied by an SSML processor.", where "error" is defined as a violation
of the spec.
The test above indicates that values other than 'ipa' are allowed
for alphabet, so this would mean that if a processor doesn't
understand the value "xyz" (which a SSML producer has just come up
with), then the processor violates the spec?
VBWG responded that the SSML document violates the spec in this case and that error reporting is not required.
Max replied:
So an SSML document can violate the specification because it has a value for 'alphabet' that is not given by a given processor but works in another? I would instead say that a conformant processor may not support a given alphabet but must report an error. Maybe the QA people at W3C could help clarify.
Note that in 2.1.2 'conformant' is used. It should be 'conforming'.
Proposed disposition: N/A
We propose to make the following changes:
- drop the text ", where "error" is defined as a violation of the spec." in 2.1.5 and just make sure that the word "error" earlier in the sentence is linked to our definition of error
- remove the text "A violation of the rules of this specification;" in the definition of error in section 1.5.
Email Trail:
From Max Froumentin
Analysis: 4. in 2.2.1, the age attribute is defined as being of type "integer".
that should be positive integer.
The style used for '(integer)' seems to indicate a formal reference
to a type. If it were, this would be more accurately described as
XML Schema's nonNegativeInteger. Ditto for the variant attribute which would
have to refer to xsd:integer
The VBWG agreed that type definitions should be more precise.
Max rejected the response as being inadequate.
Proposed disposition: Accepted with changes
We accept the point you're making and will add some text that more precisely identifies the type, although it may not be exactly the text you gave.
Email Trail:
From Max Froumentin
5. "Durations follow the "Times" attribute format from the [CSS2]
specification". I think this should be phrased as: "Durations
follow the <time> basic data type from the [CSS2] specification".
Proposed disposition: Accepted
We will correct this.
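For readers unfamiliar with the CSS2 time format referred to above, the following Python sketch converts such values to milliseconds. It assumes an unsigned decimal number immediately followed by the unit "ms" or "s"; the function name and the decision to reject signs, whitespace and other units are assumptions of this sketch rather than statements about the specification.

```python
import re

# Assumed lexical form: an unsigned decimal number immediately followed by "ms" or "s".
_TIME = re.compile(r"(\d*\.\d+|\d+)(ms|s)")


def time_to_ms(value):
    """Convert a time value such as "250ms" or "3s" to milliseconds."""
    match = _TIME.fullmatch(value)
    if match is None:
        raise ValueError(f"not a recognized time value: {value!r}")
    number, unit = match.groups()
    return float(number) * (1000.0 if unit == "s" else 1.0)


if __name__ == "__main__":
    for sample in ("250ms", "3s", "1.5s"):
        print(sample, time_to_ms(sample))
```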
Email Trail:
From Max Froumentin
6. The definition of number in 2.2.4
"A number is a simple floating point value without exponentials."
insert 'positive'. (sorry to be pedantic ;-)
Proposed disposition: Accepted
We will make this change.
Email Trail:
From Max Froumentin
7. the name of the <mark> element seems like an element of type ID.
why not define it as such (see XML 1.0). This would give you the
extra check (from the XML parser) that a name must not appear more
than once.
Proposed disposition: Rejected
We have decided to change the type to xsd:token because we did not want the syntax and uniqueness restrictions imposed by an ID.
It will be defined as a token in the spec.
Email Trail: