W3C

SSML 1.0: Last Call Disposition of Comments

This version:
December 18, 2003
Editor:
Daniel C. Burnett, Nuance

Abstract

This document details the responses made by the Voice Browser Working Group to issues raised during the Last Call (beginning 2 December 2002 and ending 15 January 2003) review of Speech Synthesis Markup Language (SSML) Version 1.0. Comments were provided by Voice Browser Working Group members, other W3C Working Groups, and the public via the www-voice@w3.org (archive) mailing list.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This document of the W3C's Voice Browser Working Group describes the disposition of comments as of December 4, 2003 on Speech Synthesis Markup Language (SSML) Version 1.0 Last Call. It may be updated, replaced or rendered obsolete by other W3C documents at any time.

Comments on this document and requests for further information should be sent to the Working Group's public mailing list www-voice@w3.org (archive). Note that, as a precaution against spam, you must first subscribe to this list by sending an email to <www-voice-request@w3.org> with the word subscribe in the subject line (include the word unsubscribe if you want to unsubscribe).

This document has been produced as part of the W3C Voice Browser Activity, following the procedures set out in the W3C Process. The authors of this document are members of the Voice Browser Working Group (W3C Members only).

Table of Contents

1. Introduction
2. Comments

1. Introduction

This document describes the disposition of comments in relation to the Speech Synthesis Markup Language (SSML) Version 1.0 (http://www.w3.org/TR/2002/WD-speech-synthesis-20021202/). Each issue is described by the name of the commenter, a description of the issue, and either the resolution or the reason that the issue was not resolved.

Notation: Each original comment is tracked by a Speech Synthesis Change Request [SSCR] designator. Each point within that original comment is identified by a point number. For example, "SSCR5-1" is the first point in the fifth change request for the specification.

2. Comments

Item    Commenter    Proposed disposition    Status
SSCR122-1    Alberto Ciaramella    Rejected    Implicitly accepted   
SSCR122-2    Alberto Ciaramella    Accepted    Implicitly accepted   
SSCR122-3    Alberto Ciaramella    Rejected    Implicitly accepted   
SSCR123-1    Bob Edgar    Rejected    Implicitly accepted   
SSCR124-1    Susan Lesch    Accepted    Implicitly accepted   
SSCR124-2    Susan Lesch    Accepted with changes    Implicitly accepted   
SSCR124-3    Susan Lesch    Accepted    Implicitly accepted   
SSCR125-1    Susan Lesch    Accepted with changes    Implicitly accepted   
SSCR125-2    Susan Lesch    Accepted    Implicitly accepted   
SSCR125-3    Susan Lesch    Accepted    Implicitly accepted   
SSCR125-4    Susan Lesch    Accepted with changes    Implicitly accepted   
SSCR126-1    Alex Monaghan    Rejected    Implicitly accepted   
SSCR126-2    Alex Monaghan    Accepted    Implicitly accepted   
SSCR126-3    Alex Monaghan    Rejected    Implicitly accepted   
SSCR126-4    Alex Monaghan    Accepted with changes    Implicitly accepted   
SSCR126-5    Alex Monaghan    Rejected    Implicitly accepted   
SSCR126-6    Alex Monaghan    Rejected    Implicitly accepted   
SSCR126-7    Alex Monaghan    Rejected    Implicitly accepted   
SSCR126-8    Alex Monaghan    Rejected    Implicitly accepted   
SSCR126-9    Alex Monaghan    Accepted with changes    Implicitly accepted   
SSCR126-10    Alex Monaghan    Rejected    Implicitly accepted   
SSCR126-11    Alex Monaghan        Implicitly accepted   
SSCR126-12    Alex Monaghan    Rejected    Implicitly accepted   
SSCR126-13    Alex Monaghan    Rejected    Implicitly accepted   
SSCR126-14    Alex Monaghan    Rejected    Implicitly accepted   
SSCR127-1    Dave Pawson    Accepted/Rejected    Implicitly accepted   
SSCR127-2    Dave Pawson    Accepted    Implicitly accepted   
SSCR127-3    Dave Pawson    Rejected    Implicitly accepted   
SSCR128-1    Dave Pawson    Rejected    Implicitly accepted   
SSCR128-2    Dave Pawson    Rejected    Implicitly accepted   
SSCR129-1    Dave Pawson    Accepted with changes    Accepted   
SSCR129-2    Dave Pawson    Rejected    Implicitly accepted   
SSCR129-3    Dave Pawson    Rejected    Implicitly accepted   
SSCR129-4    Dave Pawson    Accepted    Implicitly accepted   
SSCR129-5    Dave Pawson    Accepted    Implicitly accepted   
SSCR129-6    Dave Pawson    Accepted    Implicitly accepted   
SSCR130-1    Alex Monaghan    Rejected    Implicitly accepted   
SSCR131-1    Adhemar Vandamme    Rejected    Implicitly accepted   
SSCR132-1    Sobia Mahmud    N/A    Implicitly accepted   
SSCR132-2    Sobia Mahmud    N/A    Implicitly accepted   
SSCR133-1    Max Froumentin    Accepted/Question    Accepted   
SSCR133-2    Max Froumentin    Rejected    Accepted   
SSCR133-3    Max Froumentin    N/A    Accepted   
SSCR133-4    Max Froumentin    Accepted with changes    Accepted   
SSCR133-5    Max Froumentin    Accepted    Accepted   
SSCR133-6    Max Froumentin    Accepted    Accepted   
SSCR133-7    Max Froumentin    Rejected    Accepted   
SSCR133-8    Max Froumentin    Accepted    Accepted   
SSCR133-9    Max Froumentin    Accepted    Accepted   
SSCR133-10    Max Froumentin    N/A    Accepted   
SSCR134-1    Dave Pawson    Rejected    Accepted   
SSCR134-2    Dave Pawson    Accepted/Rejected    Accepted   
SSCR135-1    Dan Brickley    Accepted    Implicitly accepted   
SSCR135-2    Dan Brickley    N/A    Implicitly withdrawn   
SSCR136-1    Susan Lesch    Accepted    Accepted   
SSCR136-2    Susan Lesch    N/A    Implicitly accepted   
SSCR136-3    Susan Lesch    Rejected    Accepted   
SSCR136-4    Susan Lesch    Accepted    Accepted   
SSCR137-1    Dave Pawson    N/A    Accepted   
SSCR138-1    Marc Schroeder    Accepted with changes    Implicitly accepted   
SSCR139-1    Alex Monaghan    Accepted with changes    Implicitly accepted   
SSCR140-1    Andrew Thompson    N/A    Implicitly accepted   
SSCR140-2    Andrew Thompson    Rejected    Implicitly accepted   
SSCR140-3    Andrew Thompson    N/A    Implicitly accepted   
SSCR140-4    Andrew Thompson    Rejected    Implicitly accepted   
SSCR140-5    Andrew Thompson    Accepted/Rejected    Implicitly accepted   
SSCR140-6    Andrew Thompson    Rejected    Implicitly accepted   
SSCR140-7    Andrew Thompson    N/A    Implicitly accepted   
SSCR140-8    Andrew Thompson    N/A    Implicitly accepted   
SSCR141-1    Al Gilman    Rejected    Accepted   
SSCR141-2    Al Gilman    Accepted    Accepted   
SSCR141-3    Al Gilman    Rejected    Accepted   
SSCR141-4    Al Gilman    Accepted with changes    Accepted   
SSCR142-1    WAI-PF    Accepted with changes    Implicitly accepted   
SSCR143-1    Alex Monaghan    Accepted with changes    Implicitly accepted   
SSCR144-1    Richard Schwerdtfeger    Rejected    Accepted   
SSCR145-1    I18N Interest Group    Rejected    Accepted   
SSCR145-2    I18N Interest Group    Rejected    Accepted   
SSCR145-3    I18N Interest Group    N/A    Accepted
SSCR145-4    I18N Interest Group    N/A    Accepted   
SSCR145-5    I18N Interest Group    Accepted    Accepted   
SSCR145-6    I18N Interest Group    Accepted    Accepted   
SSCR145-7    I18N Interest Group    Accepted    Accepted   
SSCR145-8    I18N Interest Group    Accepted    Accepted   
SSCR145-9    I18N Interest Group    Accepted    Accepted   
SSCR145-10    I18N Interest Group    Accepted    Accepted   
SSCR145-11    I18N Interest Group    Accepted with changes    Accepted   
SSCR145-12    I18N Interest Group    Accepted    Accepted   
SSCR145-13    I18N Interest Group    Accepted    Accepted   
SSCR145-14    I18N Interest Group    Accepted    Accepted   
SSCR145-15    I18N Interest Group    Accepted    Accepted   
SSCR145-16    I18N Interest Group    Accepted    Accepted   
SSCR145-17    I18N Interest Group    Accepted    Accepted   
SSCR145-18    I18N Interest Group    Accepted    Accepted   
SSCR145-19    I18N Interest Group    Accepted with changes    Accepted   
SSCR145-20    I18N Interest Group    Rejected    Accepted   
SSCR145-21    I18N Interest Group    Accepted    Accepted   
SSCR145-22    I18N Interest Group    Accepted with changes    Accepted   
SSCR145-23    I18N Interest Group    Accepted    Accepted   
SSCR145-24    I18N Interest Group    Accepted    Accepted   
SSCR145-25    I18N Interest Group    N/A    Accepted   
SSCR145-26    I18N Interest Group    Accepted    Accepted   
SSCR145-27    I18N Interest Group    Accepted    Accepted   
SSCR145-28    I18N Interest Group    Rejected    Withdrawn   
SSCR145-29    I18N Interest Group    Accepted    Accepted   
SSCR145-30    I18N Interest Group    N/A    Accepted   
SSCR145-31    I18N Interest Group    N/A    Accepted   
SSCR145-32    I18N Interest Group    N/A    Accepted   
SSCR145-33    I18N Interest Group    N/A    Accepted   
SSCR145-34    I18N Interest Group    Rejected    Accepted   
SSCR145-35    I18N Interest Group    Accepted with changes    Accepted   
SSCR145-36    I18N Interest Group    Rejected    Accepted   
SSCR145-37    I18N Interest Group    Rejected    Accepted   
SSCR145-38    I18N Interest Group    Accepted    Accepted   
SSCR145-39    I18N Interest Group    Accepted    Accepted   
SSCR145-40    I18N Interest Group    Accepted    Accepted   
SSCR145-41    I18N Interest Group    Accepted    Accepted   
SSCR145-42    I18N Interest Group    Accepted    Accepted   
SSCR145-43    I18N Interest Group    Accepted    Accepted   
SSCR145-44    I18N Interest Group    Accepted    Accepted   
SSCR145-45    I18N Interest Group    Accepted with changes    Accepted   
SSCR145-46    I18N Interest Group    Accepted/Rejected    Accepted   
SSCR145-47    I18N Interest Group    Rejected    Accepted   
SSCR145-48    I18N Interest Group    Rejected    Accepted   
SSCR145-49    I18N Interest Group    Accepted    Accepted   
SSCR145-50    I18N Interest Group    Accepted    Accepted   
SSCR145-51    I18N Interest Group    N/A    Accepted   
SSCR145-52    I18N Interest Group    Accepted    Accepted   
SSCR145-53    I18N Interest Group    N/A    Accepted   
SSCR145-54    I18N Interest Group    Accepted    Withdrawn   
SSCR145-55    I18N Interest Group    Accepted    Accepted   
SSCR145-56    I18N Interest Group    Accepted    Accepted   
SSCR145-57    I18N Interest Group    Rejected    Accepted   
SSCR145-58    I18N Interest Group    N/A    Withdrawn   
SSCR145-59    I18N Interest Group    Accepted with changes    Accepted   
SSCR145-60    I18N Interest Group    Accepted    Accepted   
SSCR145-61    I18N Interest Group    Accepted    Accepted   
SSCR145-62    I18N Interest Group    Accepted    Accepted   
SSCR145-63    I18N Interest Group    Accepted    Accepted   
SSCR145-64    I18N Interest Group    Accepted    Accepted   
SSCR145-65    I18N Interest Group    Accepted    Accepted   
SSCR145-66    I18N Interest Group    Accepted    Accepted   
SSCR145-67    I18N Interest Group    Rejected    Accepted   
SSCR145-68    I18N Interest Group    Accepted with changes    Accepted   
SSCR145-69    I18N Interest Group    N/A    Accepted   
SSCR145-70    I18N Interest Group    Accepted    Accepted   
SSCR145-71    I18N Interest Group    Accepted    Accepted   
SSCR145-72    I18N Interest Group    Accepted    Accepted   
SSCR145-73    I18N Interest Group    Accepted    Accepted   
SSCR145-74    I18N Interest Group    N/A    Accepted
SSCR145-75    I18N Interest Group    Accepted    Accepted   
SSCR145-76    I18N Interest Group    Accepted    Accepted   
SSCR145-77    I18N Interest Group    Accepted    Accepted   
SSCR145-78    I18N Interest Group    Accepted    Accepted   
SSCR145-79    I18N Interest Group    Accepted    Accepted   
SSCR145-80    I18N Interest Group    Accepted    Accepted   
SSCR145-81    I18N Interest Group    Accepted with changes    Accepted   
SSCR145-82    I18N Interest Group    Accepted    Accepted   
SSCR145-83    I18N Interest Group    N/A    Accepted   
SSCR145-84    I18N Interest Group    Accepted    Accepted   
SSCR145-85    I18N Interest Group    N/A    Accepted   
SSCR145-86    I18N Interest Group    Accepted    Accepted   
SSCR145-87    I18N Interest Group    Accepted    Accepted   
SSCR145-88    I18N Interest Group    Accepted    Accepted   
SSCR145-89    I18N Interest Group    Rejected    Accepted   
SSCR146-1    XML Schema WG    Accepted/Rejected    Implicitly accepted   
SSCR151-1    Dave Pawson    Rejected    Accepted   

Issue SSCR122-1

From Alberto Ciaramella

Here follow my comments about the Speech Synthesis mark up language specification of the Speech Interface Framework, draft dated 3 January 2001.

- 1) paragraph 1.2. It shows that the processing, in different stages, is influenced both by the mark-up support and by non-mark-up behaviour. I suggest adding here, as a general rule, that "explicit mark-up always takes precedence over non-mark-up behaviour". This kind of rule in the present version of the document is presented as usage note 2 at the end of 2.4, but it is definitely more general than that.

Proposed disposition: Rejected

Behavior in the specification is determined on an element-by-element basis because the markup in some cases might try to do something which the engine knows to be inappropriate. As an example, a prosody contour with sequential pitch targets that vary wildly will not be observed very closely by any commercial engine because the audio would be exceedingly unnatural and likely unintelligible. Additionally, requiring the markup behavior to take precedence would be difficult to enforce without audio checks that measure not just conformance, but performance. We do not believe it is appropriate for the specification to render too fine an opinion on performance.
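
[Ed: for illustration only, a hypothetical contour of the kind described in this response. The pitch values are invented; a processor may approximate or ignore targets it judges unnatural.]

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
 xml:lang="en-US">
  <!-- wildly alternating pitch targets; engines will not track these closely -->
  <prosody contour="(0%,+60%)(25%,-50%)(50%,+60%)(75%,-50%)(100%,+60%)">
    This sentence requests extreme pitch movement.
  </prosody>
</speak>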

Email Trail:



Issue SSCR122-2

From Alberto Ciaramella

Here follow my comments about the Speech Synthesis mark up language specification of the Speech Interface Framework, draft dated 3 January 2001.

- 2) paragraph 1.2 point 6: waveform production mark-up support. I do not agree that "the TTS markup does not provide explicit controls over the generation of the waveforms". In fact, with the mark-up already introduced in point 5 for controlling the prosody, you can control both the volume and the speed.

Proposed disposition: Accepted

We will remove this sentence.
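
[Ed: a minimal sketch of the prosody-based waveform control the commenter refers to, using attribute values from the current draft.]

<prosody volume="loud" rate="fast">
  This text is rendered both louder and faster.
</prosody>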

Email Trail:



Issue SSCR122-3

From Alberto Ciaramella

Here follow my comments about the Speech Synthesis mark up language specification of the Speech Interface Framework, draft dated 3 January 2001.

- 3) Beyond this [comment 122-2], still in paragraph 1.2 point 6, it is advisable to identify whether a sentence can or cannot be interrupted by barge-in. This feature is present in section 4.1.5 of the document "Voice Extensible mark up language", version 2. This poses another, more general issue: what is the relationship between the document "Speech Synthesis mark up language" and chapter 4 (System Output) of VoiceXML version 2? It must be made explicit, taking care not to duplicate the definitions between these documents, in order to simplify document maintenance.

Proposed disposition: Rejected

We believe this comment has been addressed by changes to both SSML and VoiceXML. Although examples of SSML embedded in other languages are appropriate for this document, specific details are not. Barge-in behavior, for example, is outside the scope of this specification.

Email Trail:



Issue SSCR123-1

From Bob Edgar

Comment on:
Speech Synthesis Markup Language Specification for the Speech Interface Framework
W3C Working Draft 3 January 2001
http://www.w3.org/TR/2001/WD-speech-synthesis-20010103

DTD gives "number:ordinal" and "number:digits" as valid say-as types, but not plain "number"; however, the example in 2.4 has <say-as type="number">. According to my reading of the XML specification, and also according to Microsoft's validating parser, this is not allowed by the DTD; you would have to explicitly allow "number" -- there is no rule that says you can match a prefix of the attribute value. The same issue applies to the date, time and duration types.

Bob.

Proposed disposition: Rejected

We agree there is an error. We rejected this request only because the <say-as> element no longer has this level of detail.
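
[Ed: a sketch of the mismatch the commenter describes. The enumeration below is abridged and hypothetical, not the actual DTD text from the draft.]

<!ATTLIST say-as
  type (number:ordinal | number:digits | date:dmy) #REQUIRED >

<!-- a validating parser rejects the bare prefix, because enumerated
     attribute values must match exactly -->
<say-as type="number">123</say-as>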

Email Trail:



Issue SSCR124-1

From Susan Lesch

Just a few comments about the speech grammar and synthesis Last Call Working Drafts [1,2].

[1] http://www.w3.org/TR/2001/WD-speech-grammar-20010103/
[2] http://www.w3.org/TR/2001/WD-speech-synthesis-20010103/

The markup, embedded CSS, and overall presentation are well done and easy to follow. One suggestion is to mark up elements and attributes as XHTML code, for example, <code>item</code>, rather than quote them as "item". No added color would be necessary.

Proposed disposition: Accepted

We will mark up SSML elements and attributes along the lines of the approach used in the XHTML2 Working Drafts.

Email Trail:



Issue SSCR124-2

From Susan Lesch

Just a few comments about the speech grammar and synthesis Last Call Working Drafts [1,2].

[1] http://www.w3.org/TR/2001/WD-speech-grammar-20010103/
[2] http://www.w3.org/TR/2001/WD-speech-synthesis-20010103/

(2) Regarding the embedded style, I have been told that hex values are better supported in old browsers than RGB. In Netscape 4.x Mac, "list-style: none;" renders as a question mark for each bullet and should be omitted. In IE 3.x Mac, "background" color is supported; ("background-color" is not). Also, when declaring a background color, a text color is needed; #000 would be fine.

Proposed disposition: Accepted with changes

Your point about cross-browser styling is a good one. From the next draft of the specification onwards we will be following Dave Raggett's style guide as much as possible.

Email Trail:



Issue SSCR124-3

From Susan Lesch

Just a few comments about the speech grammar and synthesis Last Call Working Drafts [1,2].

[1] http://www.w3.org/TR/2001/WD-speech-grammar-20010103/
[2] http://www.w3.org/TR/2001/WD-speech-synthesis-20010103/

(3) Finally, both Working Drafts have extensive "future study" sections. When and if the drafts move to higher maturity levels, I think these sections should be cut and moved to another location, possibly linked from the Voice Browser Activity home page. Somehow speculating would seem out of place in a Recommendation. What do you think?

Proposed disposition: Accepted

This section has already been significantly reduced in the last draft. We agree that it should be removed entirely as the document moves closer to Recommendation.

Email Trail:



Issue SSCR125-1

From Susan Lesch

(1) The title "Speech Synthesis Markup Language Specification for the
Speech Interface Framework" is quite long. How does "Speech
Synthesis Markup Language (SSML) Specification Version 1.0" sound?

Proposed disposition: Accepted with changes

We have accepted this with a slight modification: Speech Synthesis Markup Language Version 1.0.

Email Trail:



Issue SSCR125-2

From Susan Lesch

(2) The references need development and could work without hopping to
other specs. In 1., JSML can be a link [JSML] to your normative
references in section 7. In 1.3, 2.8, and 2.9 there can be a link
[CSS2]. In 2.2, I would use XML 1.0 section 2.12 as the authority
with a [XML] link to your normative references.

Proposed disposition: Accepted

All non-SSML links within the document now point to an item in the References section which itself provides the official external reference and link.

Email Trail:



Issue SSCR125-3

From Susan Lesch

(3) These documents could all be normative references rather than links
in the running text; I may have missed some:
Speech Synthesis Markup Requirements for Voice Markup Languages
Cascading Style Sheets, level 2
RFC 1766
Extensible Markup Language (XML) 1.0 (Second Edition)
Namespaces in XML
International Phonetic Alphabet
ASCII Phonetic Symbols for the World's Languages: Worldbet
Computer-coding the IPA: a proposed extension of SAMPA

A good model is HTML at http://www.w3.org/TR/html401/references.html.
(The definition terms like [CSS2] can be black.)

Proposed disposition: Accepted

All non-SSML links within the document now point to an item in the References section which itself provides the official external reference and link.

Email Trail:



Issue SSCR125-4

From Susan Lesch

(4) Below, a section number is followed by a quote and then a suggestion.

Abstract
working group
Working Group
[For W3C entities, you can copy capitalization conventions from the
W3C Process document at http://www.w3.org/Consortium/Process/.]

web [twice]
Web

and etc.
etc.

1.
Standard
standard

1. par. 2
web
Web

and etc.
etc.

1.1 list item 2
Audio Cascading Style Sheets
aural Cascading Style Sheets

In 1.2, the items with hyphens for bullets could be unordered lists.
Also, in list item 5, "- Markup support" is br-separated unlike the
others.

1.2 list item 4
names; e.g.
names, e.g.

1.3 list 1 item 3
as many details
[does this mean "many details" or "as many details as are necessary"?]

1.3 list 2 item 2
Interoperability with Aural CSS:
Interoperability with aural CSS (ACSS):

Aural CSS-enabled
aural CSS-enabled

1.3 list 2 item 3
style-sheet [twice]
style sheet

2.3 last par.
an enclosing paragraph or sentence elements
an enclosing paragraph or sentence element

2.4 Time, Date and Measure Types last par.
separated by single, non-alphanumeric character.
separated by a single, non-alphanumeric character.

2.4 Address, Name, Net Types list item 2
internet
Internet

2.4 third example
acme.com is a registered domain. W3C recommends using example.com,
example.org, or example.net which IANA has reserved for examples.
Please see RFC 2606 section 3 at http://www.ietf.org/rfc/rfc2606.txt.

2.5 list item 2
Postscript
PostScript

2.6 list item 2 needs an ending period.

2.8 list item 2 and 2.9 duration
It follows the "Times" attribute format from the
Cascading Style Sheet Specification. e.g. "250ms", "3s".
could read [five changes here]:
It follows the "time" attribute format from the
Cascading Style Sheet Level 2 Recommendation [CSS2],
e.g. "250ms", "3s".

2.9 Relative values
SSML
[The acronym would work fine throughout the spec. I would use it
in parentheses after the first occurrence of "Speech Synthesis
Markup Language" in the Abstract or section 1 and thereafter.]

2.9 Pitch contour
attribute (absolute,
attribute; (absolute,

2.9 last word
minute.)
minute).

2.10 par. 1
mime-type
MIME type

3.1
Lernout and Hauspie Speech Products
Lernout & Hauspie Speech Products

3.3
dialog markup language
[Is this "Dialog Markup Language"?]

3.6 par. 1
string (markup
string; (markup

4.
Informative.
informative.

4. second example
Lee
Berners-Lee

<audio src="http://www.w3c.org/music.wav">
<audio src="http://w3c.example.org/music.wav">

5.
Normative.
normative.

The second paragraph needs an ending period.

5.1 list item 1
(relative to XML) is well-formed.
is well-formed XML [XML section 2.1]

5.1 list item 2
is a valid XML document
is a valid XML document [XML section 2.8]

5.3 par. 2
XML Namespaces.
Namespaces in XML.

6.
Normative.
normative.

7. Informative
http://www.voicexml.com/ [twice]
http://www.voicexml.org/

Proposed disposition: Accepted with changes

Most of your changes were accepted verbatim. For the remaining cases, the problem you identified was corrected via other text.

Email Trail:



Issue SSCR126-1

From Alex Monaghan

It is not clear who the intended users of this markup language are. There are two obvious types of possible users: speech synthesis system developers, and application developers. The former may well be concerned with low-level details of timing, pitch and pronunciation, and be able to specify these details (F0 targets, phonetic transcriptions, pause durations, etc.). The latter group are much more likely to be concerned with specifying higher-level notions such as levels of boundary, degrees of emphasis, fast vs slow speech rate, and formal vs casual pronunciation. The proposal appears to be aimed at both groups, but no indication is given as to which aspects of the markup language are intended for which group.

Distinguish clearly between tags intended for speech synthesis developers and tags intended for application designers.

Proposed disposition: Rejected

We believe that all the tags are appropriate for and needed by application developers. Commercial deployments of SSML so far appear to have borne out this conclusion.

Email Trail:



Issue SSCR126-2

From Alex Monaghan

It is clear that the proposal includes two, and in some cases three, different levels of markup. For F0, for instance, there is the <emphasis> tag (which would be realised as a pitch excursion in most systems), the <prosody contour> tag which allows finer control, and the low-level <pitch> tag which is proposed as a future extension. There is very little indication of best practice in the use of these different levels (e.g. which type of user should use which level), and no explanation of what should happen if the different levels are combined (e.g. a <pitch contour> specification inside an <emphasis> environment).

Clarify the intended resolution of conflicts between high-level and low-level markup, or explain the dangers of using both types in the same document. This would be simpler if there were two distinct levels of markup.

Proposed disposition: Accepted

This is an excellent point. We will note the dangers as you suggest. We will also note that although the behaviors of the individual elements are specified, details about how conflicts are resolved are implementation specific.
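
[Ed: a hypothetical fragment mixing the two levels, of the kind this response cautions about. How the contour interacts with the emphasis is implementation specific.]

<emphasis level="strong">
  <prosody contour="(0%,+20%)(100%,-10%)">
    conflicting high-level and low-level instructions
  </prosody>
</emphasis>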

Email Trail:



Issue SSCR126-3

From Alex Monaghan

We strongly suggest that some distinction between high-level markup (specifying the function or structure of the input) and low-level markup (specifying the form of the output) be introduced, ideally by providing two explicit markup sublanguages. The users of these sublanguages are unlikely to overlap. Moreover, while most synthesisers might support one level of markup or the other, there are currently very few synthesisers which could support both.

Perhaps two separate markup languages (high-level and low-level) should be specified. This would have the desirable side-effect of allowing a synthesiser to comply with only one level of markup, depending on the intended users.

Proposed disposition: Rejected

There are certainly complete implementations of SSML today that implement both high and low level tags. This separation is something we will consider for a later version of SSML (beyond 1.0). For this specification we will add a note that although the tags themselves may be supported, details of the interactions between the two levels are implementation specific. We will encourage developers to use caution in mixing them arbitrarily.

Email Trail:



Issue SSCR126-4

From Alex Monaghan

The notion of "non-markup behavior" is confusing. On the one hand, there seems to be an assumption that markup will not affect the behaviour of the system outside the tags, and that the markup therefore complements the system's unmarked performance, but on the other hand there are references to "over-riding" the system's default behaviour. In general, it is unclear whether markup is intended to be superimposed on the default behaviour or to provide information which modifies that behaviour. The use of the <break> element, for instance, is apparently intended "to override the typical automatic behavior", but the insertion of a <break> tag may have non-local repercussions which are very hard to predict. Take a system which assigns prosodic boundaries stochastically, and attempts to balance the number and length of units at each prosodic level. The "non-markup behavior" of such a system might take the input "Big fat cigars, lots of money." and produce two balanced units: but will the input "Big fat <break/> cigars, lots of money." produce three unbalanced units (big fat, cigars, lots of money), or three more balanced units (big fat, cigars lots, of money), or four balanced units (big fat, cigars, lots of, money), or six single-word units, or something else? Which would be the correct interpretation of the markup?

Clarify the intended effect of tags on the default behaviour of synthesis systems. Should they be processed BEFORE the system performs its "non-markup behavior", or AFTER the default output has been calculated? Does this vary depending on the tag? Again, this may be resolved by introducing two distinct levels of markup.

Proposed disposition: Accepted with changes

This is a good point. As you surmised, the behavior does vary depending on the tag, largely because the processor has the ultimate authority to ensure that what it produces is pronounceable (and ideally intelligible). In general the markup provides a way for the author to make prosodic and other information available to the processor, typically information the processor would be unable to acquire on its own. It is up to the processor to determine whether and in what way to use the information.
We will provide some additional text to clarify this behavior.

Email Trail:



Issue SSCR126-5

From Alex Monaghan

Many of the tags related to F0 presuppose that pitch is represented as a linear sequence of targets. This is the case for some synthesisers, particularly those using theories of intonation based on the work of Bruce, Ladd or Pierrehumbert. However, the equally well-known Fujisaki approach is also commonly used in synthesis systems, as are techniques involving the concatenation of natural or stylised F0 contours: in these approaches, notions such as pitch targets, baselines and ranges have very different meanings and in some cases no meaning at all. The current proposal is thus far from theory-neutral, and is not implementable in many current synthesisers.

Revise the F0 tags to allow for theory-neutral interpretation: if this is not done, the goal of interoperability across synthesis platforms cannot be achieved.

Proposed disposition: Rejected

It is outside the scope of this group to design a theory-neutral approach. We are not aware of the existence of such an approach, and so far in commercial systems we have seen considerable support for the current approach. There is also no requirement within the specification that any of the theories you mention be used in implementation. Rather, F0 variation is expressed in terms of pitch targets but can be mapped into any underlying model the processor wishes.

Email Trail:



Issue SSCR126-6

From Alex Monaghan

There is no provision for local or language-specific additions, such as different classes of abbreviations (e.g. the distinction between a true acronym such as DEC and an abbreviation such as NEC), different types of numbers (animate versus inanimate in many languages), or the prosodic systems of tone languages. Some specific examples are discussed below, but provision for anything other than English is minimal in the current proposal. As compliant systems extend their language coverage, they should be able to add the required markup in a standard way, even if it has not been foreseen by the W3C.

Provide a mechanism for extending the standard to include unforeseen cases, particularly language-specific or multilingual requirements.

Proposed disposition: Rejected

It is difficult, if not impossible, to incorporate a generic mechanism that will work for all of the language features you're describing, in addition to unforeseen features, in a standard manner. It may be possible to have extensions to the specification later on as we discover standardized ways to provide the information you suggest. We welcome your input for such future extensions.

Email Trail:



Issue SSCR126-7

From Alex Monaghan

<say-as>: Several categories could be added to this tag, including credit card numbers (normally read in groups) and the distinction between acronyms (DEC, DARPA, NASA) and letter-by-letter abbreviations (USA, IBM, UK).

Add the categories mentioned above.

Proposed disposition: Rejected

These are good suggestions. However, we have removed all attribute values and their definitions from the <say-as> element. To avoid inappropriate assumptions about what is specified, we will also be removing the examples from the <say-as> section. We expect to begin work on specifying the details of the <say-as> element when SSML 1.0 reaches the Candidate Recommendation stage. We will consider your suggestions at that time.

Email Trail:



Issue SSCR126-8

From Alex Monaghan

In languages with well-developed morphology, such as Finnish or Spanish, the pronunciation of numbers and abbreviations depends not only on whether they are ordinal or cardinal but also on their gender, case and even semantic properties. These are often not explicit, or even predictable, from the text. It would be advisable to extend the <say-as> tag to include an optional attribute to hold such information.

Proposed disposition: Rejected

We are aware of this issue and have considered it again in response to your input, but we are not prepared to address it at this time. As you point out, there is broad variability in the categories and structure of this information. The <say-as> element is only designed to indicate simple structure for cases where the synthesis processor is unable to determine it on its own. Where large amounts of context-dependent information would be required in order to adequately inform the processor, we would recommend not using the <say-as> element at all. Rather, we recommend that numbers and abbreviations be instead written out orthographically, as is possible with any text over which the application writer wishes absolute control.

Email Trail:



Issue SSCR126-9

From Alex Monaghan

<voice> element: It seems unnecessary to reset all prosodic aspects to their defaults when the voice changes. This prevents the natural-sounding incorporation of direct speech using a different voice, and also makes the reading of bilingual texts (common in Switzerland, Eastern Europe, the Southern USA, and other exotic places) very awkward. Although absolute values cannot be carried over from voice to voice, it should be possible to transfer relative values (slow/fast, high/medium/low, etc.) quite easily.

Allow the option of retaining relative prosodic attributes (pitch, rate, etc.) when the voice is changed.

Proposed disposition: Accepted with changes

We agree in principle with your suggestion. We will remove the contentious paragraph and replace it with one explaining that
  • relative changes in prosodic parameters are expected to be carried across voice changes, but
  • different voices have different natural defaults for pitch, speaking rate, etc. because they represent different personalities, so
  • absolute values of the prosodic parameters may vary across changes in the voice.
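
[Ed: a sketch of the intended behavior described in the list above; the voice name is hypothetical.]

<prosody rate="slow">
  This is spoken slowly.
  <voice name="Anna">
    The relative value "slow" carries across the voice change,
    applied to this voice's own default rate.
  </voice>
</prosody>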

Email Trail:



Issue SSCR126-10

From Alex Monaghan

<break> element: Some languages have a need for more levels of prosodic boundary below a minor pause, and some applications may require boundaries above the paragraph level. It would be advisable to add an optional "special" value for these cases.

Add an optional "special" attribute to allow language-specific and application-specific extensions.

Proposed disposition: Rejected

This is a good suggestion, but it is too extensive to add to the specification at this time. This feature will be deferred to the next version of SSML.

Email Trail:



Issue SSCR126-11

From Alex Monaghan

<prosody> element: There is currently no provision for languages with lexical tone. These include many commercially important languages (e.g. Chinese, Swedish, Norwegian), as well as most of the other languages of the world. Although tone can be specified in a full IPA transcription, the ability to specify tone alongside the orthography would be very useful.

Add an optional "tone" attribute.

Proposed disposition:

It is unclear how you would expect this to work. As you point out, this can be specified in full IPA, which is possible with the phoneme element today.
How would you envision specifying tone *alongside* the orthography?
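
[Ed: for context, a sketch of tone conveyed through a full IPA transcription in the phoneme element, as the response mentions. The Mandarin example and its IPA tone letter are illustrative.]

<phoneme alphabet="ipa" ph="ma˥">ma</phoneme>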

Email Trail:



Issue SSCR126-12

From Alex Monaghan

<rate> element: There is currently no unit of measurement for this tag. The "Words per minute" values suggested in the previous draft were at least a readily understandable measure of approximate speech rate. If their approximate nature were made explicit, these could function as indicative values and would be implementable in all synthesisers.

Proposed disposition: Rejected

Because of the difficulty in accurately defining the meaning of words per minute, syllables per minute, or phonemes per minute across all possible languages, we have decided to replace such specification with a number that acts as a multiplier of the default rate. For example, a value of 1 means a speaking rate equal to the default rate, a value of 2 means a speaking rate twice the default rate, and a value of 0.5 means a speaking rate of half the default rate. The default rate is processor-specific and will usually vary across both languages and voices. Percentage changes relative to the current rate are still permitted. Note that the effect of setting a specific words per minute rate (for languages for which that makes sense) can be achieved by explicitly setting the duration for the contained text via the duration attribute of the <prosody> element. The duration attribute can be used in this way for all languages and is therefore the preferred way of precisely controlling the rate of speech when that is desired.
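
[Ed: a sketch of the two mechanisms this response describes, using the multiplier scheme and the duration attribute of <prosody>.]

<!-- twice the processor's default rate -->
<prosody rate="2">a fast sentence</prosody>

<!-- precise control: the contained text should take three seconds -->
<prosody duration="3s">exactly three seconds of speech</prosody>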

Email Trail:



Issue SSCR126-13

From Alex Monaghan

<rate> element: It is equally important to be able to specify the dynamics of speech rate - accelerations, decelerations, constancies. These are not mentioned in the current proposal.

Proposed disposition: Rejected

These are good suggestions, but they are too extensive to add to the specification at this time. These features will be deferred to the next version of SSML.

Email Trail:



Issue SSCR126-14

From Alex Monaghan

<audio> element: Multimodal systems (e.g. animations) are likely to require precise synchronisation of audio, images and other resources. This may be beyond the scope of the proposed standard, but could be included in the <lowlevel> tag.

Consider a <lowlevel> extension to allow synchronisation of speech with other resources.

Proposed disposition: Rejected

As you suggest, this class of additions is outside the scope of the specification. We think it likely that other specifications such as SMIL would be more appropriate for this functionality. To the best of our knowledge, there are no major technical problems with integration of SMIL and SSML functionality.

Email Trail:



Issue SSCR127-1

From Dave Pawson

Firstly, a little background.
We have been using Text to Speech for about 18 months,
to produce alternative media for visually impaired customers.
We have learned over that time just what type of material
is suitable.
Our needs are:
XML source.
Ability to insert external audio files into the audio stream
(audible navigation points, tone bursts at 55 Hz which are
findable when the tape is played fast-forward).
Ability to add to a dictionary / word set those words which
the synth gets wrong.
Ability to id and have spoken correctly standard items such
as dates, acronyms etc.

Proposed disposition: Accepted/Rejected

SSML 1.0 is based on XML.
It is possible to insert external audio files into the audio stream using the <audio> element.
It is possible, via the <lexicon> element, to add to a lexicon those words which the synth gets wrong.
We have removed the specification for interpretation hints for dates, etc. (part of the <say-as> element) but intend to reactivate that work as a separate activity when SSML 1.0 reaches the Candidate Recommendation stage. We will consider your suggestion "Ability to id and have spoken correctly standard items such as dates, acronyms etc." at that time.
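
[Ed: a sketch of how these needs map onto SSML 1.0 elements; the URIs are placeholders.]

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
 xml:lang="en-GB">
  <!-- corrections for words the synthesizer gets wrong -->
  <lexicon uri="http://example.org/custom-words.xml"/>
  <p>Chapter one.</p>
  <!-- audible navigation point inserted into the audio stream -->
  <audio src="http://example.org/tone-55hz.wav"/>
  <p>Body text follows.</p>
</speak>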

Email Trail:



Issue SSCR127-2

From Dave Pawson

We use silences to good effect, as user research has shown.
I'd love to see <break time="2S"/>

Proposed disposition: Accepted

This capability is in the most recent draft of the specification.

Email Trail:



Issue SSCR127-3

From Dave Pawson

[Ed: this is with regard to the <prosody> rate attribute]
Provide a rate of 1 to 100, let the synth people interpret that for their engines, and users select appropriately by experiment.

Proposed disposition: Rejected

Because of the difficulty in accurately defining the meaning of words per minute, syllables per minute, or phonemes per minute across all possible languages, we have decided to replace such specification with a number that acts as a multiplier of the default rate. For example, a value of 1 means a speaking rate equal to the default rate, a value of 2 means a speaking rate twice the default rate, and a value of 0.5 means a speaking rate of half the default rate. The default rate is processor-specific and will usually vary across both languages and voices. Percentage changes relative to the current rate are still permitted. Note that the effect of setting a specific words per minute rate (for languages for which that makes sense) can be achieved by explicitly setting the duration for the contained text via the duration attribute of the <prosody> element. The duration attribute can be used in this way for all languages and is therefore the preferred way of precisely controlling the rate of speech when that is desired.
This approach differs notably from your suggestion in that there is no maximum rate value. If this particular feature (maximum rate value) is important for you, could you provide some sample use cases?

Email Trail:



Issue SSCR128-1

From Dave Pawson

A command line application.
application xml-file, output audio file.
(rather than being buried in some app I haven't a clue about).
? Off topic I suppose :-)

Proposed disposition: Rejected

This is out of scope for the language, although nothing prevents the use of an SSML processor to send its generated audio stream to a file.

Email Trail:



Issue SSCR128-2

From Dave Pawson

2. That the tts system be able to analyse a document and tell me that it has never heard of word xxxx, rather than creating a 40-minute recording, only to find that after 35 minutes a word has to be added to its dictionary and the whole job done again. That's a real bummer.

Even if it pulled them, pronounced each one so that we could check that it was a good guess/risible would help.

Or it pulled them, marked them up and exported them as XML for that level of checking!

Proposed disposition: Rejected

These suggestions are all for the processor and not the language. As such, they are out of scope.

Email Trail:



Issue SSCR129-1

From Dave Pawson

1.2, list item 4, para 3.
"TTS systems are expert at performing text-topohoneme conversions
so most words of most documents can be handled automatically".
Rather too sweeping for my liking. Certainly not the case for
the systems I've seen :-)

Analysis:
The VBWG asked for a specific suggestion, which Dave then provided.

Proposed disposition: Accepted with changes

We believe there is a misunderstanding that is simple to correct. There is already an ability in the specification to adjust pronunciation both internally via the phoneme element and externally via a lexicon. We agree there are times when one needs a lexicon. By placing better pronunciations for words in an external lexicon, the processor will automatically use the values in the lexicon over its own defaults without any additional markup (except for the single use of the <lexicon> element at the top of the document that points to the lexicon definition file). We also agree that the specification wording you quote unintentionally implies a claim about the quality of today's synthesis technology. To correct this, we will change "are expert at performing" to "are designed to perform".
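
[Ed: for illustration, the inline mechanism mentioned in this response; the word and IPA string are placeholders.]

The word <phoneme alphabet="ipa" ph="təˈmɑːtoʊ">tomato</phoneme> uses the supplied pronunciation.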

Email Trail:



Issue SSCR129-2

From Dave Pawson

2.4 Sub attribute.

A nice feature for a user would be to permit these to be collated
externally, and passed in as a sort of configuration file.

It would save typing for regularly repeated occurrences.
<sub>
<el>W3C</el>
<use>World Wide Web Consortium</use>
</sub>

or something similar?

Proposed disposition: Rejected

This request is similar to some earlier work by the Voice Browser Working Group on a standardized lexicon format (containing pronunciations for tokens and phrases). Your request is one that might best be considered for that effort if and when it re-activates. We encourage you to resubmit this request to the Working Group at that time.
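
[Ed: for reference, the existing inline substitution that such a configuration file would collate; this element is in the current draft.]

<sub alias="World Wide Web Consortium">W3C</sub>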

Email Trail:



Issue SSCR129-3

From Dave Pawson

2.8 break element.
A refinement on this would be the ability to explicitly state
the required duration for various punctuation elements and other
break types (paragraph, sentence).

Again suggest this be externally configurable, for re-use optimisation.

Proposed disposition: Rejected

This concept has been considered but rejected as part of SSML 1.0. Rather, we encourage the use of style sheets or transformations to enable this macro-like behavior. It is possible that future versions of SSML beyond version 1.0 could permit default value setting for items such as paragraph and sentence prosody, but this kind of manipulation today is discouraged by most commercial synthesis engine developers on anything other than the occasional basis enabled by the <break> element.

Email Trail:



Issue SSCR129-4

From Dave Pawson

2.10 Usage note 1. Could be confusion between this and 3.2.
If the default is to pause conversion till the audio is complete,
then it should be explicitly stated here. I support that requirement btw.

Proposed disposition: Accepted

We have removed the Future Study text from the document. Playback of recorded audio occurs in sequence with preceding and following synthesis, matching what you prefer. To obtain background playing, mixing, etc. we would recommend using SMIL.

Email Trail:



Issue SSCR129-5

From Dave Pawson

2.12 usage note. Why hasn't a namespace been explicitly called up?
This would then nullify the requirement stated in 5.1 (I can't see
any need for that requirement. Is it justifiable?)

Proposed disposition: Accepted

The most recent draft of the specification contains a namespace definition and more careful conformance language with respect to non-standard extensions.

Email Trail:



Issue SSCR129-6

From Dave Pawson

3.6 Value.

I suspect that the overall impact of this may be achieved by a simple
XSLT transform anyway, which may make this redundant?

Proposed disposition: Accepted

We have removed this text from the specification. Such functionality is expected to be achieved through the use of style sheets (ACSS/XSLT), as you suggest.

Email Trail:



Issue SSCR130-1

From Alex Monaghan

Subject: SSML specification: <say-as type="date:dml">

I don't think anyone has mentioned this before - apologies if I'm wrong! For some reason, "dm" has been omitted as a format value for the "date" value of the "type" attribute of the "say-as" element. Still with me? This is probably the most common way of writing a date in most European languages: a meeting on 13/5, a conference from 30/11 to 3/12, etc. I assume there's no good reason not to include it - can it be added, please?

Proposed disposition: Rejected

This value has been discussed within the Working Group. It will be considered when the group resumes definition of the values for the <say-as> attributes, currently planned to begin when the SSML 1.0 specification reaches the Candidate Recommendation stage.

Email Trail:



Issue SSCR131-1

From Adhemar Vandamme

Why does the SSML need a mark-tag with name-attribute to place a marker
into the text/tag sequence and contain text that is used to reference a
special sequence of tags and text, either for internal reference within
the SSML document, or externally by another document?

Can someone explain to me why this can't be done with an id-attribute in
an arbitrary tag, like in many other XML specifications (e.g. XHTML)?

If the text that should be referenced is not yet enclosed in a tag, I
suggest using a span-tag, for consistency with XHTML.

I give an example:

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
 xml:lang="en-US">
   <p id="marks">
      <s>
         We would like
         <span id="congrats">
            to extend our warmest congratulations
         </span>
         to the members of the Voice Browser Working Group!
      </s>
      <s id="really">
         Really, we would.
      </s>
   </p>
   <p>
      <s>
         Go from <span id="here" /> here, to <span id="there" /> there!
      </s>
   </p>
</speak>

Herein, a full paragraph, a part of a sentence, a full sentence and two
"moments" are marked, using an id-attribute in a p-tag or an s-tag when
available, and in a span-tag otherwise.

Proposed disposition: Rejected

Upon consideration it has become even clearer to us that mark labels should not even be xsd:IDs. As an example, the uniqueness constraints of IDs are a hindrance rather than a benefit; e.g., it may well be desirable to repeatedly use the same mark label (equivalent to repeatedly sending back the same event). We also wish to permit integer labels, for example. Because of this desire to have fewer restrictions than those introduced by IDs, we have decided to change the name attribute to be of type xsd:token.
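
[Ed: a sketch of what xsd:token permits that xsd:ID would not: repeated and purely numeric mark labels.]

Go from <mark name="42"/> here, to <mark name="42"/> there,
and <mark name="42"/> back again.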

Email Trail:



Issue SSCR132-1

From Sobia Mahmud

1. Firstly, in the voice tag, what do the name and variant attributes do? Are they an essential part of the voice tag?

Proposed disposition: N/A

The name and variant attributes are indeed essential. There are many criteria that can be used to select a specific voice, more than we wish to standardize. For application writers who have precise requirements as to the voices to use, the name and variant attributes provide an ability to be as precise as the underlying processor requires. As an example, one might use the name attribute to select a specific named voice, e.g. "BobSmith1".
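
[Ed: a sketch of selection by feature versus selection by name; the voice name is hypothetical.]

<voice gender="female" age="30">Selected by features.</voice>
<voice name="BobSmith1" variant="2">Selected precisely by name and variant.</voice>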

Email Trail:



Issue SSCR132-2

From Sobia Mahmud

2. In the prosody tag, it is explained that the pitch contour comprises the interval and the target. What input is the user expected to provide in order to define the pitch contour?

Proposed disposition: N/A

The subsection "Pitch Contour" in Section 2.2.4 describes the format for contour specifications. Is there a particular part of this description that you find incomplete?

Email Trail:



Issue SSCR133-1

From Max Froumentin

1. Why is the xml declaration mandatory? This goes against the XML
conformance rules, and it means that a standard XML parser could
not be used as it would accept the absence of a declaration. Since
this is mentioned twice, I imagine that the WG had a good reason
to do so, and it would be nice to find why in the spec.

1.5. Similarly, why is the SSML namespace declaration mandatory?

1.6 Section 3.1 seems to mandate the use of xsi as the prefix of
schemaLocation.

Proposed disposition: Accepted/Question

1. The xml declaration is not intended to be mandatory. We will correct the error.
1.5 We do not understand your concern with the SSML namespace declaration. Can you elaborate?
1.6 The section 3.1 text regarding the prefix for schemaLocation will be changed to permit any prefix to be defined for the Schema schema.

Email Trail:



Issue SSCR133-2

From Max Froumentin

2. Why do all the examples link to the schema? It makes them
less easy to read, and gives the impression that schemaLocation
is mandatory.

Proposed disposition: Rejected

We have received comments from other reviewers that our examples should be complete stand-alone documents. As a result, the Voice Browser Working Group has taken the following position with respect to all of its specifications:
We recommend, but do not require, the use of schema. For that reason, our examples all contain references to the SSML schema.
We will clarify this in the specification.
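
[Ed: for reference, the stand-alone example form the specification's examples follow; the schema location shown is illustrative.]

<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
 xml:lang="en-US">
  Hello, world.
</speak>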

Email Trail:



Issue SSCR133-3

From Max Froumentin

3. I have trouble understanding this, in 2.1.5: "It is an error if a
value for alphabet is specified that is not known or cannot be
applied by an SSML processor.", where "error" is defined as a violation
of the spec.

The text above indicates that values other than 'ipa' are allowed
for alphabet, so this would mean that if a processor doesn't
understand the value "xyz" (which an SSML producer has just come up
with), then the processor violates the spec?

Analysis:
VBWG responded that the SSML document violates the spec in this case and that error reporting is not required.
Max replied:

So an SSML document can violate the specification because it has a value for 'alphabet' that is not supported by a given processor but works in another? I would instead say that a conformant processor may not support a given alphabet but must report an error. Maybe the QA people at W3C could help clarify.

Note that in 2.1.2 'conformant' is used. It should be 'conforming'.

Proposed disposition: N/A

We propose to make the following changes:
  • drop the text ", where "error" is defined as a violation of the spec." in 2.1.5 and just make sure that the word "error" earlier in the sentence is linked to our definition of error
  • remove the text "A violation of the rules of this specification;" in the definition of error in section 1.5.

Email Trail:



Issue SSCR133-4

From Max Froumentin

4. in 2.2.1, the age attribute is defined as being of type "integer".
That should be a positive integer.

The style used for '(integer)' seems to indicate a formal reference
to a type. If it were, this would be more accurately described as
XML Schema's nonNegativeInteger. Ditto for the variant attribute, which would
have to refer to xsd:integer.

Analysis:
The VBWG agreed that type definitions should be more precise.
Max rejected the response as being inadequate.

Proposed disposition: Accepted with changes

We accept the point you're making and will add some text that more precisely identifies the type, although it may not be exactly the text you gave.

Email Trail:



Issue SSCR133-5

From Max Froumentin

5. "Durations follow the "Times" attribute format from the [CSS2]
specification". I think this should be phrased as: "Durations
follow the <time> basic data type from the [CSS2] specification".

Proposed disposition: Accepted

We will correct this.

Email Trail:



Issue SSCR133-6

From Max Froumentin

6. The definition of number in 2.2.4
"A number is a simple floating point value without exponentials."
insert 'positive'. (sorry to be pedantic ;-)

Proposed disposition: Accepted

We will make this change.

Email Trail:



Issue SSCR133-7

From Max Froumentin

7. The name of the <mark> element seems like an attribute of type ID.
Why not define it as such (see XML 1.0)? This would give you the
extra check (from the XML parser) that a name must not appear more
than once.

Proposed disposition: Rejected

We have decided to change the type to xsd:token because we did not want the syntax and uniqueness restrictions imposed by an ID.
It will be defined as a token in the spec.

Email Trail:



Issue SSCR133-8

From Max Froumentin

8. desc seems to be the only element where no examples are shown.

Proposed disposition: Accepted

We will add an example.
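
[Ed: a sketch of the kind of example that could be added; the file name is hypothetical.]

<audio src="http://example.org/door.wav">
  <desc>sound of a door slamming</desc>
  The door slammed shut.
</audio>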

Email Trail:



Issue SSCR133-9

From Max Froumentin

9. the 5th paragraph of 3.1 "It is recommended ..." ends with a ':'

Proposed disposition: Accepted

We will correct this.

Email Trail:



Issue SSCR133-10

From Max Froumentin

10. Stand-Alone documents. What is the difference between that
and xml standalone documents?

Analysis:
The VBWG answered Max's question.
Max would like this distinction made plainer in the specification and would also like a clearer specification of what a non-stand-alone ssml document is.

Proposed disposition: N/A

We will remove the line requiring synthesis fragments to be well-formed XML documents.

Email Trail:



Issue SSCR134-1

From Dave Pawson

1. Generating a silence.
Although this could be done using an 'empty' external file,
the cleanliness of the generated silence is rarely as good as
that of an automatically generated one.
Rationale: For later, automatic processing of synthetic speech,
usually for alignment with text.

Analysis:
Dave replied to our disposition. It is our belief that his conditions for acceptance are met by the break element as we envision it.

Proposed disposition: Rejected

If we understand your suggestion correctly, this capability is already present in the specification via the break element. Can you either indicate what you need that the break element does not provide or further explain your suggestion?
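
For example (a sketch; the five-second duration is our own), a processor-generated silence for alignment purposes can be requested directly:

     First phrase for alignment. <break time="5s"/> Second phrase.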

Email Trail:



Issue SSCR134-2

From Dave Pawson

2. Re the external 'words' file.
Although the lexicon has been included, we have found that our lexicon has grown to some hundreds of words. It is tedious to have to repeatedly enter the <say-as> content each time. If we could refer to the lexicon, effectively saying <lexit>Word-to-be-pronounced-differently</lexit>, i.e. please use the pronunciation I told you about last time, then this would save labour. (I'm not assuming a lexicon available to the synth externally, which could be a viable alternative, since Laureate is the only one I've used which had a comprehensive lexicon facility.)

So yes, please allow not the <lexicon> element, but some reference to it, elsewhere.

Analysis:

VBWG rejected this.

Dave disputed the response:

I do not accept your response re the lexicon.

As I state below, and re-iterate, the work needed to mark up individual words in a text, knowing that the tts engine gets it wrong every time, is wasted effort when the words can be collated in an external file and read by the engine prior to speaking the text.

This is a hard requirement for our SSML usage, which is accessibility based. I would also appreciate a standard-format lexicon, though this is a nice-to-have rather than a hard requirement. Equally, SAMPA is preferable to IPA, to enable ordinary users to type in how a word should be spoken. Again this is based on reducing workload: a normal keyboard character as opposed to a Unicode character entity.

Proposed disposition: Accepted/Rejected

There will be no standard lexicon format for this draft. Regarding SAMPA/IPA, the choice of IPA expressed in Unicode provides the broadest acoustic expression in the most widely implemented way. All synthesis engine vendors today use different internal alphabets, and SAMPA was not believed to provide the breadth and uniformity of description available via IPA. There are online SAMPA-to-IPA converters available for those already familiar with SAMPA as an alphabet.

Email Trail:



Issue SSCR135-1

From Dan Brickley

http://www.w3.org/TR/speech-synthesis/#S3.4

...has an error:
"RDF" expands to "Resource Description Framework" not "...Format".

Proposed disposition: Accepted

We will correct this.

Email Trail:



Issue SSCR135-2

From Dan Brickley

I have cc:'d the Semantic Web Coordination Group since there may be a more general issue here w.r.t. the way your spec references RDF. W3C has a new batch of clarified/bugfixed RDF specs approaching Last Call. We should take care to note that your spec references RDF, and try to offer suggestions for referencing this work rather than the older efforts currently cited.

Proposed disposition: N/A

We would appreciate any specific suggestions you have for more appropriate referencing of the RDF specifications for any of the Voice Browser Working Group's specifications (SSML, VoiceXML, SRGS).

Email Trail:



Issue SSCR136-1

From Susan Lesch

In the embedded CSS, for each occurrence of font-family: mono; use
"monospace" instead.

Results of
http://www.w3.org/TR/2002/WD-speech-synthesis-20021202/,spell gives
lots of false positives but it did find these:

s/estabilish/establish/
s/sucessfully/successfully/
s/behaviour/behavior/

In 3.2.1, one [RFC2396] -> RFC 2396

Some of the authors/editors/publishers names in the references section
are followed by a comma, one nothing, and some by a period. I'd make
them match.

s/The W3C Standard/This W3C standard/
s/'anyURI '/'anyURI'/
s/whitespace/white space/ (see http://www.w3.org/TR/REC-xml#sec-common-syn)
s/meta data/metadata/
s/namespaces in XML/Namespaces in XML/
s/members of the W3C Voice Browser Working Group/participants in
the W3C Voice Browser Working Group/
s/mime type/MIME type/

We've changed the preferred spelling
s/Acknowledgements/Acknowledgments/ (my error)

It isn't necessary to capitalize normative and informative in the
appendixes' section labels.

Proposed disposition: Accepted

We will fix these.

Email Trail:



Issue SSCR136-2

From Susan Lesch

In the first table in 2.1.4, did you mean to spell out an example for
'telephone'?

Analysis:

VBWG asked for clarification on the question and noted that the examples in this section were expected to be removed altogether in the next draft.

Susan clarified her question:

Sorry I wasn't clear. The other "number" values show pronunciation:

     <say-as interpret-as="number" format="ordinal">5</say-as> : fifth

I just wondered if you meant to do that for telephone as well:

     <say-as interpret-as="number"
       format="telephone">123-456-7890</say-as>

Proposed disposition: N/A

We would like to do something like what you suggest in a version beyond SSML 1.0. For this version we will actually be removing all of the examples to avoid confusion about what is being standardized.

Email Trail:



Issue SSCR136-3

From Susan Lesch

Can "concatenative-type synthetic speech systems" be simplified as
"concatenating synthetic speech systems"?

Proposed disposition: Rejected

Although your suggested wording reads somewhat better, it does not have quite the same meaning as the original wording. We will leave the original wording in place.

Email Trail:



Issue SSCR136-4

From Susan Lesch

Did you mean "Tim Berners-Lee's" or someone else, "Tim Lee's"?

Proposed disposition: Accepted

We will change the name altogether.

Email Trail:



Issue SSCR137-1

From Dave Pawson

Our use case (www.rnib.org.uk) is to produce globs of analogue audio on either CD or audio compact cassette for delivery to our customers.

Whichever medium, we need to be able to set it off playing and have the application stop when the 2 hours or 45 minutes is reached.

Could anyone say if this is possible using the mark element? Something along the lines of guessing the likely stopping places around the 45-minute point, then having a system event do something to wake the operator up / send an email or something.

Waiting around for 22 hours of audio to play out is kinda wasteful :-)
Any feedback appreciated.

Analysis:
VBWG stated this was out of scope for SSML itself. Dave replied:
If you deem it to be out of scope of SSML then fine.
I'm less convinced that SSML provides the triggers
to enable this functionality.

I take this as a use case that you are rejecting.
OK.

I would request that this use case be considered for
the requirements for the next issue.

Proposed disposition: N/A

As you request, we will consider this use case for the requirements for the next version of SSML.

Email Trail:



Issue SSCR138-1

From Marc Schroeder

This is a minor comment regarding the SSML <break> element (http://www.w3.org/TR/2002/WD-speech-synthesis-20021202/#S2.2.3), more specifically regarding the meaning of the attribute value "none" for the time attribute.

The specification currently defines:

     The value "none" indicates that a normal break boundary should be used.

This currently seems to make "none" a synonym of "default". Much more useful would be the possibility to explicitly forbid the occurrence of a break where the default rules of a TTS system would (erroneously) place one. I therefore suggest replacing the above sentence with:

     The value "none" indicates that no break boundary should occur.

Proposed disposition: Accepted with changes

We will be reintroducing the distinction between break strength and time as suggested by Alex Monaghan. The solution will also have the following characteristics:
  1. It will be possible to have a break of strength="none" to ensure that no break occurs.
  2. When only the strength attribute is present, the break time will be based on the processor's interpretation of the strength value.
  3. When only the time attribute is present, other prosodic strength indicators may change at the discretion of the processor.
  4. When neither attribute is present, the element will increase the break strength from its current value.

Email Trail:



Issue SSCR139-1

From Alex Monaghan

Firstly, I think Marc is quite right - <break time="none"> is now equivalent to the absence of a break tag, which is pointless.

Secondly, <break time="none"> SHOULD mean a prosodic break (end of intonational phrase, boundary tone, or whatever) but no pause. We seem to have lost the possibility of specifying the end of a multi-word prosodic chunk which is not marked by a pause, e.g. one which is marked solely by tonal and/or lengthening phenomena.

Thirdly, we have lost the ability to uncouple the strength of the break (between words, between phrases, between clauses, sentences, paragraphs, etc.) from the duration of the associated pause. Pausing is only one aspect of prosodic breaks, yet it is now being treated as if it were the only one. Also, in many applications there is a requirement to interleave synthesis with other audio: it can therefore be extremely useful to be able to specify a weak break (between words) with a long pause, or a strong break (between paragraphs) with a short pause. A couple of examples, in case this is not clear:

a) In an educational text about rainforest animals, you might wish to insert animal noises without interrupting the flow of the narration, as in "The giant hairy anteater has a bloodcurdling scream (medium-strength break, long pause for audio sample) but the roar of the okapi (very weak break, long pause for audio sample) is even more terrifying."

b) In a dialogue between a perky cartoon character and a bookish computer character, you might wish to have a paragraph read by the bookish character but let the perky character start speaking immediately after, as in "(bookish voice) ... and that is the reason why we know that the universe is banana-shaped. (full end-of-paragraph break, but only a very short pause) (perky voice) Yes, yes, all very interesting, but can I eat the banana now?!"

In conclusion, please bring back the distinction between break strength and associated pause time, so that they can be specified independently, and make <break time="none"> a default-strength break with no pause.

Proposed disposition: Accepted with changes

We will be reintroducing the distinction between break strength and time as you suggest. The solution will also have the following characteristics (a sketch follows the list):
  1. It will be possible to have a break of strength="none" to ensure that no break occurs.
  2. When only the strength attribute is present, the break time will be based on the processor's interpretation of the strength value.
  3. When only the time attribute is present, other prosodic strength indicators may change at the discretion of the processor.
  4. When neither attribute is present, the element will increase the break strength from its current value.
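
A sketch of your two examples under the reintroduced attributes (the strength value names and timings here are illustrative assumptions, pending the final draft text):

     ... a bloodcurdling scream <break strength="weak" time="3s"/> but the roar of the okapi ...

     ... the universe is banana-shaped. <break strength="strong" time="50ms"/> Yes, yes, ...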

Email Trail:



Issue SSCR140-1

From Andrew Thompson

[1] 2.1.6 Sub Element

Does the table presented in this section have unintentional duplicates? 
If not, it would be helpful to explain the difference between:

"interpret-as: number format: ordinal" and the later

"interpret-as: ordinal"

These seem to be two ways of specifying the same functionality.

Proposed disposition: N/A

Actually, this case in the examples in section 2.1.4 was to show that there are multiple ways this functionality might be specified (when specified at a later date). In any case, we will be removing these examples because they have led to confusion about whether or not the values of the attributes have been specified, which they haven't.

Email Trail:



Issue SSCR140-2

From Andrew Thompson

2.2.1 Voice Element

name attribute: No whitespace in the name seems overly restrictive - why not just comma-separate the list of names, as with font-face in CSS? The voice names are implementation dependent; therefore, if whitespace is not allowed, the SSML implementor will potentially have to map native voice names to SSML voice names, which seems to make SSML harder to use for developers (and possibly users).

Proposed disposition: Rejected

We chose space-separated tokens for consistency with other XML specifications. The NMTOKENS data type, for example, commonly used in the Document Type Definitions of XML-based specifications, is a space-separated list of NMTOKEN values. Using whitespace as the separator also simplifies XSLT style sheets operating on SSML.
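
For instance (the voice names here are hypothetical, processor-specific values), a space-separated list ordered by preference:

     <voice name="mike paul">Hello, world.</voice>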

Email Trail:



Issue SSCR140-3

From Andrew Thompson

2.2.1 Voice Element

variant attribute: Variant is defined as an integer. The spec states 
"eg, the second or next male child voice" but it does not specify how 
to express "next" as an integer. Would this be "+1" for next and "-1" 
for previous, or something else?

Proposed disposition: N/A

We will clarify the specification to indicate that only positive integers are expected (without pluses or minuses). We will also remove "or next", since relative specifiers were never intended for this attribute.
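
For example (a sketch; the attribute values are our own), the clarified attribute would select the second voice matching the other criteria:

     <voice gender="male" age="30" variant="2">Good morning.</voice>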

Email Trail:



Issue SSCR140-4

From Andrew Thompson

2.2.1 Voice Element

Relating to this point, in general I have found it useful to be able to ask for voices like this: "give me an adult male voice, which must not be the same as the current voice". This can be used to implement "barge-in" type functionality. It might be worthwhile considering adding another attribute, "exclude", in this fashion:

<voice gender="male" age="30" exclude="bruce, fred">

"current" could then be a special voice name:

<voice gender="male" age="30" exclude="current"> - give me any adult male voice so long as it's not the same as the current voice. This allows one to specify a similar voice in a more natural way than relying on the proposed "variant" attribute. The value of "variant" is a simple integer index and would be vendor-specific anyway. "Exclude" would also make sense if a future SSML spec defines some standard voice names with well-known characteristics.

Proposed disposition: Rejected

This is a great suggestion. We will be happy to consider this new feature for the next version of SSML (after 1.0).

Email Trail:



Issue SSCR140-5

From Andrew Thompson

2.2.3 Break element

time attribute: The value of "none" seems troublesome to me, if I read

<break time="none">

in a document, I would assume it meant "do not place a break between 
these elements" (break of length 0 seconds).
The spec defines 'The value "none" indicates that a normal break 
boundary should be used. The other five values indicate increasingly 
large break boundaries between words.'

I'd prefer <break time="default"> for this functionality. It seems more 
natural, and is more consistent with usage in 'section 2.2.4 prosody'. 
"none" could be retained, and mean "a short (ideally zero length) 
break", if the group feels engines can support that.

Proposed disposition: Accepted/Rejected

We will be reintroducing the distinction between break strength and time as suggested by Alex Monaghan. The solution will also have the following characteristics:
  1. It will be possible to have a break of strength="none" to ensure that no break occurs.
  2. When only the strength attribute is present, the break time will be based on the processor's interpretation of the strength value.
  3. When only the time attribute is present, other prosodic strength indicators may change at the discretion of the processor.
  4. When neither attribute is present, the element will increase the break strength from its current value.

Email Trail:



Issue SSCR140-6

From Andrew Thompson

3.3 Pronunciation Lexicon

On the question of element specific lexicons raised in the document, I 
note one could use say-as as a limited way of having element specific 
pronunciation, eg,
<say-as interpret-as="lexiconKey" lexicon="british.file">tomato</say-as>

Of course, this is really just another way of achieving what the
<phoneme> element does.

Proposed disposition: Rejected

As you pointed out, this specific use case can be accomplished via the <phoneme> element.

Email Trail:



Issue SSCR140-7

From Andrew Thompson

3.3 Pronunciation Lexicon

My general concern about element specific lexicons is the processing 
cost. eg, assume the document as a whole has a lexicon in use (A), and 
a sub element specifies a new lexicon (B). Presumably the synthesis 
engine must perform lookups as if (A) and (B) are merged,  overriding 
pronunciations which occur in A with those in B. It then needs to 
unload (B) when the element is exited. This sounds like it could prove 
too costly for a handheld device (PDA, Cellphone), and indeed, even a 
desktop system might struggle to change lexicon every other word.

At the very least I think this feature would have to be implemented 
with no more granularity than per <paragraph> element. <sentence> seems 
too fine grained.

Proposed disposition: N/A

Thank you for your feedback. After extensive discussion, we were unable to find sufficient use cases to warrant adding the lower-level lexicon functionality to the specification at this time.

Email Trail:



Issue SSCR140-8

From Andrew Thompson

Appendix A: Example SSML

The first example has:

<sentence>The step is from Stephanie Williams and arrived at 
<break/>3:45</sentence>

The time attribute is optional on <break>, but section 2.2.3 does not specify what the default value for the "time" attribute is when it is not specified. If the default value is "none" then the break used is the normal word break length, which is not what the example above implies; it implies something longer than a normal break. SEE ALSO my comment on <break> above.

Proposed disposition: N/A

See the response to point 5 (SSCR140-5) above.

Email Trail:



Issue SSCR141-1

From Al Gilman

1. Pronunciation Lexicon

The Voice Browser Working Group contemplates perhaps producing a format specification for a pronunciation lexicon document type which would be used with SSML and other formats.

Some of our applications depend on using [something like SSML] together with
a lexicon to reach an acceptable level of speech quality.

We look forward to the availability of that piece of the system.

You asked for comments as to whether it should be possible to attach lexicon references to elements other than the root element. Yes, this should be possible. We look forward to the use of lexicon support for not only pronunciation but also semantic interpretation. See for example
pronunciation but also semantic interpretation. See for example

Checkpoint 4.3 Annotate complex, abbreviated, or...
http://www.w3.org/TR/WCAG20/#provide-overview

Interoperable Language System (ILS)
http://www.ubaccess.com/ils.html

Block quotes of technical material would benefit from their own lexicon
bindings, particularly if in one SSML document there are such block
quotes from different disciplines.

Proposed disposition: Rejected

Thank you for your feedback on this issue. After extensive discussion, we were unable to find sufficient use cases to warrant the addition of lower-level lexicon changes to the specification at this time. Also, as we discussed with you in Boston in March, the addition of semantic information to the lexicon format is a topic best reserved for discussion when the standardized lexicon format discussion resumes in the Voice Browser Working Group. It will not be addressed as part of SSML. We would also encourage you to discuss your semantic information interests with our Semantic Interpretation subgroup to help both groups come to a better understanding of how our work might fit together.

Email Trail:



Issue SSCR141-2

From Al Gilman

2. [editorial] Use of the term "final form." Don't. It will just raise
more questions than it answers, it would appear.

Proposed disposition: Accepted

We will rephrase this.

Email Trail:



Issue SSCR141-3

From Al Gilman

3. VoiceXML took the 'audio' element from SSML. As a result of the Last Call review of VoiceXML 2.0, this element got changed a bit. Please bring the 'audio' element as used in SSML into agreement with the definition in VoiceXML 2.0, including the specification language defining and describing the 'desc' element.

http://www.w3.org/Voice/Group/2002/voicexml20-disposition.htm#R478-1

Proposed disposition: Rejected

We believe the audio element is up to date. In what way is it not?

Email Trail:



Issue SSCR141-4

From Al Gilman

4. Please consider the addition of a conformance clause defining a base
profile of voice adaptation capabilities, as required to be sure to produce
recognizable speech under all conditions of hearing impairment which can
readily or reasonably be worked around through the adjustment of speech
characteristics readily implemented in the speech synthesis engine. Compare
with parameters identified for user control in the User Agent Accessibility
Guidelines Checkpoints 4.9 through 4.13

http://www.w3.org/TR/UAAG10/guidelines.html#gl-user-control-styles

And that SSML processors will, for all languages that they support, follow the
xml:lang indications in the markup. Compare with:

http://www.w3.org/TR/WCAG10-TECHS/#tech-identify-changes

Proposed disposition: Accepted with changes

Regarding conformance language for SSML profiles: we have decided not to develop profiles for the first round of the Voice Browser Working Group's specifications. However, we understand the need motivating your request. It seems to us that much, if not all, of the functionality you might want could easily be achieved by a pre-processor. We would welcome an appendix describing how to use a pre-processor for such cases. For capabilities that cannot be achieved via a pre-processor, we would welcome help with a profile for the next version of SSML, and a set of authoring guidelines for SSML 1.0 today that we could include as either a separate appendix or as part of the "pre-processor" appendix mentioned above. In addition, we would strongly encourage you to send in requirements such as this now for use in the requirements phase for the successor to VoiceXML 2.0.

Regarding xml:lang indications: SSML processors are already required to follow the xml:lang indications in the markup for all languages that they support.

Email Trail:



Issue SSCR142-1

From WAI-PF

In view of the timing, this comment has not been discussed in the PF group. It is, however, based on a clash between the design of the "say-as" construct and the goals set out in the current draft of the XML Accessibility Guidelines (XAG), which has been discussed there quite a bit.

References:

"say-as" element
http://www.w3.org/TR/speech-synthesis/#S2.1.4

XAG Checkpoints 2 (have a model) and 4 (export the model):
http://www.w3.org/TR/xag#g2_0
http://www.w3.org/TR/xag#g4_0

The say-as element has attributes named "interpret-as" and "format". However, the format specification neither defines these in such a way as to create an interoperable information capture, nor does it require the user of these attributes to do so in user-provided declarations.

The examples given of ordinal numbers and telephone numbers, on the other hand, are clear examples of information elements with well-posed value domains and application semantics. This categorical information would be valuable in a variety of adaptation to meet the needs of people with disabilities, such as re-representing the information in modes other than speech.

A XAG-compliant dialect would ensure that all interpret-as and format values assigned to speak-able strings were machinably connected with machine-comprehensible expressions of the proper characteristics where machine-comprehensible expression of such characteristics was readily achievable.

In fact, in the use cases suggested in the examples, the 'format' attribute is used for indicating the semantic variety involved, while the "interpret-as" attribute is used for the more presentation-level encoding of these information items in overloads of the integers.

This element is in violation of XAG Checkpoint 4.9, "Do not assume that element or attribute names provide any information about element semantics." It is ironically so, in that the name of at least the 'format' attribute is the opposite of its suggested usage.

There are no semantics actually defined for these attributes, except for possible heuristic values which are clearly only understood within the working group, as they in some cases are the reverse of common usage. These two attributes are semantically as specific as the html:any.class attribute, but are named in such a way as to appear more specific although they are not. The language would be better off sticking with .class as in HTML if there is no semantic backup for the values applied under these names; but the format should set up mechanisms for backing up the sense of the values of marks which guide the same rendering decisions as here, and not leave these as bare user-defined strings.

This syntax, or the HTML-like .class attribute syntax, could perhaps be characterized in metadata and brought up to a level of definition meeting XAG Guideline 4.

On the other hand, the information to be conveyed by markup with this element could be spelled out in the metadata section.

In future production use, the information that "say-as" is designed to denote should mostly be handled by lexicon references, but the lexicon standard is not there yet. A dc.relation.conformsTo link to a type declaration in the XSD type system would, however, be a feasible form of inline lexicon support for the kinds of characteristics that seem to be targeted here.

Please consider ways that we can get the value domain and appropriate application information that goes with these information elements better exposed for processing in adaptive applications.

There is information that this element is trying to capture that is very important in the speech rendering of texts. It is just not modeled well in this language feature. The markup should focus on the content species. An ordinal number, for example, is a well-known conceptual species; there are multiple definitions in standards that one could refer to in order to convey its nature. This will give the speech generation module what it needs by way of a decision basis in order to inflect the voicing appropriately. An undefined user-inserted string doesn't establish a basis for interoperation with respect to the applicable semantics. This has been clear from the history of 'rel' and 'rev' on html:link and similar attributes elsewhere.

Proposed disposition: Accepted with changes

As stated in the second paragraph of 2.1.4, the Working Group expects to produce a separate document that will define values for the interpret-as, format, and detail attributes. There is significant interest within the WG in defining these values, but not at the expense of delaying the publication of SSML 1.0. For that reason, this definition work is slated to begin when the SSML 1.0 specification reaches the Candidate Recommendation stage.

Email Trail:



Issue SSCR143-1

From Alex Monaghan

There has obviously been a deliberate decision to remove the finite set of defined attributes from the "say-as" specification and replace them with an infinite set of unspecified attributes. It would be interesting to know why this decision was taken: was it to increase the ease of conformance, or to increase flexibility, or because the finite set was getting too big to handle?

In any case, Al Gilman's points seem valid. In particular, there are some obvious and easily-understood types which should be specifiable using the "say-as" tag (e.g. number, acronym, spell-out), where all systems should use the same attribute names. Of course there are other (sub)types which are not relevant to all languages or domains: morphological case on ordinal numbers in Finnish, or pronunciation of chemical formulae, for instance.

By all means allow the inclusion of additional (sub)types of "say-as", but please bring back the requirement for standard treatment of basic types, and please ensure that additional types are properly specified.

Proposed disposition: Accepted with changes

As stated in the second paragraph of 2.1.4, the Working Group expects to produce a separate document that will define values for the interpret-as, format, and detail attributes. There is significant interest within the WG in defining these values, but not at the expense of delaying the publication of SSML 1.0. For that reason, this definition work is slated to begin when the SSML 1.0 specification reaches the Candidate Recommendation stage.

Email Trail:



Issue SSCR144-1

From Richard Schwerdtfeger

In reviewing the SSML specification we (PF Group) overlooked an extremely critical missing feature in the last call draft.

It is absolutely essential that SSML support a <STOP> command.

Scenario:

Screen reader users will often hit the stop command to tell the speech synthesizer to stop speaking. Screen readers would use the <MARK> annotation as a way to have the speech engine tell the screen reader when speech has been processed (marker processed). In the event that the user tells the screen reader to stop speaking, the screen reader should be able to send a stop command to the speech engine, which would ultimately flush the speech buffers. Markers not returned would help the screen reader know where the user left off in the user interface (maintain point of regard relative to what has been spoken).

I apologize for not submitting this in our last call review, but this is a hard requirement. Otherwise, SSML cannot support screen readers.

Proposed disposition: Rejected

This request was removed by the requestor.

Email Trail:



Issue SSCR145-1

From I18N Interest Group

[01]  For some languages, text-to-speech conversion is more difficult
       than for others. In particular, Arabic and Hebrew are usually
       written with none or only a few vowels indicated. Japanese
       often needs separate indications for pronunciation.
       It was not clear to us whether such cases were considered,
       and if they had been considered, what the appropriate
       solution was.
       SSML should be clear about how it is expected to handle these
       cases, and give examples. Potential solutions we came up with:
       a) require/recommend that text in SSML is written in an
       easily 'speakable' form (i.e. vowelized for Arabic/Hebrew,
       or with Kana (phonetic alphabet(s)) for Japanese; problem:
       displaying the text visually would not be satisfactory in this
       case); b) using <sub>; c) using <phoneme> (problem: only
       having IPA available would be too tedious on authors);
       d) reusing some otherwise defined markup for this purpose
       (e.g. <ruby> from http://www.w3.org/TR/ruby/ for Japanese);
       e) creating some additional markup in SSML.
Analysis:

VBWG Rejected this.

I18N disputed response:

I suspect, from discussions with WAI on this topic and some research with experts in the field, that the lack of broad support by vendors for Arabic and Hebrew is actually a function of the fact that (unvowelled) text in these scripts is more difficult to support than text in other scripts. Of course, this issue can be circumvented by adding vowels to all text used in SSML - that would probably be feasible for text written specifically for synthesis, but would not be appropriate for text that is intended to be read visually.

I also worry that considering only languages "supported by synthesis vendors today" is running counter to the idea of ensuring universal access. It's like saying it's OK to design the web for English if the infrastructure only supports English. The i18n group is trying to ensure that we remove obstacles to adoption of technology by people from an ever-growing circle of languages and cultures.

MJD: Agreed with Richard. This is really important, and goes to the core of the I18N activity. There may be a chicken-and-egg problem for Hebrew and Arabic, and the spec should clearly state what is allowed and what is not. In addition, there are enough vendors for Japanese, I guess, so Japanese could be used as an example, and Arabic/Hebrew just explained in the text.

Proposed disposition: Rejected

See http://www.w3.org/International/2003/ssml10/ssml-feedback.html for remaining discussion details.

We will add examples in English and Japanese illustrating how an author might deal with ambiguous ideographs.

Email Trail:



Issue SSCR145-2

From I18N Interest Group

General: Tagging for bidirectional rendering is not needed
[02]  for text-to-speech conversion. But there is some provision
       for SSML content to be displayed visually (to cover WAI
       needs). This will not work without adequate support of bidi
       needs, with appropriate markup and/or hooks for styling.
Analysis:

VBWG rejected this.

I18N disputed the response:

Disagree - see the example at http://www.w3.org/International/questions/qa-bidi-controls.html (in the Background) - the bidi algorithm alone is not sufficient to produce the correct ordering of text for display in this case.

xml:lang is not sufficient or appropriate to resolve bidi issues because there are many minority languages that use RTL scripts. This is an important issue.

After a joint discussion we decided that Martin will write a whitepaper describing the problem and a proposed solution and forward it to Philip Hoschka's group. Philip's group will then figure out the next step. We will also add this as an issue for VBWG V3 discussions.

Proposed disposition: Rejected

See http://www.w3.org/International/2003/ssml10/ssml-feedback.html for remaining discussion details.

This issue will be presented to the Hypertext Coordination Group for review and a decision.

Email Trail:



Issue SSCR145-3

From I18N Interest Group

General: Is there a tag that allows changing the language in
[03]  the middle of a sentence (such as <html:span>)? If not,
       why not? This functionality needs to be provided.

Proposed disposition: N/A

Yes, the <voice> tag. In section 3.1.2 (xml:lang), we will note that the <voice> element can be used to change just the language.
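
For example (a sketch of the usage described above):

     The French word for cat is <voice xml:lang="fr">chat</voice>.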

Email Trail:



Issue SSCR145-4

From I18N Interest Group

Abstract: 'is part of this set of new markup specifications': Which set?
[04]
Analysis:

VBWG answered this.

I18N was dissatisfied with the response:

No. I suggest "The Voice Browser Working Group has sought to develop standards for markup to enable access to the Web using spoken interaction with voice browsers. The Speech Synthesis Markup Language Specification is one of these standards,..."

The VBWG considers the term "voice browsers" to be too restrictive for the use of SSML.

Proposed disposition: N/A

See http://www.w3.org/International/2003/ssml10/ssml-feedback.html for remaining discussion details.

We will replace "part of this set of new markup specifications for voice browsers," with "one of these standards".

Email Trail:



Issue SSCR145-5

From I18N Interest Group

Intro: 'The W3C Standard' -> 'This W3C Specification'

Proposed disposition: Accepted

Accepted.

Email Trail:



Issue SSCR145-6

From I18N Interest Group

Intro: Please briefly describe the intended uses of SSML here,
[06]   rather than having the reader wait for Section 4.
Analysis:

Martin Duerst and Dan Burnett spoke about this one in person. Martin wanted the addition of some text mentioning that SSML tags can be used in a stand-alone document, combined with other namespaces in a document, or imported into a namespace used in a document.
Dan suggested this might be addressed sufficiently by a planned rearrangement of the sections of the specification.

VBWG rejected this.

I18N disputed the response:

We think you should still have a short paragraph at the beginning of the intro to indicate the intended use of SSML, who should use it, and how.

This will help people:

  • decide whether or not they need to read further
  • help people to understand the application of concepts better as they read the spec (for example, I was always confused about whether this was intended to be used on its own or with other markup such as XHTML, and whether that was untouched or modified existing XHTML. This made it difficult to really understand all the implications of what I was reading straight away.)

Proposed disposition: Accepted

We accept your request to better illustrate up front the contexts in which we expect SSML to occur and the common ways we expect it to be used.

Email Trail:



Issue SSCR145-7

From I18N Interest Group

Section 1, para 2: Please briefly describe how SSML and Sable are
[07]  related or different.

Proposed disposition: Accepted

Accepted.

Email Trail:



Issue SSCR145-8

From I18N Interest Group

1.1, table: 'formatted text' -> 'marked-up text'

Proposed disposition: Accepted

See http://www.w3.org/International/2003/ssml10/ssml-feedback.html for remaining discussion details.

Accepted.

Email Trail:



Issue SSCR145-9

From I18N Interest Group

1.1, last bullet: add a comma before 'and' to make
[09]  the sentence more readable

Proposed disposition: Accepted

Accepted.

Email Trail:



Issue SSCR145-10

From I18N Interest Group

1.2, bullet 4, para 1: It might be nice to contrast the 45 phonemes
[10] in English with some other language. This is just one case that
      shows that there are many opportunities for more internationally
      varied examples. Please take any such opportunities.
Analysis:

The VBWG asked for a specific text proposal. The I18N group replied with the following:

http://pluto.fss.buffalo.edu/classes/psy/jsawusch/psy719/Articulation-2.pdf says Hawai'ian has 11 phonemes. Hawai'ian is indeed very low in phonemes, but 11 seems too low. http://www.ling.mq.edu.au/units/ling210-901/phonology/210_tutorials/tutorial1.html gives 12 with actual details, and may be correct. http://www.sciam.com/article.cfm?articleID=000396B3-70AD-1E6E-A98A809EC5880105 contains other numbers: 18 for Hawai'ian, and more than 100 for !Kung.

We could say something like "Hawai'ian includes fewer than 15 phonemes". Bernard Comrie's Major Languages of South Asia, The Middle East and Africa lists 29 phonemes for Persian. His book Major Languages of East & South East Asia lists 22 for Tagalog.

The Atlas of Languages, by Comrie et al., lists 14 phonemes for Hawai'ian and says that Rotokas, a Papuan language of Bougainville in the North Solomons, is recorded in the Guinness Book of Records as the language with the fewest phonemes: 5 vowels and 6 consonants.

Proposed disposition: Accepted

See http://www.w3.org/International/2003/ssml10/ssml-feedback.html for remaining discussion details.

Thank you for your suggestions. We will apply some of them to this section.

Email Trail:



Issue SSCR145-11

From I18N Interest Group

1.2, bullet 4, para 3: "pronunciation dictionary" ->
[11] "language-specific pronunciation dictionary"

Proposed disposition: Accepted with changes

See http://www.w3.org/International/2003/ssml10/ssml-feedback.html for remaining discussion details.

Instead of this change, we will add "(which may be language dependent)" after the word "dictionary".

Email Trail:



Issue SSCR145-12

From I18N Interest Group

1.2:  How is "Tlalpachicatl" pronounced? Other examples may be
[12]  St. John-Smyth (sinjen-smaithe), Caius College
       (keys college), or President Tito (sutto) [president of the
       republic of Kiribati (kiribass)].
Analysis:

The VBWG asked for a specific text proposal. After a joint discussion, it was agreed that the Mexican example would be replaced with one of the ones given above, along with a pronunciation hint as shown.

Proposed disposition: Accepted

We will replace the existing Mexican example with one of the ones given, along with a pronunciation hint using English sound-alikes as shown.
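
Should the hint also need to be speakable rather than purely explanatory, one way to carry it in SSML itself is the <sub> element (a sketch; the alias spelling is our own):

     <sub alias="sinjen smaithe">St. John-Smyth</sub>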

Email Trail:



Issue SSCR145-13

From I18N Interest Group

1.1 and 1.5: Having a 'vocabulary' table in 1.1 and then a
[13] terminology section is somewhat confusing.
      Make 1.1 e.g. more text-only, with a reference to 1.5,
      and have all terms listed in 1.5.

Proposed disposition: Accepted

We agree that this is confusing. We will make section 1.1 more text-only and cross-reference as necessary. We will also remove "Vocabulary" from the title of section 1.1.

Email Trail:



Issue SSCR145-14

From I18N Interest Group

1.5: The definition of anyURI in XML Schema is considerably wider
[14] than RFC 2396/2732, in that anyURI allows non-ASCII characters.
      For internationalization, this is very important. The text
      must be changed to not give the wrong impression.

Proposed disposition: Accepted

We will amend the text to indicate that only the Schema reference is normative and not the references to RFC2396/2732.

Email Trail:



Issue SSCR145-15

From I18N Interest Group

1.5 (and 2.1.2): This (in particular 'following the
[15]  XML specification') gives the wrong impression of where/how
      xml:lang is defined. xml:lang is *defined* in the XML spec,
      and *used* in SSML. Descriptions such as 'a language code is
      required by RFC 3066' are confusing. What kind of language code?
      Also, XML may be updated in the future to a new version of RFC
      3066, SSML should not restrict itself to RFC 3066
      (similar to the recent update from RFC 1766 to RFC 3066).
      Please check the latest text in the XML errata for this.

Proposed disposition: Accepted

All that you say is correct. We will revise the text to clarify as you suggest.

Email Trail:



Issue SSCR145-16

From I18N Interest Group

2., intro: xml:lang is an attribute, not an element.

Proposed disposition: Accepted

Thank you. We will correct this.

Email Trail:



Issue SSCR145-17

From I18N Interest Group

2.1.1, para 1: Given the importance of knowing the language for
[17] speech synthesis, the xml:lang should be mandatory on the root
      speak element. If not, there should be a strong injunction to use it.

Proposed disposition: Accepted

Accepted. xml:lang will now be mandatory on the root speak element.
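
A minimal conforming document under this change would therefore look like the following sketch (the namespace is the one used in the draft):

     <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
            xml:lang="en-US">
       Hello, world.
     </speak>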

Email Trail:



Issue SSCR145-18

From I18N Interest Group

2.1.1: 'The version number for this specification is 1.0.': please
[18] say that this is what has to go into the value of the 'version'
      attribute.

Proposed disposition: Accepted

Accepted.

Email Trail:



Issue SSCR145-19

From I18N Interest Group

2.1.2., for the first paragraph, reword: 'To indicate the natural
[19] language of an element and its attributes and subelements,
      SSML uses xml:lang as defined in XML 1.0.'

Proposed disposition: Accepted with changes

This is related to point 15. We will reword this to correct the problems you mention in that point, but the rewording may vary some from the text you suggest.

Email Trail:



Issue SSCR145-20

From I18N Interest Group

The following elements also should allow xml:lang:
[20] - <prosody> (language change may coincide with prosody change)
      - <audio> (audio may be used for foreign-language pieces)
      - <desc> (textual description may be different from audio,
           e.g. <desc xml:lang='en'>Song in Japanese</desc>
      - <say-as> (specific construct may be in different language)
      - <sub>
      - <phoneme>
Analysis:

VBWG rejected this and asked for an example of why the description would be in a language different from that in which it is embedded.

I18N disputed the response:

We are not sure why you should need to use the voice element in addition to these. First, it seems like a lot of redundant work.

It is also counter to the general usage of xml:lang in XHTML/HTML, XML, etc. (e.g. you don't usually use a span element if another element already surrounds the text you want to specify).

Allowing xml:lang on other tags also integrates the language information better into the structure of the document. For example, suppose you wanted to style or extract all descriptions in a particular language - this would be much easier if the xml:lang was associated directly with that content.

It would also help reduce the likelihood of errors where the voice element becomes separated from the element it is qualifying.

Re "why the description would be in a language different from that in which it is embedded": if the author had embedded, e.g., a sound bite in another language (such as JFK saying "Ich bin ein Berliner"), the desc element could be used to transcribe the text for those who cannot or do not want to play the audio. A similar approach could be used for sites that teach language, or for multilingual dictionaries, to provide a fallback in case the audio cannot be played.

Proposed disposition: Rejected

See http://www.w3.org/International/2003/ssml10/ssml-feedback.html for remaining discussion details.

We accept your proposal to add the xml:lang attribute onto the <desc> element and will make clear that it will in no way affect the output when SSML is used in its normal manner of output audio production. We will also add an example.
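
A sketch of the accepted usage, following the JFK example above (the audio source value is hypothetical):

     <audio src="jfk-berliner.wav">
       <desc xml:lang="en">Recording of JFK saying "Ich bin ein Berliner"</desc>
       Ich bin ein Berliner
     </audio>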

Email Trail:



Issue SSCR145-21

From I18N Interest Group

2.1.2: 'text normalization' (also in 2.1.6): What does this mean?
[21] It needs to be clearly specified/explained, otherwise there may
      be confusion with things such as NFC (see Character Model).

Proposed disposition: Accepted

We will add a reference, both here and in section 2.1.6, to section 1.2, step 3, where this is described.

Email Trail:



Issue SSCR145-22

From I18N Interest Group

2.1.2, example 1: Overall, it may be better to use utf-8 rather than
[22] iso-8859-1 for the specification and the examples.

Proposed disposition: Accepted with changes

See http://www.w3.org/International/2003/ssml10/ssml-feedback.html for remaining discussion details.

The document is already in UTF-8, the default for both XML documents and W3C specifications. We will leave the Italian example in Latin-1. For everything else we will explicitly set the encoding to UTF-8. In the example, we will include the IPA characters in a comment so that browsers that can display them will do so. Because many platform/browser/text editor combinations do not correctly cut and paste Unicode text, we'll leave the entity-escaped versions in the code itself. We will also comment that one would normally use the UTF-8 representation of these symbols and explain why we put them in a comment.

Email Trail:



Issue SSCR145-23

From I18N Interest Group

2.1.2, example 1: To make the example more realistic, in the paragraph
[23] that uses lang="ja" you should have Japanese text - not an English
      transcription, which may not be usable as such on a Japanese
      text-to-speech processor. In order to make sure the example can be
      viewed even in situations where there are no Japanese fonts available,
      and can be understood by everybody, some explanatory text can provide
      the romanized form. (We can help with Japanese if necessary.)
Analysis:

The VBWG asked for the I18N group to rewrite our example appropriately. They replied with the following:

See http://www.w3.org/International/2003/ssml10/ssml-feedback.html for remaining discussion details.

Nihongo-ga wakarimasen. -> 日本語が分かりません。

Proposed disposition: Accepted

Thanks for the Japanese text. We will incorporate it into the example.
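
The revised example might then read along these lines (a sketch; the comment carries the romanized gloss, as suggested in point 23):

     <p xml:lang="ja">
       <!-- Nihongo-ga wakarimasen. ("I do not understand Japanese.") -->
       日本語が分かりません。
     </p>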

Email Trail:



Issue SSCR145-24

From I18N Interest Group

2.1.2, 1st para after 1st example: Editorial.  We prefer "In the
[24] case that a document requires speech output in a language not
      supported by the processor, the speech processor largely determines
      the behavior."

Proposed disposition: Accepted

Accepted.

Email Trail:



Issue SSCR145-25

From I18N Interest Group

2.1.2, 2nd para after 1st example: "There may be variation..."
[25] Is the 'may' a keyword as in RFC 2119? I.e., are you allowing
      conformant processors to vary in the implementation of xml:lang?
      If yes, what variations exactly would be allowed?

Proposed disposition: N/A

Yes, the "may" is a keyword as in RFC 2119, and conformant processors are permitted to vary in their implementation of xml:lang in SSML. Although processors are required to implement the standard xml:lang behavior defined by XML 1.0, in SSML the attribute also implies a change in voice, which may or may not be observed by the processor. We will clarify this in the specification.

Email Trail:



Issue SSCR145-26

From I18N Interest Group

2.1.3: 'A paragraph element represents the paragraph structure'
[26] -> 'A paragraph element represents a paragraph'. (same for sentence)
      Please decide to either use <p> or <paragraph>, but not both
      (and same for sentence).

Proposed disposition: Accepted

We accept the editorial change.
We will remove the <paragraph> and <sentence> elements.
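
With the short forms retained, paragraph and sentence structure would be marked up as, for example:

     <p>
       <s>This is the first sentence of the paragraph.</s>
       <s>Here is another sentence.</s>
     </p>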

Email Trail:



Issue SSCR145-27

From I18N Interest Group

2.1.4: <say-as>: For interoperability, defining attributes
[27] and giving (convincingly useful) values for these attributes
      but saying that these will be specified in a separate document
      is very dangerous. Either remove all the details (and then
      maybe also the <say-as> element itself), or say that the
      values given here are defined here, but that future versions
      of this spec or separate specs may extend the list of values.
      [Please note that this is only about the attribute values,
       not the actual behavior, which is highly language-dependent
       and probably does not need to be specified in every detail.]

Proposed disposition: Accepted

As you suggest, we will remove the examples from this section in order to reduce confusion.

Email Trail:



Issue SSCR145-28

From I18N Interest Group

2.1.4, interpret-as and format, 6th paragraph: the requirement that
[28] the text processor has to render text in addition to the indicated
      content type is a recipe for bugwards compatibility (which
      should be avoided).

Proposed disposition: Rejected

See http://www.w3.org/International/2003/ssml10/ssml-feedback.html for remaining discussion details.

A primary goal of SSML is to render text whenever possible rather than skipping content or throwing an error. Thus, our intent is to ensure that 'extra' content or content that does not match a given type hint is rendered in whatever way possible rather than being discarded. However, it may be important to clarify in the specification that processors may notify the hosting environment of such mismatches and that any unexpected content is free to be rendered in a manner different from what any type hint might suggest.

Email Trail:



Issue SSCR145-29

From I18N Interest Group

2.1.4, 'locale': change to 'language'.

Proposed disposition: Accepted

Accepted.

Email Trail:



Issue SSCR145-30

From I18N Interest Group

2.1.4: How is format='telephone' spoken?

Proposed disposition: N/A

How it would be spoken is processor-dependent. The <say-as> element only provides information on how to interpret (or normalize) a set of input tokens, not on how it is to be spoken.
Also, as you pointed out in point 27, "format='telephone'" is merely an example and not a specified value, at least not at this time.

Email Trail:



Issue SSCR145-31

From I18N Interest Group

2.1.4: Why are there 'ordinal' and 'cardinal' values for both
[31]   interpret-as and format?

Proposed disposition: N/A

Both are shown as examples to indicate two possible ways it could be done. Neither is actually a specified way to use the element, as you pointed out in point 27.

Email Trail:



Issue SSCR145-32

From I18N Interest Group

2.1.4 'The detail attribute can be used for all say-as content types.'
[32]   What's a content type in this context?

Proposed disposition: N/A

This wording was accidentally left over from an earlier draft. We will correct it.

Email Trail:



Issue SSCR145-33

From I18N Interest Group

2.1.4 detail 'strict': 'speak letters with all detail': As opposed
[33]  to what (e.g. in that specific example)?

Proposed disposition: N/A

In this example, without the detail attribute a processor might leave out the colon or the dash, or it might not distinguish between lower case and capital letters.
However, this is not actually a specified way to use the attribute, as you pointed out in point 27.

Email Trail:



Issue SSCR145-34

From I18N Interest Group

2.1.4, last table: There seem to be some fixed-width aspects in the
[34]   styling of this table. This should be corrected to allow complete
        viewing and printing at various overall widths.

Proposed disposition: Rejected

As you suggested in point 27, we will be removing all of the tables of examples in this section. If and when we re-introduce this table, we will correct any styling errors that remain.

Email Trail:



Issue SSCR145-35

From I18N Interest Group

2.1.4, 4th para (and several similar in other sections):
[35]  "The say-as element can only contain text." would be easier
       to understand; we had to look around to find out whether the
       current phrasing described an EMPTY element or not.

Proposed disposition: Accepted with changes

The statement you refer to, which is present in all of the element descriptions, will be modified to more fully describe the content model for the element, although it may not be worded exactly as you suggest.

Email Trail:



Issue SSCR145-36

From I18N Interest Group

2.1.4. For many languages, there is a need for additional information.
[36]   For example, in German, ordinal numbers are denoted with a number
       followed by a period (e.g. '5.'). They are read depending on case
       and gender of the relevant noun (as well as depending on the use
       of definite or indefinite article).

Proposed disposition: Rejected

We have had considerable discussion on this point. There are two parts to our response:
  1. It is assumed that the synthesis processor will use all contextual information already at its disposal in order to render the text and markup it is given. For example, any relevant case or gender information that can be determined from text surrounding the <say-as> element is expected to be used.
  2. The ways and contexts in which information other than the specific number value can be encoded via human language are many and varied. For example, the way you count in Japanese varies based on the type of object that you are counting. That level of complexity is well outside the intended use of the <say-as> element. It is expected in such cases that either the necessary contextual information is available in normal surrounding text, as described in part 1 above, or the text is normalized by the application writer (e.g. "2." -> "zweiten"; see the sketch after this list).
    We welcome any complete, multilingual proposals for consideration for a future version of SSML.
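
A sketch of the author-side normalization mentioned in part 2, using the <sub> element (the German sentence is our own; the inflected form follows the example in the response):

     Wir treffen uns am <sub alias="zweiten">2.</sub> Mai.
     <!-- displayed as "2.", spoken as the inflected ordinal "zweiten" -->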

Email Trail:



Issue SSCR145-37

From I18N Interest Group

2.1.4, 4th row of 2nd table: I've seen some weird phone formats, but
[37]  nothing quite like this! Maybe a more normal example would NOT
       pronounce the separators. (Except in the Japanese case, where the
       spaces are (sometimes) pronounced (as 'no').)

Proposed disposition: Rejected

As you suggested in point 27, we will be removing these examples altogether. If we should decide to reintroduce them at some point, we would be happy to incorporate a revised or extended example from you.

Email Trail:



Issue SSCR145-38

From I18N Interest Group

2.1.5, <phoneme>:
[38]  It is unclear to what extent this element is designed for
       strictly phonemic and phonetic notations, or also (potentially)
       for notations that are more phonetic-oriented than usual writing
       (e.g. Japanese kana-only, Arabic/Hebrew with full vowels,...)
       and where the boundaries are to other elements such as <say-as>
       and <sub>. This needs to be clarified.

Proposed disposition: Accepted

We will clarify in the text that this element is designed for strictly phonemic and phonetic notations and that the example uses Unicode to represent IPA. We will also clarify that the phonemic/phonetic string does not undergo text normalization and is not treated as a token for lookup in the lexicon, while values in <say-as> and <sub> may undergo both.

Email Trail:



Issue SSCR145-39

From I18N Interest Group

2.1.5 There may be different flavors and variants of IPA (see e.g.
[39]  references in ISO 10646). Please make sure it is clear which
       one is used.

Proposed disposition: Accepted

See http://www.w3.org/International/2003/ssml10/ssml-feedback.html for remaining discussion details.

Accepted. We will clarify this via references to the IPA and/or Unicode.

Email Trail:



Issue SSCR145-40

From I18N Interest Group

2.1.5 IPA is used both for phonetic and phonemic notations. Please
[40]  clarify which one is to be used.

Proposed disposition: Accepted

IPA is an alphabet of phonetic symbols. The only representation in IPA is phonetic, although it is common to select specific phones as representative examples of phonemic classes. Also, IPA is only one possible alphabet that can be used in this element. The <phoneme> element will accept both phonetic and phonemic alphabets, and both phonetic and phonemic string values for the ph attribute. We will clarify this and add or reference a description of the difference between phonemic and phonetic.
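
For concreteness, a sketch of a phonemic use of the element (the ph value is one plausible IPA rendering of the word, written here in raw UTF-8 rather than entity escapes):

     <phoneme alphabet="ipa" ph="təˈmeɪtoʊ">tomato</phoneme>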

Email Trail:



Issue SSCR145-41

From I18N Interest Group

2.1.5 This may need a note that not all characters used in IPA are
[41]  in the IPA block.

Proposed disposition: Accepted

Accepted.

Email Trail:



Issue SSCR145-42

From I18N Interest Group

2.1.5 This seems to say that the only (currently) allowed value for
[42]  alphabet is 'ipa'. If this is the case, this needs to be said
       very clearly (and it may as well be defined as default, and
       in that case the alphabet attribute made optional). If there
       are other values currently allowed, what are they? How are
       they defined?

Proposed disposition: Accepted

See http://www.w3.org/International/2003/ssml10/ssml-feedback.html for remaining discussion details.

We accept your proposal to limit the values of the alphabet attribute of the <phoneme> element to the string "ipa" or strings of the form "x-company" or "x-company-alphabet", where "company" and "alphabet" are vendor-specific values intended to differentiate among alphabets used by different synthesis engine vendors.
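As an illustrative sketch of the two permitted forms (the vendor alphabet name "x-acme-phonebet" and its notation are hypothetical):

<phoneme alphabet="ipa" ph="kæt"> cat </phoneme>
<phoneme alphabet="x-acme-phonebet" ph="K AE T"> cat </phoneme>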

Email Trail:



Issue SSCR145-43

From I18N Interest Group

2.1.5 'alphabet' may not be the best name. Alphabets are sets of
[43]  characters, usually with an ordering. The same set of characters
      could be used in totally different notations.

Proposed disposition: Accepted

See http://www.w3.org/International/2003/ssml10/ssml-feedback.html for remaining discussion details.

We will add a note explaining the specific use of 'alphabet' in this context.

Email Trail:



Issue SSCR145-44

From I18N Interest Group

2.1.5 What are the interactions of <phoneme> for foreign language
[44]  segments? Do processors have to handle all of IPA, or only the
       phonemes that are used in a particular language? Please clarify.

Proposed disposition: Accepted

See http://www.w3.org/International/2003/ssml10/ssml-feedback.html for remaining discussion details.

It is our intent that processors must syntactically accept all strings of valid IPA codes (where validity is defined by the text we will write for item 39). It is our intent that processors should produce output when given IPA codes that can reasonably be considered to belong to the current language. The production of output when given other codes is entirely at processor discretion. We will add these explanations to the specification.

Email Trail:



Issue SSCR145-45

From I18N Interest Group

2.1.5, 1st example:  Please try to avoid character entities, as it
[45] suggests strongly that this is the normal way to input this stuff.
      (see also issue about utf-8 vs. iso-8859-1)
Analysis:

The VBWG asked the I18N group what the normal way was. They replied with the following:

Pure character data in utf-8.

Proposed disposition: Accepted with changes

See http://www.w3.org/International/2003/ssml10/ssml-feedback.html for remaining discussion details.

We will change the example as described in the response to point 22.

Email Trail:



Issue SSCR145-46

From I18N Interest Group

2.1.5 and 2.1.6: The 'alias' and 'ph' attributes in some
[46]  cases will need additional markup (e.g. for fine-grained
       prosody, but also for additional emphasis, bidirectionality).
       This would also help tools for translation,...
       But markup is not possible for attributes. These attributes
       should be changed to subelements, e.g. similar to the <desc>
       element inside <audio>.

Proposed disposition: Accepted/Rejected

See http://www.w3.org/International/2003/ssml10/ssml-feedback.html for remaining discussion details.

"the alias attribute is to be spoken" will be added to the text.
The remainder of this issue is to be deferred to the next version of SSML.

Email Trail:



Issue SSCR145-47

From I18N Interest Group

2.1.5 and 2.1.6: Can you specify a null string for the ph and alias
[47] attributes? This may be useful in mixed formats where the
      pronunciation is given by another means, e.g. with ruby annotation.
Analysis:

VBWG rejected this.

I18N disputed the response:

If SSML will be grafted onto ordinary Japanese text written in, say, XHTML, it is certain that at some point ruby text will be encountered. This is a visual device, but is character-based, involving a repetition of a portion of text in two different scripts, so the base text and the ruby text would both be read out by the synthesiser. This would not only sound strange, but be very distracting.

What we are asking is for the ability to nullify one of the runs of text.

It seems to me that this could happen in a number of ways:

  1. by removing the base in ruby text
  2. by allowing the text in the base not to be spoken, either by application of a null string or a style assignment
  3. by the speech processor recognising ruby and dealing with it appropriately.

I would like to know what the SSML group thinks is the best approach, and think that you should add some note about expected behaviour in this case.

Proposed disposition: Rejected

See http://www.w3.org/International/2003/ssml10/ssml-feedback.html for remaining discussion details.

We will defer this and reconsider this for the next version.

Email Trail:



Issue SSCR145-48

From I18N Interest Group

2.1.6 The <sub> element may easily clash or be confused with <sub>
[48]  in HTML (in particular because the specification seems to be
       designed to allow combinations with other markup vocabularies
       without using different namespaces). <sub> should be renamed,
       e.g. to <subst>.
Analysis:

VBWG rejected this.

I18N disputed the response:

I still think, regardless of the potential for overlapping element names, that it would be more immediately apparent what the meaning of this element was (and therefore more user friendly) if it was called <subst>.

Proposed disposition: Rejected

See http://www.w3.org/International/2003/ssml10/ssml-feedback.html for remaining discussion details.

We have not seen enough general interest to warrant this change.

Email Trail:



Issue SSCR145-49

From I18N Interest Group

2.1.6 For abbreviations,... there are various cases. Please check
[49]  that all the cases in
       http://lists.w3.org/Archives/Member/w3c-i18n-ig/2002Mar/0064.html
       are covered, and that the users of the spec know how to handle
       them.

Proposed disposition: Accepted

We will clarify within the text how application authors should handle the cases presented in the referenced email.

Email Trail:



Issue SSCR145-50

From I18N Interest Group

2.1.6, 1st para: "the specified text" ->
[50]   "text in the alias attribute value".

Proposed disposition: Accepted

Accepted.

Email Trail:



Issue SSCR145-51

From I18N Interest Group

2.2.1, between the tables: "If there is no voice available for the
[51]  requested language ... select a voice ... same language but different
       region..."  I'm not sure this makes sense.  I could understand that
       if there is no en-UK voice you'd maybe go for an en-US voice - this
       is a different DIALECT of English.  If there are no Japanese voices
       available for Japanese text, I'm not sure it makes sense to use an
       English voice. What happens in this situation?

Proposed disposition: N/A

See http://www.w3.org/International/2003/ssml10/ssml-feedback.html for remaining discussion details.

We will rewrite this section to make clear that a) there is an algorithm, b) the algorithm is ambiguous in some cases, and c) where ambiguous, the behavior is processor-specific.
We will also change "requested language" to "requested xml:lang".

Email Trail:



Issue SSCR145-52

From I18N Interest Group

2.2.1 It should be mentioned that in some cases, it may make sense to have
[52]  a short piece of e.g. 'fr' text in an 'en' text be spoken by
       an 'en' text-to-speech converter (the way it's often done by
       human readers) rather than to throw an error. This is quite
       different for longer texts, where it's useless to bother a
       user.
Analysis:

VBWG rejected this.

I18N disputed the response:

Even if this is already allowed at processor discretion, many implementers may forget that this may be a more reasonable behavior, so it should be mentioned.

Proposed disposition: Accepted

See http://www.w3.org/International/2003/ssml10/ssml-feedback.html for remaining discussion details.

We will describe this situation in the document and provide an example.
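A minimal sketch of the kind of example contemplated (the final wording in the specification may differ): a short French run inside English text that an English-only processor might reasonably speak with its English voice rather than throwing an error:

<speak version="1.0" xml:lang="en-US"
       xmlns="http://www.w3.org/2001/10/synthesis">
  <s>As they say in France, <voice xml:lang="fr">c'est la vie</voice>.</s>
</speak>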

Email Trail:



Issue SSCR145-53

From I18N Interest Group

2.2.1: We wonder if there's a need for multiple voices (e.g. a group of kids).

Proposed disposition: N/A

We have not had significant demand to standardize a value for this, e.g. <voice name="kids">. Individual processors are of course permitted to provide any voices they wish.

Email Trail:



Issue SSCR145-54

From I18N Interest Group

2.2.1, 2nd example: You should include some text here.
Analysis:

VBWG asked for example text in Japanese. Martin D. promised to send the example within a week or, failing that, to withdraw the request.

Proposed disposition: Accepted

See http://www.w3.org/International/2003/ssml10/ssml-feedback.html for remaining discussion details.

No example text was provided as promised.

Email Trail:



Issue SSCR145-55

From I18N Interest Group

2.2.1 The 'age' attribute should explicitly state that the integer
[55]  is years, not something else.

Proposed disposition: Accepted

Accepted.

Email Trail:



Issue SSCR145-56

From I18N Interest Group

2.2.1 The variant attribute should say what its index origin is
[56]  (e.g. either starting at 0 or at 1).

Proposed disposition: Accepted

Accepted. The text and schema will be adjusted to clarify that this attribute can only contain positive integers.

Email Trail:



Issue SSCR145-57

From I18N Interest Group

2.2.1 attribute name: (in the long term,) it may be desirable to use
[57]  a URI for voices, and to have some well-defined format(s)
       for the necessary data.

Proposed disposition: Rejected

This is an interesting suggestion that we will be happy to consider for the next version of SSML (after 1.0).

Email Trail:



Issue SSCR145-58

From I18N Interest Group

2.2.1, first example (and many other places): The line break between
[58]  the <voice> start tag and the text "It's fleece was white as snow."
       will have negative effects on visual rendering.
       (also, "It's" -> "Its")

Proposed disposition: N/A

See http://www.w3.org/International/2003/ssml10/ssml-feedback.html for discussion details.

Email Trail:



Issue SSCR145-59

From I18N Interest Group

2.2.1, description of priorities of xml:lang, name, variant,...:
[59]  It would be better to describe this clearly as priorities,
       i.e. to say that for voice selection, xml:lang has highest
       priority,...

Proposed disposition: Accepted with changes

See http://www.w3.org/International/2003/ssml10/ssml-feedback.html for remaining discussion details.

We like the existing text and will keep it. However, we will also add (upfront) a description based on priorities as you suggest.

Email Trail:



Issue SSCR145-60

From I18N Interest Group

2.2.3 What about <break> inside a word (e.g. for long words such as
[60]  German)? What about <break> in cases where words cannot
       clearly be identified (no spaces, such as in Chinese, Japanese,
       Thai). <break> should be allowed in these cases.
Analysis:

VBWG rejected this.

I18N disputed the response:

I'm confused. The reply says rejected, but then goes on to show an example of what we asked for. If a <break> automatically creates a boundary, then just say that it can be used in the middle of a word (or phrase in languages without spaces) and that's what happens.

Proposed disposition: Accepted

See http://www.w3.org/International/2003/ssml10/ssml-feedback.html for remaining discussion details.

Inserting any element creates a lexical boundary, so while it is acceptable to insert a <break/> in the middle of a word or phrase, doing so effectively splits the one word or phrase into two. We will clarify the relationship between words and tokens in the Introduction and note that breaking one token into multiple tokens will likely affect how the processor treats it. A simple English example is "cup<break/>board"; the processor will treat this as the two words "cup" and "board" rather than as one word with a pause in the middle.
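A sketch of the resulting behavior:

<speak version="1.0" xml:lang="en-US"
       xmlns="http://www.w3.org/2001/10/synthesis">
  cup<break/>board
</speak>

The processor speaks the two tokens "cup" and "board" with a break between them, not the single word "cupboard" with an internal pause.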

Email Trail:



Issue SSCR145-61

From I18N Interest Group

2.2.3 and 2.2.4: "x-high" and "x-low": the 'x-' prefix is part of
[61]  colloquial English in many parts of the world, but may be
       difficult to understand for non-native English speakers.
       Please add an explanation.

Proposed disposition: Accepted

We will add such an explanation.

Email Trail:



Issue SSCR145-62

From I18N Interest Group

2.2.4: Please add a note that customary pitch levels and
[62]  pitch ranges may differ quite a bit across natural languages, and that
       "high",... may refer to different absolute pitch levels for different
       languages. Example: Japanese generally has a much lower pitch range
       than Chinese.

Proposed disposition: Accepted

Accepted.

Email Trail:



Issue SSCR145-63

From I18N Interest Group

2.2.4, 'baseline pitch', 'pitch range': Please provide definition/
[63]   short explanation.

Proposed disposition: Accepted

We will add this.

Email Trail:



Issue SSCR145-64

From I18N Interest Group

2.2.4 'as a percent' -> 'as a percentage'

Proposed disposition: Accepted

Accepted.

Email Trail:



Issue SSCR145-65

From I18N Interest Group

2.2.4 What is a 'semitone'? Please provide a short explanation.

Proposed disposition: Accepted

We will add this.

Email Trail:



Issue SSCR145-66

From I18N Interest Group

2.2.4 In pitch contour, are white spaces allowed? At what places
[66]  exactly? In "(0%,+20)(10%,+30%)(40%,+10)", I would propose
       to allow whitespace between ')' and '(', but not elsewhere.
       This has the benefit of minimizing syntactic differences
       while allowing long contours to be formatted with line breaks.

Proposed disposition: Accepted

See http://www.w3.org/International/2003/ssml10/ssml-feedback.html for remaining discussion details.

Accepted.
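For illustration, a contour formatted with the whitespace this resolution permits (units appear on all target values, since bare numbers such as +20 were flagged as invalid in point 3a of SSCR146-1):

<prosody contour="(0%,+20Hz) (10%,+30%)
                  (40%,+10Hz)">
  Please hold while we connect you.
</prosody>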

Email Trail:



Issue SSCR145-67

From I18N Interest Group

2.2.4, bullets: Editorial nit.  It may help the first time reader to
[67]   mention that 'relative change' is defined a little further down.

Proposed disposition: Rejected

See http://www.w3.org/International/2003/ssml10/ssml-feedback.html for remaining discussion details.

Email Trail:



Issue SSCR145-68

From I18N Interest Group

2.2.4, 4th bullet: the speaking rate is set in words per minute.
[68]  In many languages what constitutes a word is often difficult to
       determine, and varies considerably in average length.
       So there have to be more details to make this work interoperably
       in different languages. Also, it seems that 'words per minute'
       is a nominal rate, rather than exactly counting words, which
        should be stated clearly. A much preferable alternative is to use
       another metric, such as syllables per minute, which has less
       unclarity (not

Proposed disposition: Accepted with changes

See http://www.w3.org/International/2003/ssml10/ssml-feedback.html for remaining discussion details.

Because of the difficulty of accurately defining the meaning of words per minute, syllables per minute, or phonemes per minute across all possible languages, we have decided to replace such rate specifications with a number that acts as a multiplier of the default rate. For example, a value of 1 means a speaking rate equal to the default rate, a value of 2 means twice the default rate, and a value of 0.5 means half the default rate. The default rate is processor-specific and will usually vary across both languages and voices. Percentage changes relative to the current rate are still permitted. Note that the effect of setting a specific words-per-minute rate (for languages for which that makes sense) can be achieved by explicitly setting the duration of the contained text via the duration attribute of the <prosody> element. The duration attribute can be used in this way for all languages and is therefore the preferred way of precisely controlling the rate of speech when that is desired.
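A sketch of the two mechanisms, the rate multiplier and the duration attribute:

<prosody rate="2"> twice the default speaking rate </prosody>
<prosody rate="0.5"> half the default speaking rate </prosody>
<prosody duration="5s"> these words take five seconds to speak </prosody>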

Email Trail:



Issue SSCR145-69

From I18N Interest Group

2.2.4, 5th bullet: If the default is 100.0, how do you make it
[69]  louder given that the scale ranges from 0.0 to 100.0?
       (or, in other words, is the default to always shout?)

Proposed disposition: N/A

See http://www.w3.org/International/2003/ssml10/ssml-feedback.html for remaining discussion details.

Maximum volume does not equal shouting. Shouting actually involves several prosodic changes, only one of which is volume.
Our internal poll determined that maximum volume is the default for most synthesis processors. The assumption is that you can a) reduce the volume within SSML and b) set the final true volume to anything you want through whatever general audio controls your audio system (PC volume control, speaker knob) makes available.
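For example, volume can only be reduced from the default of 100.0:

<prosody volume="50.0"> quieter than the surrounding speech </prosody>

Note that 50.0 is simply a point on the 0.0-100.0 scale; it does not imply half the perceived loudness.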

Email Trail:



Issue SSCR145-70

From I18N Interest Group

2.2.4, Please state whether units such as 'Hz' are case-sensitive
[70] or case-insensitive. They should be case-sensitive, because
      units in general are (e.g. mHz (milliHz) vs. MHz (MegaHz)).

Proposed disposition: Accepted

Although the units are already marked as case-sensitive in the Schema, we will clarify in the text that such units are case-sensitive.

Email Trail:



Issue SSCR145-71

From I18N Interest Group

2.3.3 Please provide an example of <desc>.

Proposed disposition: Accepted

We will add an example.

Email Trail:



Issue SSCR145-72

From I18N Interest Group

3.1  Requiring an XML declaration for SSML when XML itself
[72] doesn't require an XML declaration leads to unnecessary
      discrepancies. It may be very difficult to check this
      with an off-the-shelf XML parser, and it is not reasonable
      to require SSML implementations to write their own XML
      parsers or modify an XML parser. So this requirement
      should be removed (e.g. by saying that SSML requires an XML
      declaration when XML requires it).

Proposed disposition: Accepted

See http://www.w3.org/International/2003/ssml10/ssml-feedback.html for remaining discussion details.

We agree with and accept your suggestion to remove this requirement.

Email Trail:



Issue SSCR145-73

From I18N Interest Group

3.3, last paragraph before 'The lexicon element' subtitle:
[73] Please also say that the determination of
      what is a word may be language-specific.

Proposed disposition: Accepted

We will clarify this.

Email Trail:



Issue SSCR145-74

From I18N Interest Group

3.3 'type' attribute on lexicon element: What's this attribute used
[74] for? The media type will be determined from the document that
      is found at the 'uri' URI, or not?

Proposed disposition: N/A

See http://www.w3.org/International/2003/ssml10/ssml-feedback.html for remaining discussion details.

We will explain in the specification and follow the TAG finding on this issue.

Email Trail:



Issue SSCR145-75

From I18N Interest Group

4.1 'synthesis document fragment' -> 'speech synthesis document fragment'

Proposed disposition: Accepted

Accepted.

Email Trail:



Issue SSCR145-76

From I18N Interest Group

4.1  Conversion to stand-alone document: xml:lang should not
[76] be removed. It should also be clear whether content of
      non-synthesis elements should be removed, or only the
      markup.

Proposed disposition: Accepted

See http://www.w3.org/International/2003/ssml10/ssml-feedback.html for remaining discussion details.

Good point about xml:lang. We will modify the text to indicate that everything in our schema (including xml:lang, xml:base, etc.) is to be retained in the conversion and that all other non-synthesis namespace elements and their contents should be removed.
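A hedged sketch of the intended conversion (the host namespace and element here are hypothetical, and the exact rules will appear in the revised section 4.1): the fragment

<s xml:lang="fr">Bonjour <host:note xmlns:host="http://www.example.org/host">visual-only annotation</host:note></s>

would become the stand-alone document

<speak version="1.0" xml:lang="en-US"
       xmlns="http://www.w3.org/2001/10/synthesis">
  <s xml:lang="fr">Bonjour </s>
</speak>

with xml:lang retained and the non-synthesis element removed together with its content. (The en-US on the wrapper is an assumed value taken from the hypothetical host document.)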

Email Trail:



Issue SSCR145-77

From I18N Interest Group

4.4 'requirement for handling of languages': Maybe better to
[77] say 'natural languages', to avoid confusion with markup
      languages. Clarification is also needed in the following
      bullet points.

Proposed disposition: Accepted

Agreed. We will make this change.

Email Trail:



Issue SSCR145-78

From I18N Interest Group

4.5  This should say that a user agent has to support at least
[78] one natural language.

Proposed disposition: Accepted

Agreed. We will add this.

Email Trail:



Issue SSCR145-79

From I18N Interest Group

App A: 'http://www.w3c.org/music.wav': W3C's Web site is www.w3.org.
[79]   But this example should use www.example.org or www.example.com.

Proposed disposition: Accepted

We will correct this.

Email Trail:



Issue SSCR145-80

From I18N Interest Group

App B: 'synthesis DTD' -> 'speech synthesis DTD'

Proposed disposition: Accepted

See http://www.w3.org/International/2003/ssml10/ssml-feedback.html for remaining discussion details.

Accepted.

Email Trail:



Issue SSCR145-81

From I18N Interest Group

App D: Why does this mention 'recording'? Please remove or explain.

Proposed disposition: Accepted with changes

This was accidentally left in when originally copied from the VoiceXML specification. It will be corrected.

Email Trail:



Issue SSCR145-82

From I18N Interest Group

App E: Please give a reference for the application to the IETF/IESG/IANA
[82]   for the content type 'application/ssml+xml'.

Proposed disposition: Accepted

See http://www.w3.org/International/2003/ssml10/ssml-feedback.html for remaining discussion details.

Good suggestion. This is a general problem that applies to all of the specifications from the Voice Browser Working Group. We will address it in a consistent manner across all of our specifications by providing the most appropriate and relevant references at the time of publication.

Email Trail:



Issue SSCR145-83

From I18N Interest Group

App F: 'Support for other phoneme alphabets.': What's a 'phoneme alphabet'?

Proposed disposition: N/A

See http://www.w3.org/International/2003/ssml10/ssml-feedback.html for remaining discussion details.

We will add a link in Appendix F to section 2.1.5. Dan notes that Appendix F may disappear from the document before Recommendation.

Email Trail:



Issue SSCR145-84

From I18N Interest Group

App F, last paragraph: 'Unfortunately, ... no standard for designating
[84]   regions...': This should be worded differently. RFC 3066 provides
        for the registration of arbitrary extensions, so that e.g.
        en-gb-accent-scottish and en-gb-accent-welsh could be registered.

Proposed disposition: Accepted

Agreed. We will revise the text appropriately.

Email Trail:



Issue SSCR145-85

From I18N Interest Group

App F, bullet 3: I guess you already know that intonation
[85]   requirements can vary considerably across languages, so you'll
        need to cast your net fairly wide here.

Proposed disposition: N/A

See http://www.w3.org/International/2003/ssml10/ssml-feedback.html for remaining discussion details.

This will be treated as a deferred requirement.

Email Trail:



Issue SSCR145-86

From I18N Interest Group

App G: What is meant by 'input' and 'output' languages? This is the
[86]   first time this terminology is used. Please remove or clarify.

Proposed disposition: Accepted

This is old text. We will clarify.

Email Trail:



Issue SSCR145-87

From I18N Interest Group

App G: 'overriding the SSML Processor default language': There should
[87]   be no such default language. An SSML Processor may only
        support a single language, but that's different from
        assuming a default language.

Proposed disposition: Accepted

See http://www.w3.org/International/2003/ssml10/ssml-feedback.html for remaining discussion details.

Agreed. As with item 86, this is old text. We will correct this.

Email Trail:



Issue SSCR145-88

From I18N Interest Group

[88] The appendices should be ordered so that the normative
      ones appear before the informative ones.

Proposed disposition: Accepted

Agreed.

Email Trail:



Issue SSCR145-89

From I18N Interest Group

This is an important topic that has been discussed with other groups since we did the review.

There are a number of elements that allow only PCDATA content and attributes containing text to be spoken (e.g. the alias attribute of the <sub> element, and the <desc> element).

Use of PCDATA precludes the possibility of language change or bidi markup for a part of the text.

Proposed changes:

  1. Elements should always allow for bidi markup and language change to be applied.
  2. Attributes containing text that will be spoken should be converted to elements.

[Note: we have recently discussed this with the HTML WG wrt XHTML2.0 and they have agreed to take similar action as we are recommending here.]

Proposed disposition: Rejected

See http://www.w3.org/International/2003/ssml10/ssml-feedback.html for remaining discussion details.

This is a new request well outside the timeframe for comments on this specification. We agree with the principle and will happily consider this request for a future version of SSML beyond 1.0.

Email Trail:



Issue SSCR146-1

From XML Schema WG

1. Significant Issue
There is a subtle error in the schema.  Because the speak.class group
contains elements that have anonymous types, it cannot be used in a
restriction as it is in synthesis.xsd.  This is because a particle
containing an anonymous type will never be considered a valid
restriction of another particle with an anonymous type (even if they
reuse the same type definition from a named model group).  This can
be remedied by giving the metadata and lexicon elements named types,
as in:

<xsd:group name="speak.class">
  <xsd:sequence>
    <xsd:choice minOccurs="0" maxOccurs="unbounded">
      <xsd:element name="metadata" type="metadata"/>
      <xsd:element name="lexicon" type="lexicon"/>
      <xsd:group ref="sentenceAndStructure.class"/>
    </xsd:choice>
  </xsd:sequence>
</xsd:group>

<xsd:complexType name="metadata">
  <xsd:choice minOccurs="0" maxOccurs="unbounded">
    <xsd:any namespace="##other" processContents="lax"/>
  </xsd:choice>
  <xsd:anyAttribute namespace="##any" processContents="strict"/>
</xsd:complexType>

<xsd:complexType name="lexicon">
  <xsd:attribute name="uri" type="xsd:anyURI" use="required"/>
  <xsd:attribute name="type" type="xsd:string"/>
</xsd:complexType>

2. Suggestions
a. The prose says "The metadata and lexicon elements must occur
   before all other elements and text contained within the root speak
   element" but this is not enforced in the schema. The schema cannot
   enforce that the metadata and lexicon children must appear before
   any text, but it can enforce that they must appear before other
   children, by changing speak.class as in:

<xsd:group name="speak.class">
  <xsd:sequence>
    <xsd:choice minOccurs="0" maxOccurs="unbounded">
      <xsd:element name="metadata" type="metadata"/>
      <xsd:element name="lexicon" type="lexicon"/>
    </xsd:choice>
    <xsd:group ref="sentenceAndStructure.class"
               minOccurs="0" maxOccurs="unbounded"/>
  </xsd:sequence>
</xsd:group>

b. Since the name attribute of the mark element is declared to be
   of type ID, it may be useful to point out in the prose that this
   means that it must be unique within the containing document and
   must conform to the lexical rules for an NCName.

c. Since the name attribute of the voice element is really a
   whitespace-separated list of names, it may be better to give
   a type that represents a list of names, as in
<xs:simpleType name='voiceName'>
  <xs:restriction base='xs:token'>
    <xs:pattern value='\S+'/>
  </xs:restriction>
</xs:simpleType>
<xs:simpleType name='listOfVoiceName'>
  <xs:list itemType='voiceName'/>
</xs:simpleType>
 
d. Most of the simple types in this schema that are derived from
   string would be more easily used if they were derived from
   token. This would allow leading and trailing whitespace
   to be collapsed.

e. The type "number" was defined as a restriction of string.  It seems
   more natural to us that it should be defined as a restriction of
   decimal, as in:

<xsd:simpleType name="number">
  <xsd:annotation>
    <xsd:documentation>
      number: e.g. 10, 5.5, 1.5, 9., .45
    </xsd:documentation>
  </xsd:annotation>
  <xsd:restriction base="xsd:decimal">
    <xsd:minInclusive value="0"/>
  </xsd:restriction>
</xsd:simpleType>


3. Minor problems with the examples
a. The example under "Pitch contour" in section 2.2.4 has the values
   +20 and +10 which appear to be invalid according to both the schema
   and the prose in the recommendation.
b. The examples starting in section 3.3 are all missing the required
   version attribute of the speak element.
c. The second and third examples in Appendix A are missing an end
   quote on the encoding in the XML declaration.
Analysis:
Scott says these are all fine except for 2d.

Proposed disposition: Accepted/Rejected

We have accepted and will apply all of your comments and suggestions except for 2d. For consistency with the other (related) specifications from the Voice Browser Working Group, we will retain the current restriction against leading and trailing whitespace in enumerated attribute values.

Email Trail:



Issue SSCR151-1

From Dave Pawson

I'm certainly not happy with the response below.
From our three years' experience with synthetic speech it is blatantly clear
that "As long as there is a way to write the text, the engine can figure out
how to speak it." produces gibberish in many cases.

This is the basis for the external 'speak as' file. The synth
can usually speak a word reasonably if 'taught' by such a
method.

Fine if the end user can glance at a piece of text, but a lot
more important if the audio is the only access the user has to information.
Analysis:
VBWG rejected this -- it's already possible.
Dave rejected our response but deferred to whatever WAI decides.

Proposed disposition: Rejected

We believe there is a misunderstanding that is simple to correct. The specification already provides the ability to adjust pronunciation, both internally via the <phoneme> element and externally via a lexicon. We agree there are times when one needs a lexicon. When better pronunciations for words are placed in an external lexicon, the processor will automatically use the lexicon's values over its own defaults without any additional markup (beyond a single <lexicon> element at the top of the document that points to the lexicon definition file).
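A minimal sketch (the lexicon URI is hypothetical, and the format of the lexicon document itself is outside the scope of SSML 1.0):

<speak version="1.0" xml:lang="en-US"
       xmlns="http://www.w3.org/2001/10/synthesis">
  <lexicon uri="http://www.example.com/pronunciations.lex"/>
  <!-- the processor prefers pronunciations from the lexicon
       over its own defaults for every token that follows -->
  The quick brown fox jumps over the lazy dog.
</speak>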

Email Trail: