Copyright © 2005 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark, document use, and software licensing rules apply.
This document details the responses made by the Voice Browser Working Group to issues raised during the Last Call (beginning 28 July 2004 and ending 1 September 2004) review of Voice Extensible Markup Language (VoiceXML) Version 2.1. Comments were provided by Voice Browser Working Group members, other W3C Working Groups, and the public via the www-voice-request@w3.org (archive) mailing list.
This document of the W3C's Voice Browser Working Group describes the disposition of comments as of 10th May 2005 on the Last Call Working Draft of Voice Extensible Markup Language (VoiceXML) Version 2.1. It may be updated, replaced or rendered obsolete by other W3C documents at any time.
For background on this work, please see the Voice Browser Activity Statement.
This document describes the disposition of comments in relation to Voice Extensible Markup Language (VoiceXML) Version 2.1 (http://www.w3.org/TR/2004/WD-voicexml21-20040728/). Each issue is described by the name of the commentator, a description of the issue, and either the resolution or the reason that the issue was not resolved.
The full set of issues raised against Voice Extensible Markup Language (VoiceXML) Version 2.1 since July 2004, together with their resolutions and, in most cases, the reasoning behind them, is available from http://www.w3.org/Voice/Group/2005/voicexml21-cr.html [W3C Members Only]. This document provides the analysis of the issues that were submitted and resolved as part of the Last Call Review. It includes issues that were submitted outside the official review period, up to 13th January 2005.
Notation: Each original comment is tracked by a "(Change) Request" [R] designator. Each point within that original comment is identified by a point number. For example, "R5-1" is the first point in the fifth change request for the specification.
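The designator notation can be read mechanically. The following is a minimal sketch, assuming designators always have the form "R", a request number, a hyphen, and a point number; the names `Designator` and `parse_designator` are hypothetical and do not come from the Working Group's tooling.

```python
import re
from typing import NamedTuple


class Designator(NamedTuple):
    request: int  # change request number, e.g. 5 in "R5-1"
    point: int    # point number within that request, e.g. 1 in "R5-1"


# Assumed shape: "R<request>-<point>", both parts decimal integers.
_DESIGNATOR = re.compile(r"R(\d+)-(\d+)")


def parse_designator(text: str) -> Designator:
    """Split a designator such as "R5-1" into its request and point numbers."""
    match = _DESIGNATOR.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a change-request designator: {text!r}")
    return Designator(int(match.group(1)), int(match.group(2)))
```

For example, `parse_designator("R104-7")` yields request 104, point 7, which corresponds to the seventh point raised in change request 104.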
Item | Commentator | Nature | Disposition |
---|---|---|---|
R62-1 | Teemu Tingander | Change to Existing Feature | Unknown |
R63-1 | Teemu Tingander | Change to Existing Feature | Unknown |
R64-1 | Teemu Tingander | Change to Existing Feature | Unknown |
R65-1 | Teemu Tingander | Clarification / Typo / Editorial | Unknown |
R65-2 | Teemu Tingander | Change to Existing Feature | Unknown |
R66-1 | Robert Keiller | Feature Request | Unknown |
R67-1 | Robert Keiller | Change to Existing Feature | Unknown |
R104-1 | Ken Waln | Change to Existing Feature | Accepted |
R104-2 | Ken Waln | Clarification / Typo / Editorial | Accepted |
R104-3 | Ken Waln | Clarification / Typo / Editorial | Accepted |
R104-4 | Ken Waln | Clarification / Typo / Editorial | Accepted |
R104-5 | Ken Waln | Clarification / Typo / Editorial | Accepted |
R104-6 | Ken Waln | Change to Existing Feature | Accepted |
R104-7 | Ken Waln | Feature Request | Accepted |
R84-1 | Dan Connoly | Clarification / Typo / Editorial | Unknown |
R85-1 | Dan Connoly | Change to Existing Feature | Accepted |
R106-1 | Janina Sajka | Clarification / Typo / Editorial | Accepted |
R103-1 | Tobias Gobel | Clarification / Typo / Editorial | Accepted |
R86-1 | Dominique Hazael-Massieux | Clarification / Typo / Editorial | Accepted |
R87-1 | Dominique Hazael-Massieux | Clarification / Typo / Editorial | Accepted |
R88-1 | Dominique Hazael-Massieux | Clarification / Typo / Editorial | Accepted |
R89-1 | Dominique Hazael-Massieux | Clarification / Typo / Editorial | Accepted |
R90-1 | Dominique Hazael-Massieux | Clarification / Typo / Editorial | Accepted |
R91-1 | Dominique Hazael-Massieux | Clarification / Typo / Editorial | Accepted |
R92-1 | Dominique Hazael-Massieux | Clarification / Typo / Editorial | Accepted |
R93-1 | Dominique Hazael-Massieux | Clarification / Typo / Editorial | Accepted |
R94-1 | Dominique Hazael-Massieux | Clarification / Typo / Editorial | Accepted |
R105-1 | James Wilson | Clarification / Typo / Editorial | Unknown |
R107-1 | Martin Duerst | Clarification / Typo / Editorial | Accepted |
R107-2 | Martin Duerst | Clarification / Typo / Editorial | Accepted |
R107-3 | Martin Duerst | Clarification / Typo / Editorial | Accepted |
R107-4 | Martin Duerst | Change to Existing Feature | Accepted |
From Teemu Tingander:
As a general comment for <data> elements DOM mapping; I don't see why we should add more complex programming capabilities into voicexml and once again make it possible to move the complex application logic into UI side !
Resolution: rejected
Email Trail:
From Ken Waln:
In section 3, I do not see the value of this construct. The example shows it being used to pass a parameter to the script being included, but since the script functions by definition accept parameters, why pass in a parameter to load different scripts? In general this smacks of self-modifying code a little bit. If you are going to call functions in a script file, the function definitions should be included statically. Maybe another example could convince me. The best example I can come up with would be a set of language specific includes, but I think that can be handled better in other ways as well.
Resolution: accepted
Email Trail:
From Ken Waln:
I agree with comments that the data element seems to encourage designing far too much of an application as client-side script instead of using an n-tier model. I would prefer it not be added. At a minimum it should be optional as it should not be encouraged as an appropriate design pattern.
Resolution: rejected
Email Trail:
From Ken Waln:
If the data element is needed, I think the VBWG should avoid defining its own data access protocol like this. The current design encourages too much interdependency between the VoiceXML document and the XML service. Defining a new protocol like this opens up lots of work to be done in the areas of versioning, security, etc. I recommend replacing it with a mechanism to call SOAP web service (if it is left in at all) with clearly specified parameters and return values.
Resolution: rejected
Email Trail:
From Ken Waln:
The access control on the data element does not seem to be very secure. It seems to assume the browser is a trusted entity (since the credentials are fetched along with any sensitive data, the browser already has the sensitive data). I suppose it is trying to protect against malicious VoiceXML in a hosted environment, but that is only one deployment option for a VoiceXML browser. I think any security needs to be removed from the XML and moved into lower levels of the protocol. Perhaps supplying credentials for a web server level validation is enough.
Resolution: rejected
Email Trail:
From Dan Connoly:
The only reference I can find to HTML4 in the text is "the <script> element allows the specification of a block of client-side scripting language code, and is analogous to the [HTML4] <SCRIPT> element." -- http://www.w3.org/TR/2004/WD-voicexml21-20040728/#sec-script_expr
That looks informative, to me.
Why is HTML4 listed among the normative references?
Resolution: accepted
Email Trail:
From Janina Sajka:
On behalf of the Protocols and Formats Working Group (WAI) We are concerned that the security provisions specified in Appendix E, "Securing access to <data>" would negatively impact accessibility.
It is reasonable to believe that various agencies and service organizations might create specialized scripts to better meet the interface needs of certain populations of persons with disabilities who cannot directly use a voice-based service without special accommodation. Indeed, we believe such enhanced interfaces could provide access to information and services where it does not exist today. Protecting this opportunity is important.
The mechanism outlined in Appendix E, however, tends to limit access to organizations known to the organization hosting the VoiceXML application. Agencies serving persons with disabilities, however, are likely to be unknown and of lesser commercial impact. It is likely, therefore, that agencies serving persons with disabilities would find it difficult to be listed.
Furthermore, the mechanism specified in Appendix E would require agencies serving persons with disabilities to seek listing with every VoiceXML application host individually. This is burdensome and likely to result in spotty accessibility support at best.
We would suggest the security control provisions be reconsidered to provide for authenticated access vouched for and certified by a third-party trust broker. While such services may not be commonplace today, we believe numerous use case scenarios exist for such services, beyond the current instance.
Resolution: accepted
Email Trail:
From Tobias Gobel:
having studied the WD for VXML 2.1 a bit, I came across a problem concerning the new <mark> element support.
As for the attribute "marktime", the spec says:
"The number of milliseconds that elapsed since the last mark was executed by the SSML processor until barge-in occurred or the end of audio playback occurred. If no mark was executed, this variable is undefined."
Does this mean that if the caller does no barge-in, marktime is the same as the duration of the prompt itself ("...until the end of audio playback occurred")? I wonder why, since this would not allow to check how long it took the caller to react to a prompt. What I want, e.g. in order to be able to draw conclusions from reaction times to caller status (first caller, power user etc.), is to know the time that elapsed since the end of the last prompt (until "timeout" elapses or caller says something).
This could be achieved either by not stopping the timer after the end of audio playback, or by allowing to set an additional marker at the very end of a prompt and check markname and marktime of *this* marker later on (e.g. in the <filled> section). If the latter IS possible, you might want to adapt the spec slightly, pointing to this possibility.
Resolution: deferred
Email Trail:
From Dominique Hazael-Massieux:
the conformance section of the document uses terms like 'may', 'must', 'recommended', etc, but without reference to RFC 2119 nor is there any definition of how these should be interpreted; is that on purpose?
Resolution: accepted
Email Trail:
From Dominique Hazael-Massieux:
the conformance labels (VoiceXML document, VoiceXML processor) don't make references to the version of VoiceXML; is that intended?
Resolution: accepted
Email Trail:
From Dominique Hazael-Massieux:
it's not obvious from reading voicexml2.0 (nor voicexml2.1) what a voicexml processor should do with a VXML document with a version that it doesn't know; if it should throw an error, I wonder how this relates to the claim that VoiceXML2.1 is backwards compatible with VoiceXML2.0
Resolution: accepted
Email Trail:
From Dominique Hazael-Massieux:
it's not clear which sections are normative and which are simply informative
Resolution: accepted
Email Trail:
From Dominique Hazael-Massieux:
the notion of XML well-formed document is bound to XML 1.0 in the spec; is there any discussion on accepting also XML 1.1?
Resolution: deferred
Email Trail:
From Dominique Hazael-Massieux:
the references to XML 1.0 are outdated (latest version is from February 2004)
Resolution: accepted
Email Trail:
From Dominique Hazael-Massieux:
this may be planned for a more advanced draft, but having a table with all the elements and attributes defined by VoiceXML 2.1 would be great (like in HTML 4.01 [3])
Resolution: accepted
Email Trail:
From Dominique Hazael-Massieux:
the example in section 9.3 is not well-formed (missing ending '>' in the root element) [this was found out by extracting the examples from the spec using an XSLT [4]; when the schema/dtd are published, it would be nice to re-use this trick to check that the examples and the formal languages are in sync]
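The kind of check described in this comment can be approximated without XSLT. The sketch below assumes the examples have already been extracted from the specification as strings; it only tests XML well-formedness, not validity against a schema or DTD. The function name `well_formedness_error` and both sample documents are hypothetical, and the malformed sample merely imitates a missing '>' in a root element rather than reproducing the section 9.3 example.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(example: str):
    """Return None if the example parses as XML, else the parser's message."""
    try:
        ET.fromstring(example)
    except ET.ParseError as err:
        return str(err)
    return None


# Hypothetical samples: the second omits the '>' that closes the root start tag.
good_example = '<vxml version="2.1"><form/></vxml>'
bad_example = '<vxml version="2.1"<form/></vxml>'

for label, example in (("good", good_example), ("bad", bad_example)):
    problem = well_formedness_error(example)
    print(label, "is well-formed" if problem is None else f"fails: {problem}")
```

Running such a check over every extracted example would catch this class of error before publication; checking against the schema or DTD would need a validating parser, which this sketch does not attempt.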
Resolution: accepted
Email Trail:
From Dominique Hazael-Massieux:
data_sec: is there any reason why this is done in a processing instruction? processing instructions aren't very scalable, have an odd place in the XML infoset, among other things... It looks to me like this security mechanism would be better addressed in a different place altogether - e.g. it would be more scalable to have a way to link to a security policy, rather than (or in addition to?) embedding in the document itself.
Resolution: accepted
Email Trail:
From James Wilson:
I have a question. Please can you explain why the mechanism defined in "7. Recording User Utterances While Attempting Recognition" returns a binary waveform rather than a URL that points to the waveform? Is it to avoid firewall issues?
This is in the context of MRCP 2 where a mechanism is provided to save waveforms on a recognition by recognition basis (using the save_waveform parameter). The waveforms are saved on the rec server. The MRCP recognition result does not return the binary waveform to the browser, but a URL that points to it. This would appear to be more efficient.
Resolution: rejected
Email Trail:
From Martin Duerst:
Abstract: "VoiceXML 2.1 specifies a set of features commonly implemented by Voice Extensible Markup Language platforms. This specification is designed to be fully backwards-compatible with VoiceXML 2.0 [VXML2]."
It is not clear to the reader quickly enough that this specification only describes a diff between VoiceXML 2.1 and VoiceXML 2.0. This should be made much clearer.
Resolution: accepted
Email Trail:
From Martin Duerst:
Appendix C: "A conforming VoiceXML document is a well-formed [XML] document that requires only the facilities described as mandatory in this specification and in [VXML2]."
Similar confusion as above. Either VoiceXML 2.1 is the diff, or it is the result of additions. But not both.
Resolution: rejected
Email Trail:
From Martin Duerst:
Section 2, street example: In usual Web browsers, for internationalization reasons, usually 'address1', 'address2', are used. Is there such practice for Voice applications? If not, how are addresses in various locations around the world handled? It would be highly desirable if this example were fixed so that it could be used as good practice worldwide. Same for citystate.
Resolution: accepted
Email Trail:
None.
From Teemu Tingander:
Chapter 2. Referencing Grammars Dynamically.
I propose the use of attribute srcexpr in <grammar> element. This will leave the expr attribute to be used to evaluate the "grammar" content from javascript content etc. Especially this is handy when data is introduced !
Resolution: accepted
Email Trail:
From Teemu Tingander:
Chapter 3. Referencing Scripts Dynamically.
Once again I propose the attribute srcexpr, just to distinguish between a 'value for the element' and a 'value that evaluates to an attribute value'.
Resolution: accepted
Email Trail:
From Teemu Tingander:
Chapter 3 Using <data> to Fetch XML Without Requiring a Dialog Transition
Once again I propose the attribute srcexpr. The expr attribute could be used as it is in <var>, since <data> is clearly some kind of extension of the <var> element.
Resolution: accepted
Email Trail:
From Teemu Tingander:
Using DOM in <data> is far too complex. I suggest finding a simpler structure for the returned data. We could use a simple pattern like:
<data name="temp" src....
and as returned:
<data>
  <property name="a" expr="1"/>
  <property name="b" expr="-1"/>
  <property name="c[0]" expr="'temp'"/>
  <property name="c[1]" expr="'tester'"/>
</data>
This could then be mapped into javascript:
var temp = {
  a: 1,
  b: -1,
  c: ['temp', 'tester']
};
And so on. It's easy to use it in this way. Somehow this could be done in VXML 2.0 with the script element too, or even with the same mapping we use in SSML to field values.
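The mapping proposed in this comment can be sketched as follows. This is an illustration under stated assumptions, not behaviour defined by VoiceXML 2.1: the property list is taken as already-parsed (name, expr) pairs, names are either plain identifiers or a single `name[index]` form, and `ast.literal_eval` stands in for ECMAScript expression evaluation, so only literal values are handled. The function name `build_object` is hypothetical.

```python
import ast
import re

# Assumed naming scheme: either "name" or "name[index]" with a decimal index.
_INDEXED = re.compile(r"(\w+)\[(\d+)\]")


def build_object(properties):
    """Fold (name, expr) pairs into a dict, collecting name[i] entries into lists."""
    result = {}
    for name, expr in properties:
        # literal_eval only accepts literals such as 1, -1 or 'temp'.
        value = ast.literal_eval(expr)
        indexed = _INDEXED.fullmatch(name)
        if indexed is None:
            result[name] = value
            continue
        key, index = indexed.group(1), int(indexed.group(2))
        items = result.setdefault(key, [])
        # Pad with None so out-of-order indices still land in the right slot.
        while len(items) <= index:
            items.append(None)
        items[index] = value
    return result


temp = build_object([
    ("a", "1"),
    ("b", "-1"),
    ("c[0]", "'temp'"),
    ("c[1]", "'tester'"),
])
```

With the four properties from the comment, `temp` holds the scalar values under `a` and `b` and a two-element list under `c`.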
Resolution: rejected
Email Trail:
From Robert Keiller:
Teemu Tingander raises a good question about the naming of the expr attributes for <script> and <grammar>. (Logically the expr attributes on <audio>, <next> and <submit> should also be srcexpr. <subdialog> already uses srcexpr, but expr in this case is an assignment of the subdialog variable, not a definition of the subdialog fetch.) Even if there is no immediate intention to support dynamic grammars via <grammar expr="..."/> where expr evaluates to an actual grammar, it seems a mistake to close off that possibility in future.
Resolution: accepted
Email Trail:
From Ken Waln:
In Section 2, I agree with the comments on this list that "srcexpr" is a better attribute name, both for consistency and in case someday it is desired to use an expression to be the content of the element rather than the source. I would not advocate adding the "expr" attribute in addition as I'd rather see a cleaner way of handling dynamic grammars than using script to put the entire grammar into a variable. How about allowing <value> in an inline XML grammar (although I realize this is more of an SRGS problem at that point or at least there would be an interaction)?
Resolution: accepted
Email Trail:
From Ken Waln:
Section 9 - "consultation" implies that a dialog occurs on the second call leg. If we want to allow that feature, I would also add a <connect> tag to complete the transfer. I think "monitored" or "monitoredblind" might describe it better. An alternative would be to drop this proposal and instead add an "answermode" attribute with values "immediate", "startvoice", "endvoice" etc. There are a lot of variations on single-line transfers and deciding when a call is complete. Far-end answer is not well defined in general, depending on the protocol - our platform currently offers these choices as configuration parameters but sometimes it is necessary to set on a call by call basis (e.g. an international number might behave differently).
Resolution: deferred
Email Trail:
From Dan Connoly:
I'm surprised by... "If the XML document specifies an processing instruction, access to the data is allowed based on the following algorithm: ..." -- http://www.w3.org/TR/2004/WD-voicexml21-20040728/#sec-data-security
Last time a processing instruction was used in a W3C spec, it was allowed only after considerable debate...
"The use of XML processing instructions in this specification should not be taken as a precedent. The W3C does not anticipate recommending the use of processing instructions in any future specification." -- http://www.w3.org/1999/06/REC-xml-stylesheet-19990629/
I suggest using a namespace-qualified element or attribute instead.
Resolution: accepted
Email Trail:
From Martin Duerst:
URIs: The XML Schema at http://www.w3.org/TR/2004/WD-voicexml21-20040728/vxml-datatypes.xsd containing the segment:
<xsd:simpleType name="URI.datatype">
  <xsd:annotation>
    <xsd:documentation>URI (RFC2396)</xsd:documentation>
  </xsd:annotation>
  <xsd:restriction base="xsd:anyURI"/>
</xsd:simpleType>
seems to try to restrict anyURIs used in VXML to URIs only. However, there are two problems with this approach:
- This is a very poor way of trying to make this restriction; if the restriction is indeed to be made, an actual pattern should be specified.
- Such a restriction would rule out the use of IRIs, which would be a very bad idea with respect to internationalization.
So we request that you:
- (possibly) add a restriction that just removes space and a few other ASCII characters allowed in anyURI, but neither in URIs nor in IRIs.
- Say clearly in the spec that wherever the term URI is used, this isn't restricted to ASCII only, but follows IRIs.
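The first request above, excluding only characters that are valid in neither URIs nor IRIs, can be sketched as a simple check. The forbidden set below is an assumption made for illustration (space, a handful of printable ASCII delimiters, and ASCII control characters); the comment does not enumerate the characters, and the function name `forbidden_characters` is hypothetical. Non-ASCII characters are deliberately left alone so that IRIs pass.

```python
# Assumed set: printable ASCII characters taken to be outside both the URI
# and the IRI grammars. This list is an illustration, not quoted from a spec.
_FORBIDDEN = set(' <>"{}|\\^`')


def forbidden_characters(reference: str) -> list:
    """Return the sorted, de-duplicated characters that the check rejects."""
    found = set()
    for ch in reference:
        # Reject the assumed delimiter set plus ASCII control characters.
        if ch in _FORBIDDEN or ord(ch) < 0x20 or ord(ch) == 0x7F:
            found.add(ch)
    return sorted(found)


# Hypothetical references: the first is an IRI with a non-ASCII path segment.
print(forbidden_characters("http://example.org/größe.grxml"))
print(forbidden_characters("http://example.org/my file.grxml"))
```

An empty result means the reference passes the check; in an XML Schema the same idea would be expressed as a pattern facet on the anyURI restriction rather than as code.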
Resolution: accepted
Email Trail:
From Robert Keiller:
I am slightly disappointed that the support for <mark> does not go further and support client side audio control. application.lastresult$.marktime will support very simple audio control by sending the marktime as a url parameter in an audio request and having the application server apply offsets to the original audio file. However, there are two important cases where this will not work:
- TTS prompts
- where several audio files have been queued together (putting a mark on every prompt in the queue and restarting the prompt queue from the last mark would be very awkward solution)
I believe that several voice browsers already support greater functionality via non-standard extensions and it seems a pity that these could not be standardised in VoiceXML 2.1.
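The simple audio control mentioned in this comment, sending the mark time as a URL parameter so the application server can apply an offset, might look like the sketch below. It assumes a single mark placed at the start of the audio file, so that the mark time equals the offset into the file. The parameter name `offset`, the function name `audio_url_with_offset`, and the example URL are hypothetical; the server-side handling of the offset is not shown.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def audio_url_with_offset(base_url: str, marktime_ms: int) -> str:
    """Append an assumed 'offset' query parameter (milliseconds) to an audio URL."""
    parts = urlsplit(base_url)
    # Keep any query parameters the URL already carries.
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("offset", str(marktime_ms)))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical resume request after a barge-in 4200 ms past the mark.
resume_url = audio_url_with_offset("http://example.org/audio/news.wav", 4200)
```

As the comment notes, this approach does not extend to TTS prompts or to several queued audio files, because neither maps to a single file that a server could seek within.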
Resolution: deferred
Email Trail:
From Ken Waln:
Could add more event values for completion: "SIT" (special information tone), "answeringmachine", etc. Are these implicitly allowed as platform specific return values or does it need to be explicit?
Resolution: deferred
Email Trail: