W3C

HTML Speech XG
Use Cases and Requirements

W3C Draft 3 February 2011

This version:
This Version
Latest version:
Current Version
Previous version:
Previous Version
Editors:
Michael Bodell, Microsoft mbodell@microsoft.com (Editor-in-Chief)
Bjorn Bringert, Google bringert@google.com
Robert Brown, Microsoft Robert.Brown@microsoft.com
Dave Burke, Google daveburke@google.com
Dan Burnett, Voxeo dburnett@voxeo.com
Deborah Dahl, W3C Invited Experts dahl@conversational-technologies.com
Chaitanya Gharpure, Google chaitanyag@google.com
Eric Johansson, ??? esj@harvee.org
Michael Johnston, AT&T johnston@research.att.com
James Larson, W3C Invited Experts jim@larson-tech.com
Anthapu Reddy, Openstream areddy@openstream.com
Satish Sampath, Google satish@google.com
Marc Schroeder, German Research Center for Artificial Intelligence (DFKI) Gmbh marc.schroeder@dfki.de
Milan Young, Nuance Communications milan.young@nuance.com
E.J. Zufelt, ??? lists@zufelt.ca

Abstract

This document specifies usage scenarios, goals and requirements for incorporating speech technologies into HTML. Speech technologies include both speech recognition and related technologies as well as speech synthesis and related technologies.

Status of this document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This is the 3 February 2011 Internal Working Draft of the Use cases and Requirements for the HTML Speech Incubator. This document is produced from work by the W3C HTML Speech Incubator Group.

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is an internal draft document and may not even end up being officially published. It may also be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.



1 Introduction

The mission of the HTML Speech Incubator Group, part of the Incubator Activity, is to determine the feasibility of integrating speech technology in HTML5 in a way that leverages the capabilities of both speech and HTML (e.g., DOM) to provide a high-quality, browser-independent speech/multimodal experience while avoiding unnecessary standards fragmentation or overlap. This document represents the efforts of the HTML Speech Incubator Group to collect and review use cases and requirements. These use cases and requirements were collected on the group's public email alias, collated into one large list, and then refactored and structured into this document. They were then refined into what were called the first pass requirements. To judge how important each requirement is and how much support it has in the broader community, these first pass requirements were then prioritized. The prioritized requirements will help inform the discussion around the various proposals that the incubator group expects to receive, discuss, and produce. While these prioritized requirements will be the framework through which the group judges future change requests and proposals, not every use case or requirement will necessarily be handled by the proposals presented in the incubator group's final report.

2 Use cases

Speech technologies can be used to improve existing HTML applications by allowing for richer user experiences and enabling more natural modes of interaction. The use cases listed here are ones raised by the Incubator Group members and may not be an exhaustive list. The use cases must still be prioritized, and not every aspect of every use case will necessarily be supported by the end proposal of the Incubator Group. Rather, this set of use cases is meant to illustrate an interesting cross section of use cases that are deemed important by some Incubator Group members. The use cases are organized by whether they are primarily speech recognition only, primarily speech synthesis only, or integrated with both input and output.

2.1 Speech Recognition

The following use cases all depend primarily on speech recognition. In some cases the way the recognition result is presented is left open, and it could be rendered visually or with synthesized speech, but the primary purpose of the use case is speech recognition.

U1. Voice Web Search

The user can speak a query and get a result.

U2. Speech Command Interface

A Speech Command and Control Shell that allows multiple commands, many of which may take arguments, such as "call <number>", "call <person>", "calculate <math expression>", "play <song>", or "search for <query>".
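
As an illustration of the command-plus-argument structure described above, the following TypeScript sketch maps an already recognized transcript onto one of the example commands. It assumes the recognizer returns plain text; the `VoiceCommand` type, the `commandRules` table, and the `parseVoiceCommand` function are hypothetical names introduced here and are not part of any proposed API.

```typescript
// Hypothetical command shapes for the examples named in U2.
type VoiceCommand =
  | { kind: "call"; target: string }
  | { kind: "calculate"; expression: string }
  | { kind: "play"; song: string }
  | { kind: "search"; query: string }
  | { kind: "unknown"; utterance: string };

interface CommandRule {
  prefix: string; // spoken words that introduce the command
  build: (argument: string) => VoiceCommand;
}

// Assumption: each command is introduced by a fixed spoken prefix.
const commandRules: CommandRule[] = [
  { prefix: "search for ", build: (a) => ({ kind: "search", query: a }) },
  { prefix: "calculate ", build: (a) => ({ kind: "calculate", expression: a }) },
  { prefix: "call ", build: (a) => ({ kind: "call", target: a }) },
  { prefix: "play ", build: (a) => ({ kind: "play", song: a }) },
];

function parseVoiceCommand(utterance: string): VoiceCommand {
  const text = utterance.trim();
  const lower = text.toLowerCase();
  for (const rule of commandRules) {
    // Match the prefix case-insensitively, keep the argument's original casing.
    if (lower.indexOf(rule.prefix) === 0) {
      return rule.build(text.slice(rule.prefix.length).trim());
    }
  }
  return { kind: "unknown", utterance: text };
}

const parsedCall = parseVoiceCommand("Call 555 0100");
const parsedSearch = parseVoiceCommand("search for weather in Boston");
```

A grammar-driven recognizer would normally return the command and its argument as semantic results rather than requiring the application to re-parse text; the prefix table is only a stand-in for that step.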

U3. Domain Specific Grammars Contingent on Earlier Inputs

A use case exists around collecting multiple domain specific inputs sequentially where the later inputs depend on the results of the earlier inputs. For instance, changing which cities are in a grammar of cities in response to the user saying in which state they are located.
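
The state-then-city example can be sketched as follows: once the state is known, the application generates a city grammar for the next recognition turn. This is a minimal sketch, assuming the grammar is expressed in the XML form of SRGS; the `citiesByState` data, `escapeXml`, and `buildCityGrammar` are hypothetical names used only for illustration.

```typescript
// Illustrative data only; a real application would load this from its own back end.
const citiesByState: { [state: string]: string[] } = {
  washington: ["Seattle", "Spokane", "Tacoma"],
  oregon: ["Portland", "Salem", "Eugene"],
};

function escapeXml(text: string): string {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Returns an SRGS XML grammar listing the cities of the given state,
// or null when the state is not known to the application.
function buildCityGrammar(stateName: string): string | null {
  const key = stateName.trim().toLowerCase();
  if (!Object.prototype.hasOwnProperty.call(citiesByState, key)) {
    return null;
  }
  const items = citiesByState[key]
    .map((city) => "      <item>" + escapeXml(city) + "</item>")
    .join("\n");
  return [
    '<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"',
    '         xml:lang="en-US" mode="voice" root="city">',
    '  <rule id="city">',
    "    <one-of>",
    items,
    "    </one-of>",
    "  </rule>",
    "</grammar>",
  ].join("\n");
}

// Second turn: the user has just said "Oregon".
const oregonGrammar = buildCityGrammar("Oregon");
```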

U4. Continuous Recognition of Open Dialog

This use case is to collect free form spoken input from the user. This might be particularly relevant to an email system, for instance. When dictating an email, the user will continue to utter sentences until they're done composing their email. The application will provide continuous feedback to the user by displaying words within a brief period of the user uttering them. The application continues listening and updating the screen until the user is done. Sophisticated applications will also listen for command words used to add formatting, perform edits, or correct errors.

U5. Domain Specific Grammars Filling Multiple Input Fields

Many web applications incorporate a collection of input fields, generally expressed as forms, with some text boxes to type into and lists to select from, with a "submit" button at the bottom. For example, "find a flight from New York to San Francisco on Monday morning returning Friday afternoon" might fill in a web form with two input elements for origin (place & date), two for destination (place & time), one for mode of transport (flight/bus/train), and a command (find) for the "submit" button. The results of the recognition would end up filling all of these multiple input elements with just one user utterance. This application is valuable because the user just has to initiate speech recognition once to complete the entire screen.
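
The idea of one utterance filling several fields can be sketched in TypeScript as below. This is only a sketch under stated assumptions: the transcript is already available as text, a single regular expression stands in for a real grammar with semantic interpretation, and the `TripSearchForm` shape, `tripPattern`, and `fillTripForm` are hypothetical names.

```typescript
// Hypothetical form model: one property per input element on the page.
interface TripSearchForm {
  transportMode: string;
  fromPlace: string;
  toPlace: string;
  departing: string;
  returning: string;
}

// Stand-in for a grammar: "find a <mode> from <place> to <place> on <when> returning <when>".
const tripPattern =
  /^find an? (flight|bus|train) from (.+?) to (.+?) on (.+?) returning (.+)$/i;

// Returns the filled form, or null when the utterance does not fit the pattern.
function fillTripForm(utterance: string): TripSearchForm | null {
  const match = tripPattern.exec(utterance.trim());
  if (match === null) {
    return null;
  }
  return {
    transportMode: match[1].toLowerCase(),
    fromPlace: match[2],
    toPlace: match[3],
    departing: match[4],
    returning: match[5],
  };
}

const filledForm = fillTripForm(
  "find a flight from New York to San Francisco on Monday morning returning Friday afternoon"
);
```

Because every field is populated from one result, the application can choose to submit the form immediately or let the user review the filled values first.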

U6. Speech UI present when no visible UI need be present

Some speech applications are oriented around determining the user's intent before gathering any specific input, and hence their first interaction may have no visible input fields whatsoever, or may accept speech input that is far less constrained than the fields on the screen. For example, the user may simply be presented with the text "how may I help you?" (maybe with some speech synthesis or an earcon), and then utter their request, which the application analyzes in order to route the user to an appropriate part of the application. This isn't simply selection from a menu, because the list of options may be huge, and the number of ways each option could be expressed by the user is also huge. In any case, the speech UI (grammar) is very different from whatever input elements may or may not be displayed on the screen. In fact, there may not even be any visible non-speech input elements displayed on the page.

U7. Rerecognition

Some sophisticated applications will re-use the same utterance in two or more recognition turns in what appears to the user as one turn. For example, an application may ask "how may I help you?", to which the user responds "find me a round trip from New York to San Francisco on Monday morning, returning Friday afternoon". An initial recognition against a broad language model may be sufficient to understand that the user wants the "flight search" portion of the app. Rather than get the user to repeat themselves, the application will just re-use the existing utterance for the flight search recognition.

U8. Voice Activity Detection

Automatic detection of speech/non-speech boundaries is needed for a number of valuable user experiences such as "press once to talk" or "hands-free dialog". In press-once-to-talk, the user manually interacts with the app to indicate that the app should start listening. For example, they raise the device to their ear, press a button on the keypad, or touch a part of the screen. When they're done talking, the app automatically performs the speech recognition without the user needing to touch the device again. In hands-free dialog, the user can start and stop talking without any manual input to indicate when the application should be listening. The application and/or browser needs to automatically detect when the user has started talking, so it can initiate speech recognition. This is particularly useful for in-car or 10-foot usage (e.g. living room), or for people with disabilities.
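
To make the notion of speech/non-speech boundaries concrete, here is a minimal TypeScript sketch of an energy-threshold endpointer. It is a deliberately simple stand-in and not a technique prescribed by this document; production detectors are considerably more robust. The frame layout, the `threshold` and `hangoverFrames` parameters, and all names are assumptions made for illustration.

```typescript
// A detected stretch of speech; endFrame is exclusive.
interface SpeechSegment {
  startFrame: number;
  endFrame: number;
}

// Root-mean-square energy of one frame of audio samples.
function frameRms(frame: number[]): number {
  if (frame.length === 0) {
    return 0;
  }
  let sum = 0;
  for (const sample of frame) {
    sum += sample * sample;
  }
  return Math.sqrt(sum / frame.length);
}

// Speech starts at the first frame whose RMS reaches the threshold and ends
// once `hangoverFrames` consecutive frames fall below it.
function detectSpeechSegment(
  frames: number[][],
  threshold: number,
  hangoverFrames: number
): SpeechSegment | null {
  let start = -1;
  let lastVoiced = -1;
  for (let i = 0; i < frames.length; i++) {
    if (frameRms(frames[i]) >= threshold) {
      if (start < 0) {
        start = i;
      }
      lastVoiced = i;
    } else if (start >= 0 && i - lastVoiced >= hangoverFrames) {
      return { startFrame: start, endFrame: lastVoiced + 1 };
    }
  }
  // Audio ended while the user was still speaking, or no speech was found.
  return start >= 0 ? { startFrame: start, endFrame: lastVoiced + 1 } : null;
}

const quietFrame = [0.0, 0.01, -0.01, 0.0];
const loudFrame = [0.5, -0.4, 0.45, -0.5];
const exampleFrames = [
  quietFrame, quietFrame, loudFrame, loudFrame, loudFrame,
  quietFrame, loudFrame, quietFrame, quietFrame, quietFrame,
];
// The single quiet frame at index 5 is bridged by the hangover of 2 frames,
// so the segment covers frames 2 through 6.
const exampleSegment = detectSpeechSegment(exampleFrames, 0.1, 2);
```

The hangover keeps a short pause inside an utterance from ending recognition early, which matters for the press-once-to-talk experience described above.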

2.2 Speech Synthesis

The following use cases all depend primarily on speech synthesis. In some cases the source of the input to the synthesis is left open, and it could be the result of a spoken utterance, but the primary purpose of the use case is speech synthesis.

U9. Temporal Structure of Synthesis to Provide Visual Feedback

The application may wish to visually highlight the word or phrase that the application is synthesizing. Or, alternatively, the visual application may wish to coordinate the synthesis with animations of an avatar speaking or with appropriately timed slide transitions and thus need to know where in the reading of the synthesized text the application currently is. In addition, the application may wish to know where in a piece of synthesized text an interruption occurred and use the temporal feedback to tell.

U10. Hello World Use Case

The web page when loaded may wish to say a simple phrase of synthesized text such as "hello world".

2.3 Integrated Speech Recognition and Synthesis

The following use cases all depend on both speech recognition and speech synthesis. These richer multimodal use cases may rely more heavily on one modality, but fundamentally having both input and output is important to the use case.

U11. Speech Translation

The application can act as a translator between two individuals fluent in different languages. The application can listen to one speaker and understand the utterances in one language, can translate the spoken phrases into a different language, and then can speak the translation to the other individual.

U12. Speech Enabled Email Client

The application reads out subjects and contents of email and also listens for commands, for instance, "archive", "reply: ok, let's meet at 2 pm", "forward to bob", "read message". Some commands may relate to VCR like controls of the message being read back, for instance, "pause", "skip forwards", "skip back", or "faster". Some of those controls may include controls related to parts of speech, such as, "repeat last sentence" or "next paragraph".

U13. Dialog Systems

This use case covers the type of dialog that collects multiple pieces of information in either one turn or sequential turns in response to frequently synthesized prompts. Such dialogs might involve ordering a pizza or booking a flight route, complete with the system repeating back the choices the user said. This dialog system may well be represented by a VXML form or application that allows for control of the dialog. The VXML dialog may be fetched using XMLHttpRequest.

U14. Multimodal Interaction

The ability to mix and integrate input from multiple modalities such as by saying "I want to go from here to there" while tapping two points on a touch screen map.
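
One way to picture this integration is to pair each deictic word ("here", "there") with the tap closest to it in time. The TypeScript sketch below does exactly that and nothing more. It assumes the recognizer supplies per-word timestamps and the touch layer supplies tap timestamps on the same clock; the types and the `resolveDeictics` function are hypothetical.

```typescript
interface TimedWord {
  word: string;
  timeMs: number; // when the word was spoken
}

interface MapTap {
  x: number;
  y: number;
  timeMs: number; // when the tap occurred
}

interface ResolvedReference {
  word: string;
  tap: MapTap;
}

const deicticWords = ["here", "there"];

// Binds each deictic word to the nearest-in-time tap that is not yet used.
function resolveDeictics(words: TimedWord[], taps: MapTap[]): ResolvedReference[] {
  const used: boolean[] = taps.map(() => false);
  const resolved: ResolvedReference[] = [];
  for (const spoken of words) {
    if (deicticWords.indexOf(spoken.word.toLowerCase()) < 0) {
      continue;
    }
    let best = -1;
    for (let i = 0; i < taps.length; i++) {
      if (used[i]) {
        continue;
      }
      const gap = Math.abs(taps[i].timeMs - spoken.timeMs);
      if (best < 0 || gap < Math.abs(taps[best].timeMs - spoken.timeMs)) {
        best = i;
      }
    }
    if (best >= 0) {
      used[best] = true;
      resolved.push({ word: spoken.word, tap: taps[best] });
    }
  }
  return resolved;
}

// "I want to go from here to there" with two taps on the map.
const spokenWords: TimedWord[] = [
  { word: "I", timeMs: 0 }, { word: "want", timeMs: 200 },
  { word: "to", timeMs: 400 }, { word: "go", timeMs: 600 },
  { word: "from", timeMs: 800 }, { word: "here", timeMs: 1000 },
  { word: "to", timeMs: 1300 }, { word: "there", timeMs: 1600 },
];
const mapTaps: MapTap[] = [
  { x: 10, y: 20, timeMs: 1050 },
  { x: 300, y: 40, timeMs: 1650 },
];
const resolvedReferences = resolveDeictics(spokenWords, mapTaps);
```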

U15. Speech Driving Directions

A direction service that speaks turn-by-turn directions and accepts hands-free spoken instructions like "navigate to <address>" or "navigate to <business listing>" or "reroute using <road name>". Input from the location of the user may help the service know when to play the next direction. It is possible that the user is not able to see any output, so the service needs to regularly synthesize phrases like "turn left on <road> in <distance>".

3 First Pass Requirements

This section presents the group's consensus on a clearer expansion of the requirements that were developed in Appendix A. The requirements are not presented in priority order; that order is provided in section 4. Including a requirement in this section does not necessarily mean that everyone in the group agrees that it is an important requirement that MUST be addressed. The requirements are not in numeric order because they have been organized conceptually and have evolved over time from the initial requirements found in Appendix A, and any renumbering may be confusing when considering existing and historic discussions about the requirements.

3.1 Web Authoring Feature Requirements

This section covers requirements related to fundamental functionality required for scenarios and use cases.

3.1.1 Web Authoring Feature Speech System Requirements

This section covers web authoring feature requirements that are related to the whole speech system, that is to both the speech recognition and the speech synthesis system.

FPR8. User agent (browser) can refuse to use requested speech service.

This is part of the expansion of requirements 1, 15, 16, 22, and 31. It is expected that user agents should not refuse in the common case.

FPR11. If the web apps specify speech services, it should be possible to specify parameters.

This is part of the expansion of requirements 1, 15, 16, 22, and 31.

FPR12. Speech services that can be specified by web apps must include network speech services.

This is part of the expansion of requirements 1, 15, 16, 22, and 31.

FPR31. User agents and speech services may agree to use alternate protocols for communication.

This is a part of the expansion of requirement 18.

FPR32. Speech services that can be specified by web apps must include local speech services.

This is a part of the expansion of requirement 18.

FPR33. There should be at least one mandatory-to-support codec that isn't encumbered with IP issues and has sufficient fidelity & low bandwidth requirements.

This is a part of the expansion of requirement 18.

FPR40. Web applications must be able to use barge-in (interrupting audio and TTS output when the user starts speaking).

Notification of barge-in should be delivered in a timely manner after detection occurs. This is a part of the expansion of requirement 14.

FPR58. Web application and speech services must have a means of binding session information to communications.

This is a part of the expansion of mailing list discussions.

3.1.2 Web Authoring Feature Recognition Requirements

This section covers web authoring feature requirements that are related primarily to the speech recognition system.

FPR2. Implementations must support the XML format of SRGS and must support SISR.

This is one part of the evolution of requirement 27.

FPR4. It should be possible for the web application to get the recognition results in a standard format such as EMMA.

This is one part of the evolution of requirement 27.

FPR19. User-initiated speech input should be possible.

This is a part of the expansion of requirement 33 and requirement 29.

FPR21. The web app should be notified that capture starts.

Notification should be delivered in a timely manner after detection occurs. This is a part of the expansion of requirement 4.

FPR22. The web app should be notified that speech is considered to have started for the purposes of recognition.

Notification should be delivered in a timely manner after detection occurs. This is a part of the expansion of requirement 4.

FPR23. The web app should be notified that speech is considered to have ended for the purposes of recognition.

Notification should be delivered in a timely manner after detection occurs. This is a part of the expansion of requirement 4.

FPR24. The web app should be notified when recognition results are available.

Notification should be delivered in a timely manner after detection occurs. These results may be partial results and may occur several times. This is a part of the expansion of requirement 4.
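
Since results may be partial and may arrive several times, an application typically has to decide how each notification updates what the user sees. The TypeScript sketch below shows one plausible policy: a partial result replaces the previous partial, and a final result is committed. The `RecoUpdate` shape, its `isFinal` flag, and the `DictationTranscript` class are assumptions made for illustration and are not defined by this document.

```typescript
// Hypothetical payload of a "results available" notification.
interface RecoUpdate {
  text: string;
  isFinal: boolean;
}

class DictationTranscript {
  private committed: string[] = [];
  private pending = "";

  // Applies one notification to the transcript state.
  apply(update: RecoUpdate): void {
    if (update.isFinal) {
      this.committed.push(update.text);
      this.pending = "";
    } else {
      // A newer partial result supersedes the previous one.
      this.pending = update.text;
    }
  }

  // Text to show on screen: committed phrases followed by the current partial.
  display(): string {
    const parts = this.pending
      ? this.committed.concat([this.pending])
      : this.committed;
    return parts.join(" ");
  }
}

const dictation = new DictationTranscript();
dictation.apply({ text: "hello", isFinal: false });
dictation.apply({ text: "hello there", isFinal: false });
dictation.apply({ text: "Hello there.", isFinal: true });
dictation.apply({ text: "how", isFinal: false });
const shownText = dictation.display();
```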

FPR25. Implementations should be allowed to start processing captured audio before the capture completes.

This is a part of the expansion of requirement 17.

FPR26. The API to do recognition should not introduce unneeded latency.

This is a part of the expansion of requirement 17.

FPR27. Speech recognition implementations should be allowed to add implementation specific information to speech recognition results.

This is a part of the expansion of requirement 18.

FPR28. Speech recognition implementations should be allowed to fire implementation specific events.

This is a part of the expansion of requirement 18.

FPR34. Web application must be able to specify domain specific custom grammars.

This is a part of the expansion of requirement 7.

FPR35. Web application must be notified when speech recognition errors or non-matches occur.

The intent is that this requirement covers errors like bad format of grammars but also covers no input and no matches. Notification should be delivered in a timely manner after detection occurs. This is a part of the expansion of requirement 5.

FPR42. It should be possible for user agents to allow hands-free speech input.

This is a part of the expansion of requirement 24.

FPR43. User agents should not be required to allow hands-free speech input.

This is a part of the expansion of requirement 24.

FPR47. When speech input is used to provide input to a web app, it should be possible for the user to select alternative input methods.

This is a part of the expansion of requirement 23.

FPR48. Web application author must be able to specify a domain specific statistical language model.

This is a part of the expansion of requirement 12.

FPR50. Web applications must not be prevented from integrating input from multiple modalities.

This is a part of the expansion of requirement 11.

FPR54. Web apps should be able to customize all aspects of the user interface for speech recognition, except where such customizations conflict with security and privacy requirements in this document, or where they cause other security or privacy problems.

Other security and privacy requirements include the following FPR: 1, 10, 16, 17, 18, and 20. This is a part of the expansion of requirement 13.

FPR56. Web applications must be able to request NL interpretation based only on text input (no audio sent).

This is a part of the expansion of mailing list discussions.

FPR57. Web applications must be able to request recognition based on previously sent audio.

This is a part of the expansion of mailing list discussions.

FPR59. While capture is happening, there must be a way for the web application to abort the capture and recognition process.

This is a part of the expansion of mailing list discussions.

3.1.3 Web Authoring Feature Synthesis Requirements

This section covers web authoring feature requirements that are related primarily to the speech synthesis system.

FPR3. Implementation must support SSML.

This is one part of the evolution of requirement 27.

FPR29. Speech synthesis implementations should be allowed to fire implementation specific events.

This is a part of the expansion of requirement 18.

FPR41. It should be easy to extend the standard without affecting existing speech applications.

This is a part of the expansion of requirement 25.

FPR46. Web apps should be able to specify which voice is used for TTS.

This is a part of the expansion of requirement 20.

FPR51. The web app should be notified when TTS playback starts.

Notification should be delivered in a timely manner after playback begins. This is a part of the expansion of requirement 9.

FPR52. The web app should be notified when TTS playback finishes.

Notification should be delivered in a timely manner after playback finishes. This is a part of the expansion of requirement 9.

FPR53. The web app should be notified when the audio corresponding to a TTS <mark> element is played back.

Notification should be delivered in a timely manner after the mark is played back. This is a part of the expansion of requirement 9.
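
As a sketch of how mark notifications could drive word highlighting, the TypeScript below places an SSML `<mark>` before each word and maps a mark name back to a word index. The `w<index>` naming scheme and the functions `buildMarkedPrompt` and `wordIndexForMark` are assumptions for illustration; how the notification itself is delivered is left to the eventual API.

```typescript
function escapeForXml(text: string): string {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

interface MarkedPrompt {
  ssml: string;
  words: string[];
}

// Wraps a sentence in SSML with one <mark name="wN"/> before the Nth word.
function buildMarkedPrompt(sentence: string): MarkedPrompt {
  const words = sentence.split(/\s+/).filter((w) => w.length > 0);
  const body = words
    .map((w, i) => '<mark name="w' + i + '"/>' + escapeForXml(w))
    .join(" ");
  const ssml =
    '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">' +
    body +
    "</speak>";
  return { ssml: ssml, words: words };
}

// Maps a mark name reported by the synthesizer back to a word index,
// or -1 when the name does not follow the "w<index>" scheme.
function wordIndexForMark(markName: string): number {
  const match = /^w(\d+)$/.exec(markName);
  return match === null ? -1 : parseInt(match[1], 10);
}

const markedPrompt = buildMarkedPrompt("hello brave world");
// When the mark named "w2" is reported, the application highlights this word.
const highlightedWord = markedPrompt.words[wordIndexForMark("w2")];
```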

FPR60. Web application must be able to programmatically abort TTS output.

This is a part of the expansion of mailing list discussions.

3.2 Web Authoring Convenience Requirements

This section covers requirements related to making web authoring easy and convenient or exposing a recognition feature in a way that is consistent with web technologies.

3.2.1 Web Authoring Convenience Speech System Requirements

This section covers web authoring convenience requirements that are related to the whole speech system, that is to both the speech recognition and the speech synthesis system.

FPR7. Web apps should be able to request speech service different from default.

This is part of the expansion of requirements 1, 15, 16, 22, and 31.

FPR9. If browser refuses to use the web application requested speech service, it must inform the web app.

This is part of the expansion of requirements 1, 15, 16, 22, and 31.

FPR10. If browser uses speech services other than the default one, it must inform the user which one(s) it is using.

Note by "the default one" we mean the user agent default speech service. This is part of the expansion of requirements 1, 15, 16, 22, and 31.

FPR30. Web applications must be allowed at least one form of communication with a particular speech service that is supported in all UAs.

This evolved from: "The communication between the user agent and the speech server must require a mandatory-to-support lowest common denominator such as HTTP 1.1, TBD" which is part of the expansion of requirement 18.

3.2.2 Web Authoring Convenience Recognition Requirements

This section covers web authoring convenience requirements that are related primarily to the speech recognition system.

FPR5. It should be easy for the web apps to get access to the most common pieces of recognition results such as utterance, confidence, and nbests.

This is one part of the evolution of requirement 27.
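
To illustrate why convenient access to utterance, confidence, and n-best matters, the TypeScript sketch below turns an n-best list into an accept, confirm, or reject decision. The `RecoHypothesis` shape, the assumption that confidence is a number between 0 and 1, and the threshold parameters are all hypothetical and chosen for illustration.

```typescript
// Hypothetical n-best entry; confidence is assumed to lie in [0, 1].
interface RecoHypothesis {
  utterance: string;
  confidence: number;
}

type RecoDecision =
  | { action: "accept"; utterance: string }
  | { action: "confirm"; choices: string[] }
  | { action: "reject" };

// Accepts a confident top hypothesis, asks the user to choose among
// plausible alternatives otherwise, and rejects when nothing is plausible.
function decideFromNBest(
  nbest: RecoHypothesis[],
  acceptAt: number,
  rejectBelow: number
): RecoDecision {
  // Sort a copy so the caller's list is left untouched.
  const ranked = nbest.slice().sort((a, b) => b.confidence - a.confidence);
  if (ranked.length === 0 || ranked[0].confidence < rejectBelow) {
    return { action: "reject" };
  }
  if (ranked[0].confidence >= acceptAt) {
    return { action: "accept", utterance: ranked[0].utterance };
  }
  const choices = ranked
    .filter((h) => h.confidence >= rejectBelow)
    .map((h) => h.utterance);
  return { action: "confirm", choices: choices };
}

const cityDecision = decideFromNBest(
  [
    { utterance: "Austin", confidence: 0.55 },
    { utterance: "Boston", confidence: 0.62 },
    { utterance: "Houston", confidence: 0.2 },
  ],
  0.8,
  0.3
);
```

In this example no hypothesis reaches the accept threshold, so the application would ask the user to confirm between "Boston" and "Austin".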

FPR6. Browser must provide default speech resource.

This is part of the expansion of requirements 1, 15, 16, 22, and 31.

FPR36. User agents must provide a default interface to control speech recognition.

This is a part of the expansion of requirement 26.

FPR38. Web application must be able to specify language of recognition.

This is a part of the expansion of requirement 8.

FPR39. Web application must be able to be notified when the selected language is not available.

Notification should be delivered in a timely manner after detection occurs. This is a part of the expansion of requirement 8.

FPR44. Recognition without specifying a grammar should be possible.

I.e., free-form recognition. This is a part of the expansion of requirement 2.

FPR45. Applications should be able to specify the grammars (or lack thereof) separately for each recognition.

This is a part of the expansion of requirement 2.

3.2.3 Web Authoring Convenience Synthesis Requirements

This section covers web authoring convenience requirements that are related primarily to the speech synthesis system.

FPR13. It should be easy to assign recognition results to a single input field.

This is a part of the expansion of requirement 3.

FPR14. It should not be required to fill an input field every time there is a recognition result.

This is a part of the expansion of requirement 3.

FPR15. It should be possible to use recognition results to fill multiple input fields.

This is a part of the expansion of requirement 3.

FPR61. Aborting the TTS output should be efficient.

This is a part of the expansion of mailing list discussions.

3.3 Security and Privacy Requirements

This section covers requirements related to ensuring speech is allowed in a way that is in line with desired security and privacy requirements.

3.3.1 Security and Privacy Speech System Requirements

This section covers requirements related to ensuring speech in a speech system, that is with both the speech recognition and speech synthesis systems, is allowed in a way that is in line with desired security and privacy requirements.

FPR16. User consent should be informed consent.

This is a part of the expansion of requirement 33.

FPR20. The spec should not unnecessarily restrict the UA's choice in privacy policy.

This is a part of the expansion of requirement 33 and requirement 29.

FPR55. Web application must be able to encrypt communications to remote speech service.

This is a part of the expansion of mailing list discussions.

3.3.2 Security and Privacy Recognition Requirements

This section covers requirements related to ensuring speech in a recognition system is allowed in a way that is in line with desired security and privacy requirements.

FPR1. Web applications must not capture audio without the user's consent.

This is the evolution of requirement 29 and requirement 33.

FPR17. While capture is happening, there must be an obvious way for the user to abort the capture and recognition process.

This is a part of the expansion of requirement 33. Here the word abort should mean "as soon as you can, stop capturing, stop processing for recognition, and stop processing any recognition results".

FPR18. It must be possible for the user to revoke consent.

This is a part of the expansion of requirement 33.

FPR37. Web application should be given captured audio access only after explicit consent from the user.

This is a part of the expansion of requirement 28.

FPR49. End users need a clear indication whenever the microphone is listening to the user.

This is a part of the expansion of requirement 32.

3.3.3 Security and Privacy Synthesis Requirements

This section covers requirements related to ensuring speech in a synthesis system is allowed in a way that is in line with desired security and privacy requirements.

4 Prioritization of Requirements

The HTML Speech incubator group measured industry interest in the importance of the various requirements by surveying the membership. The complete results are available here.

A summary of the results is presented below, with requirements listed in priority order and segmented into those with strong interest, those with moderate interest, and those with mild interest.

4.1 Strong Interest

The strong interest requirements were ones that at least 80% of the group believed needed to be addressed by any specification developed based on the work of this group. These requirements are:

4.2 Moderate Interest

The moderate interest requirements were ones that fewer than 80% but at least 50% of the group believed needed to be addressed by any specification developed based on the work of this group. These requirements are:

4.3 Mild Interest

The mild interest requirements were ones that fewer than 50% of the group believed needed to be addressed by any specification developed based on the work of this group. These requirements are:
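
As an aside on the thresholds used throughout section 4, the three interest bands can be expressed as a small TypeScript function. This is a sketch only: the `interestBand` name and the vote-count inputs are hypothetical, and the sample calls use made-up counts rather than actual survey results.

```typescript
type InterestBand = "strong" | "moderate" | "mild";

// Classifies a requirement by the share of respondents who believed it
// needed to be addressed: at least 80% is strong, at least 50% is moderate,
// and anything lower is mild. Integer comparisons avoid rounding issues.
function interestBand(votesFor: number, votesCast: number): InterestBand {
  if (votesCast <= 0) {
    throw new Error("votesCast must be positive");
  }
  if (votesFor * 5 >= votesCast * 4) {
    return "strong";
  }
  if (votesFor * 2 >= votesCast) {
    return "moderate";
  }
  return "mild";
}

// Made-up counts, for illustration only.
const bandForEightOfTen = interestBand(8, 10);
const bandForSevenOfTen = interestBand(7, 10);
const bandForFourOfTen = interestBand(4, 10);
```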

Appendix

A Initial Requirements

The use cases motivate a number of requirements for integrating speech into HTML. The Incubator Group has members who initially felt that each of the requirements described below is essential to the language; however, this represents the requirements before any group evaluation, rewording, and prioritization. Each requirement should include a short description and should be motivated by one or more use cases from section 2 (not all relevant use cases may be listed). For convenience, the requirements are organized around different high level themes.

A.1 Web Authoring Feature Requirements

The following requirements are around features that HTML web authors require to build speech applications.

R1. Web author needs full control over specification of speech resources

The HTML web author must have control over specifying both the speech recognizing technology used and the speech parameters that go to the recognizer. In particular, this also means that it must be possible to do recognition on a networked speech recognizer and this should also mean that it is possible to have any user agent work with any vendor's speech services provided the specified open protocols are used. Also, any recognizer parameters or hints must be able to be specified by the web application author.

Relevant Use Cases Include: U1 Voice Web Search, U3 Domain Specific Grammars Contingent on Earlier Inputs, U5 Domain Specific Grammars Filling Multiple Input Fields, U7 Rerecognition, U11 Speech Translation.

R2. Application change from directed input to free form input

An application may be required to switch from grammar based recognition to free form recognition. For example, for simple inputs such as a date, yes/no, or a quantity, grammar based recognition might work fine, but for filling in a comments section we might want to use a free form recognizer.

Relevant Use Cases Include: U3 Domain Specific Grammars Contingent on Earlier Inputs, U4 Continuous Recognition of Open Dialog

R3. Ability to bind results to specific input fields

An application wants the results of matching a particular grammar or speech turn to fill a particular input field.

Relevant Use Cases Include: U3 Domain Specific Grammars Contingent on Earlier Inputs, U5 Domain Specific Grammars Filling Multiple Input Fields, U13 Dialog Systems, U14 Multimodal Interaction, U15 Speech Driving Directions.

R4. Web application must be notified when recognition occurs

When the speech recognition occurs the web application must be notified.

Relevant Use Cases Include: U8 Voice Activity Detection, U9 Temporal Structure of Synthesis to Provide Visual Feedback, U13 Dialog Systems, U14 Multimodal Interaction.

R5. Web application must be notified when speech recognition errors and other non-matches occur

When a recognition is attempted and either an error occurs, or an utterance doesn't provide a recognition match, or the system doesn't detect speech for a sufficiently long time (i.e., noinput), the web application must be notified.

Relevant Use Cases Include: U1 Voice Web Search, U3 Domain Specific Grammars Contingent on Earlier Inputs, U5 Domain Specific Grammars Filling Multiple Input Fields, U7 Rerecognition, U8 Voice Activity Detection, U13 Dialog Systems, U14 Multimodal Interaction.

R6. Web application must be provided with full context of recognition

Because speech recognition is by its nature imperfect and probabilistic, a set of additional metadata is frequently generated, including an n-best list of alternate suggestions, confidences of recognition results, and the semantic structure represented by recognition results. All of this data must be provided to the web application.

Relevant Use Cases Include: U3 Domain Specific Grammars Contingent on Earlier Inputs, U4 Continuous Recognition of Open Dialog, U5 Domain Specific Grammars Filling Multiple Input Fields, U7 Rerecognition, U11 Speech Translation, U12 Speech Enabled Email Client, U13 Dialog Systems, U14 Multimodal Interaction, U15 Speech Driving Directions.

R7. Web application must be able to specify domain specific custom grammars

It is necessary that the HTML author must be able to specify the grammars of their choosing and must not be restricted to only use grammars natively installed in the user-agent.

Relevant Use Cases Include: U1 Voice Web Search, U3 Domain Specific Grammars Contingent on Earlier Inputs, U4 Continuous Recognition of Open Dialog, U5 Domain Specific Grammars Filling Multiple Input Fields, U7 Rerecognition, U11 Speech Translation, U12 Speech Enabled Email Client, U13 Dialog Systems, U14 Multimodal Interaction, U15 Speech Driving Directions.

R8. Web application must be able to specify language of recognition

The HTML author must be able to specify the language of the recognition to be used for any given spoken interaction. This must be the case even if the language is different from that used in the content of the rest of the web page. This also may mean multiple different spoken language input elements are present in the same web page.

Relevant Use Cases Include: U1 Voice Web Search, U3 Domain Specific Grammars Contingent on Earlier Inputs, U4 Continuous Recognition of Open Dialog, U5 Domain Specific Grammars Filling Multiple Input Fields, U11 Speech Translation.

R9. Web application author provided synthesis feedback

The web application author must receive temporal and structural feedback about the synthesis of text.

Relevant Use Cases Include: U9 Temporal Structure of Synthesis to Provide Visual Feedback, U13 Dialog Systems, U14 Multimodal Interaction, U15 Speech Driving Directions.
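
The following non-normative TypeScript sketch illustrates how temporal feedback might drive visual highlighting. It assumes the synthesizer reports word boundaries as time offsets paired with character ranges; the WordBoundary shape and the function wordAt are hypothetical, and the boundaries are assumed to be sorted by time.

```typescript
// Hypothetical feedback record: when a word starts and where it sits in the text.
interface WordBoundary {
  offsetMs: number;   // time from the start of playback
  charStart: number;  // index of the first character of the word
  charEnd: number;    // index one past the last character of the word
}

// Return the word being spoken at timeMs, or null before the first word starts.
// Assumes boundaries are sorted by offsetMs in ascending order.
function wordAt(text: string, boundaries: WordBoundary[], timeMs: number): string | null {
  let current: WordBoundary | null = null;
  for (const b of boundaries) {
    if (b.offsetMs <= timeMs) {
      current = b;
    } else {
      break;
    }
  }
  return current === null ? null : text.slice(current.charStart, current.charEnd);
}

const prompt = "Turn left now";
const promptBoundaries: WordBoundary[] = [
  { offsetMs: 0, charStart: 0, charEnd: 4 },
  { offsetMs: 400, charStart: 5, charEnd: 9 },
  { offsetMs: 800, charStart: 10, charEnd: 13 },
];
```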

R10. Web application authors need to be able to use full SSML features

When rendering synthesized speech, HTML application authors need to be able to take advantage of features such as gender, language, pronunciations, etc.

Relevant Use Cases Include: U9 Temporal Structure of Synthesis to Provide Visual Feedback, U10 Hello World Use Case, U11 Speech Translation, U12 Speech Enabled Email Client, U13 Dialog Systems.
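
As a non-normative example, the TypeScript sketch below wraps plain text in an SSML document that selects a language and a voice gender. The names SpeakOptions, escapeSsmlText and buildSsml are hypothetical; only the SSML elements and attributes themselves come from the SSML specification.

```typescript
// Escape characters that are significant in XML text and attribute values.
function escapeSsmlText(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical options an author might want to control.
interface SpeakOptions {
  lang: string;                            // e.g. "en-US"
  gender: "male" | "female" | "neutral";   // values accepted by the SSML voice element
}

// Produce an SSML document that speaks the text with the requested language and gender.
function buildSsml(text: string, options: SpeakOptions): string {
  return (
    `<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" ` +
    `xml:lang="${escapeSsmlText(options.lang)}">` +
    `<voice gender="${options.gender}">${escapeSsmlText(text)}</voice>` +
    `</speak>`
  );
}
```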

R11. Web application author must integrate input from multiple modalities

The author may have multiple inputs that must be integrated to provide a quality user experience. For instance, the application might combine information from geolocation (to understand "here"), speech recognition, and touch to provide driving directions in response to the spoken phrase "Get me directions to this place" while tapping on a map. Since new modalities are continually becoming available, it would be difficult to provide for integration on a case by case basis in the user agent, so it must be easy for the web application author to provide the integration.

Relevant Use Cases Include: U9 Temporal Structure of Synthesis to Provide Visual Feedback, U14 Multimodal Interaction, U15 Speech Driving Directions.
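
The driving directions example can be sketched in application code as follows (non-normative TypeScript). The sketch assumes that speech and tap events both carry timestamps on a common clock and that a tap within one second of the utterance refers to "this place"; every type and function name here is hypothetical.

```typescript
interface Point { lat: number; lon: number; }
interface SpeechEvent { text: string; startMs: number; endMs: number; }
interface TapEvent { at: Point; timeMs: number; }
interface DirectionsRequest { from: Point; to: Point; }

// Combine speech, touch and geolocation into one request.
// "this place" is resolved to the first tap that overlaps the utterance
// (allowing slackMs on either side); the origin is the current position.
function integrateDirections(
  speech: SpeechEvent,
  taps: TapEvent[],
  currentPosition: Point,
  slackMs: number = 1000
): DirectionsRequest | null {
  if (speech.text.toLowerCase().indexOf("this place") === -1) {
    return null; // the utterance does not refer to a tapped location
  }
  for (const tap of taps) {
    const overlaps =
      tap.timeMs >= speech.startMs - slackMs && tap.timeMs <= speech.endMs + slackMs;
    if (overlaps) {
      return { from: currentPosition, to: tap.at };
    }
  }
  return null; // no tap accompanied the utterance
}
```

Keeping this logic in the application rather than the user agent means a new modality only requires a new event type, not a change to the browser.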

R12. Web application author must be able to specify a domain specific statistical language model

A typical approach for open dialog is to provide a statistical language model (or SLM) and use that to anticipate likely user dialog.

Relevant Use Cases Include: U1 Voice Web Search, U3 Domain Specific Grammars Contingent on Earlier Inputs, U4 Continuous Recognition of Open Dialog, U6 Speech UI present when no visible UI need be present, U7 Rerecognition, U11 Speech Translation, U12 Speech Enabled Email Client, U13 Dialog Systems.

R13. Web application author should have ability to customize speech recognition graphical user interface

Multimodal speech recognition apps are typically accompanied by a GUI experience to (i) provide a means to invoke SR; and (ii) indicate progress of recognition through various states (listening to the user speak; waiting for the recognition result; displaying errors; displaying alternates; etc). Polished applications generally have their own GUI design for the speech experience. This will usually include a clickable graphic to invoke speech recognition, and graphics to indicate the progress of the recognition through various states.

Relevant Use Cases Include: U1 Voice Web Search, U3 Domain Specific Grammars Contingent on Earlier Inputs, U4 Continuous Recognition of Open Dialog, U6 Speech UI present when no visible UI need be present, U7 Rerecognition, U11 Speech Translation, U12 Speech Enabled Email Client, U13 Dialog Systems, U14 Multimodal Interaction, U15 Speech Driving Directions.

R14. Web application authors need a way to specify and effectively create barge-in (interrupt audio and synthesis)

Authors need the ability to stop output (text-to-speech or media) in response to events (the user starting to speak, a recognition occurring, other events or selections or browser interactions, etc.) so that the web application user experience is acceptable and the web application doesn't appear confused or deaf to user input. Barge-in aids the usability of an application by allowing the user to provide spoken input even while the application is playing media/TTS. However, applications that both speak (or play media) and listen at the same time can potentially interfere with their own speech recognition. In telephony, this is less of a problem due to the design of the handset and built-in echo-cancelling technology. However, with the broad variety of HTML-capable devices, situations that involve an open microphone and an open speaker will potentially be more common. To help developers cope with this, it may be useful to either specify a minimum barge-in capability that all browsers should meet, or make it easier for developers to discover when barge-in may be an issue and allow appropriate parameter settings to help mitigate the situation.

Relevant Use Cases Include: U8 Voice Activity Detection, U9 Temporal Structure of Synthesis to Provide Visual Feedback, U11 Speech Translation, U12 Speech Enabled Email Client, U13 Dialog Systems, U14 Multimodal Interaction.
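
A minimal, non-normative TypeScript sketch of barge-in follows. It assumes the application is told when the user starts speaking and holds a handle to whatever is currently playing; the StoppableOutput interface and the function handleSpeechStart are hypothetical names.

```typescript
// Hypothetical handle to TTS or media output that is currently playing.
interface StoppableOutput {
  readonly playing: boolean;
  stop(): void;
}

// Called when the user starts to speak. Stops the output if barge-in is
// enabled and something is playing; returns true if output was interrupted.
function handleSpeechStart(output: StoppableOutput, bargeInEnabled: boolean): boolean {
  if (bargeInEnabled && output.playing) {
    output.stop();
    return true;
  }
  return false;
}
```

An application on a device with an open microphone and speaker might pass false for bargeInEnabled while its own prompt is playing, to avoid interrupting itself.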

A.2 Web Author Convenience and Quality

The following requirements do not provide any additional functionality to an HTML web author, but instead make the task of authoring a speech HTML page much easier or else convert the authored application into a much higher quality application.

R15. Web application authors must not need to run their own speech service

Running a speech service can be difficult, and a default speech interface/service is needed in the user agent so that a web application author can use speech resources without needing to run their own speech service.

Relevant Use Cases Include: U2 Speech Command Interface, U3 Domain Specific Grammars Contingent on Earlier Inputs, U5 Domain Specific Grammars Filling Multiple Input Fields.

R16. Web application authors must not be excluded from running their own speech service

Running a speech service can provide fine grained customization of the application for the web application author.

Relevant Use Cases Include: U1 Voice Web Search, U2 Speech Command Interface, U3 Domain Specific Grammars Contingent on Earlier Inputs, U4 Continuous Recognition of Open Dialog, U5 Domain Specific Grammars Filling Multiple Input Fields, U6 Speech UI present when no visible UI need be present, U7 Rerecognition, U8 Voice Activity Detection, U9 Temporal Structure of Synthesis to Provide Visual Feedback, U11 Speech Translation, U12 Speech Enabled Email Client, U13 Dialog Systems, U14 Multimodal Interaction, U15 Speech Driving Directions.

R17. User perceived latency of recognition must be minimized

The time between the user completing their utterance and an application providing a response needs to fall below an acceptable threshold to be usable. For example, "find a flight from New York to San Francisco on Monday morning returning Friday afternoon" takes about 6 seconds to say, but the user still expects a response within a couple of seconds (generally somewhere between 500 and 3000 milliseconds, depending on the specific application and audience). In the case of applications/browsers that invoke speech recognition over a network, the platform needs to support (i) using a codec that can be transmitted in real-time on the modest bandwidth of many cell networks and (ii) transmitting the user's utterance in real-time (e.g. in 100ms packets) rather than collecting the full utterance before transmitting any of it. For applications where the utterances are non-trivial and the grammars can be recognized in real-time or better, real-time streaming can all but eliminate user-perceived latency.

Relevant Use Cases Include: U1 Voice Web Search, U2 Speech Command Interface, U4 Continuous Recognition of Open Dialog, U5 Domain Specific Grammars Filling Multiple Input Fields, U6 Speech UI present when no visible UI need be present, U7 Rerecognition, U8 Voice Activity Detection, U11 Speech Translation, U12 Speech Enabled Email Client, U13 Dialog Systems, U14 Multimodal Interaction, U15 Speech Driving Directions.
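
The effect of streaming can be estimated with a simple, non-normative model, sketched below in TypeScript. The model and all of its parameters are assumptions made for illustration: the codec bit rate does not exceed the uplink bandwidth, the recognizer runs at or faster than real time, and a single network round trip is added to every request.

```typescript
// Hypothetical parameters of a networked recognition request.
interface LatencyParams {
  utteranceMs: number;     // length of the spoken utterance
  packetMs: number;        // audio per packet when streaming
  codecKbps: number;       // bit rate of the audio codec
  uplinkKbps: number;      // available upload bandwidth
  realTimeFactor: number;  // recognition time divided by audio time (1 = real time)
  roundTripMs: number;     // network round trip
}

// Whole utterance is collected first, then uploaded, then recognized.
function batchLatencyMs(p: LatencyParams): number {
  const uploadMs = (p.utteranceMs * p.codecKbps) / p.uplinkKbps;
  const recognizeMs = p.utteranceMs * p.realTimeFactor;
  return p.roundTripMs + uploadMs + recognizeMs;
}

// Audio is sent and recognized while the user speaks, so only the final
// packet remains to be uploaded and recognized once the user stops.
function streamingLatencyMs(p: LatencyParams): number {
  const uploadMs = (p.packetMs * p.codecKbps) / p.uplinkKbps;
  const recognizeMs = p.packetMs * p.realTimeFactor;
  return p.roundTripMs + uploadMs + recognizeMs;
}

// The 6 second utterance from the text, with assumed network and engine figures.
const flightQuery: LatencyParams = {
  utteranceMs: 6000, packetMs: 100, codecKbps: 16,
  uplinkKbps: 64, realTimeFactor: 0.5, roundTripMs: 200,
};
```

With these assumed figures the batch estimate is 4700 ms, outside the 500 to 3000 ms range quoted above, while the streaming estimate is 275 ms.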

R18. User perceived latency of synthesis must be minimized

For longer stretches of spoken output, it may be necessary to stream the synthesis without knowing the full rendering of the TTS/SSML nor the Content-Length of the rendered audio format. To enable high quality applications, user agents should support streaming of synthesis results without needing the content length header to be present with a correct full synthesized file length. For example, consider a TTS processor that can process one sentence at a time and is requested to read an email consisting of three paragraphs: with streaming, playback can begin as soon as the first sentence has been synthesized rather than after the whole email has been rendered.

Relevant Use Cases Include: U9 Temporal Structure of Synthesis to Provide Visual Feedback, U10 Hello World Use Case, U11 Speech Translation, U12 Speech Enabled Email Client, U13 Dialog Systems, U14 Multimodal Interaction, U15 Speech Driving Directions.
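
The sentence-at-a-time case can be sketched as follows (non-normative TypeScript). The functions splitSentences and streamSynthesis are hypothetical, the sentence splitter is deliberately naive, and the synthesize callback stands in for a real TTS engine. The point is only that each chunk is handed over before the total length is known.

```typescript
// Naive splitter: a sentence ends at a run of '.', '!' or '?'.
function splitSentences(text: string): string[] {
  const isEnd = (c: string): boolean => c === "." || c === "!" || c === "?";
  const sentences: string[] = [];
  let start = 0;
  for (let i = 0; i < text.length; i++) {
    // Split after the last terminator of a run, so "Wait..." stays whole.
    if (isEnd(text.charAt(i)) && !isEnd(text.charAt(i + 1))) {
      const sentence = text.slice(start, i + 1).trim();
      if (sentence.length > 0) sentences.push(sentence);
      start = i + 1;
    }
  }
  const tail = text.slice(start).trim();
  if (tail.length > 0) sentences.push(tail);
  return sentences;
}

// Synthesize one sentence at a time and deliver each audio chunk immediately,
// without waiting for (or knowing) the total length of the rendered audio.
function streamSynthesis(
  text: string,
  synthesize: (sentence: string) => Uint8Array,
  onChunk: (audio: Uint8Array) => void
): number {
  const sentences = splitSentences(text);
  for (const sentence of sentences) {
    onChunk(synthesize(sentence));
  }
  return sentences.length; // number of chunks delivered
}
```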

R19. End user extensions should be available both on desktop and in cloud

End-user extensions should be accessible either from the desktop or from the cloud.

Relevant Use Cases Include:

R20. Web author selected TTS service should be available both on device and in the cloud

It should be possible to specify a target TTS engine not only via the "URI" attribute, but via a more generic "source" attribute, which can point to a local TTS engine as well. To achieve this, it would be useful to think about the extensibility and flexibility of the framework, so that it is easy for third parties to provide high quality TTS engines.

Relevant Use Cases Include: U9 Temporal Structure of Synthesis to Provide Visual Feedback, U10 Hello World Use Case, U11 Speech Translation, U12 Speech Enabled Email Client, U13 Dialog Systems, U14 Multimodal Interaction, U15 Speech Driving Directions.

R21. Any public interface for creating extensions should be speakable

Any public interfaces for creating extensions should be "speakable". A user should never need to touch the keyboard in order to expand a grammar, reference data, or add functionality.

Relevant Use Cases Include:

R22. Web application author wants to provide a consistent user experience across all modalities

A developer creating a (multimodal) interface combining speech input with graphical output needs to have the ability to provide a consistent user experience not just for graphical elements but also for voice. In addition, high quality speech applications often involve a lot of tuning of recognition parameters and grammars to work with different recognition technologies. A web author may wish for her application to only need to tune the speech recognition with one technology stack, and not have to tune and special case different grammars and parameters for different user agents. There exists enough browser detection in the web developer world to deal with accidental incompatibility and legacy implementations without causing speech to require it by design for quality speech recognition. This is one reason to allow author specified networked speech services.

Relevant Use Cases Include: U1 Voice Web Search, U2 Speech Command Interface, U3 Domain Specific Grammars Contingent on Earlier Inputs, U4 Continuous Recognition of Open Dialog, U5 Domain Specific Grammars Filling Multiple Input Fields, U6 Speech UI present when no visible UI need be present, U7 Rerecognition, U8 Voice Activity Detection, U9 Temporal Structure of Synthesis to Provide Visual Feedback, U11 Speech Translation, U12 Speech Enabled Email Client, U13 Dialog Systems, U14 Multimodal Interaction, U15 Speech Driving Directions.

R23. Speech as an input on any application should be able to be optional

If the user can't speak, can't speak the language well enough to be recognized, finds that the speech recognizer just doesn't work well for them, or is in an environment where speaking would be inappropriate, they should be able to interact with the web application some other way.

Relevant Use Cases Include: U14 Multimodal Interaction.

R24. End user should be able to use speech in a hands-free mode

There should be a way to speech-enable every aspect of a web application that you would do with a mouse, a touchscreen, or by typing.

Relevant Use Cases Include: U1 Voice Web Search, U2 Speech Command Interface, U4 Continuous Recognition of Open Dialog, U8 Voice Activity Detection, U12 Speech Enabled Email Client, U13 Dialog Systems, U14 Multimodal Interaction, U15 Speech Driving Directions.

R25. It should be easy to extend the standard without affecting existing speech applications

If recognizers support new capabilities like language detection or gender detection, it should be easy to add the results of those new capabilities to the speech recognition result, without requiring a new version of the standard.

Relevant Use Cases Include: U13 Dialog Systems, U14 Multimodal Interaction.
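
One way to picture this kind of extensibility is a result object with a fixed core and an open-ended set of named extensions, as in the non-normative TypeScript sketch below. The ExtensibleResult shape, the extension name "detectedLanguage", and the function detectedLanguage are all hypothetical.

```typescript
// Hypothetical result: fixed core fields plus open-ended, recognizer-specific extras.
interface ExtensibleResult {
  utterance: string;
  confidence: number;
  extensions?: { [name: string]: unknown };
}

// Read an optional extension. Applications written before the extension
// existed never look at it and so keep working unchanged.
function detectedLanguage(result: ExtensibleResult): string | null {
  const value = result.extensions ? result.extensions["detectedLanguage"] : undefined;
  return typeof value === "string" ? value : null;
}
```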

R26. There should exist a high quality default speech recognition visual user interface

Multimodal speech recognition apps are typically accompanied by a GUI experience to (i) provide a means to invoke SR; and (ii) indicate progress of recognition through various states (listening to the user speak; waiting for the recognition result; displaying errors; displaying alternates; etc). Many applications, at least in their initial development, and in some cases the finished product, will not implement their own GUI for controlling speech recognition. These applications will rely on the browser to implement a default control to begin speech recognition, such as a GUI button on the screen or a physical button on the device, keyboard or microphone. They will also rely on a default GUI to indicate the state of recognition (listening, waiting, error, etc).

Relevant Use Cases Include: U1 Voice Web Search, U3 Domain Specific Grammars Contingent on Earlier Inputs, U4 Continuous Recognition of Open Dialog, U6 Speech UI present when no visible UI need be present, U13 Dialog Systems, U14 Multimodal Interaction.

R27. Grammars, TTS, media composition, and recognition results should all use standard formats

No developer likes to be locked into a particular vendor's implementation. In some cases this will be unavoidable due to differentiation in capabilities between vendors. But general concepts like grammars, TTS and media composition, and recognition results should use standard formats (e.g. SRGS, SSML, SMIL, EMMA).

Relevant Use Cases Include: U1 Voice Web Search, U2 Speech Command Interface, U3 Domain Specific Grammars Contingent on Earlier Inputs, U4 Continuous Recognition of Open Dialog, U5 Domain Specific Grammars Filling Multiple Input Fields, U7 Rerecognition, U9 Temporal Structure of Synthesis to Provide Visual Feedback, U11 Speech Translation, U12 Speech Enabled Email Client, U13 Dialog Systems, U14 Multimodal Interaction, U15 Speech Driving Directions.

A.3 Security and Privacy Requirements

These requirements have to do with security, privacy, and user expectations. These often don't have specific use cases, and mitigations may need to be explored through appropriate user agent permissions that allow some of these actions on certain trusted sites while forbidding them on others.

R28. Web application must not be allowed access to raw audio

Some users may be concerned that their audio could be recorded and then controlled by the web application author, so user agents must prevent this.

Relevant Use Cases Include:

R29. Web application may only listen in response to user action

Some users may be concerned that their audio could be recognized without their being aware of it, so user agents must make sure that recognition only occurs in response to explicit end user actions.

Relevant Use Cases Include:

R30. End users should not be forced to store anything about their speech recognition environment in the cloud

For reasons of privacy, the user should not be forced to store anything about their speech recognition environment in the cloud.

Relevant Use Cases Include:

R31. End users, not web application authors, should be the ones to select speech recognition resources

Selection of the speech engine should be a user-setting in the browser, not a Web developer setting. The security bar is much higher for an audio recording solution that can be pointed at an arbitrary destination.

Relevant Use Cases Include:

R32. End users need a clear indication whenever microphone is listening to the user

Many users are sensitive about who or what is listening to them, and will not tolerate an application that listens to the user without the user's knowledge. A browser needs to provide clear indication to the user either whenever it will listen to the user or whenever it is using a microphone to listen to the user.

Relevant Use Cases Include:

R33. User agents need a way to enable end users to grant permission to an application to listen to them

Some users will want to explicitly grant permission for the user agent, or an application, to listen to them. Whether this is a setting that is global, applies to a subset of applications/domains, etc, depends somewhat on the security & privacy expectations of the user agent's customers.

Relevant Use Cases Include:
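
As a non-normative illustration of how a user agent might combine a per-site permission setting with the rule that listening only follows a user action, consider the TypeScript sketch below. The permission states, the per-origin store, and the function decideListen are assumptions for this example and do not describe any existing browser.

```typescript
// Hypothetical permission states for a site that wants to listen.
type ListenPermission = "granted" | "denied" | "prompt";

interface ListenRequest {
  origin: string;                   // site asking to start recognition
  triggeredByUserAction: boolean;   // e.g. the user pressed a microphone button
}

// Decide what to do with a request to start listening.
function decideListen(
  request: ListenRequest,
  stored: { [origin: string]: ListenPermission }
): ListenPermission {
  // Never listen unless the end user explicitly asked for it.
  if (!request.triggeredByUserAction) {
    return "denied";
  }
  // Use the user's stored choice for this site, or ask if there is none.
  if (Object.prototype.hasOwnProperty.call(stored, request.origin)) {
    return stored[request.origin];
  }
  return "prompt";
}
```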

R34. A trust relation is needed between end user and whatever is doing recognition

The user also needs to be able to trust and verify that their utterance is processed by the application that's on the screen (or its backend servers), or at least by a service the user trusts.

Relevant Use Cases Include: