HTML Speech Incubator Group Final Report (Internal Draft)

1 Terminology

The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this specification are to be interpreted as described in [IETF RFC 2119].

2 Overview

This document presents the deliverables of the HTML Speech Incubator Group. First, it presents the requirements developed by the group, ordered by priority of interest of the group members. Next, it covers the use cases developed by the group. Next, it briefly describes and points to the major individual proposals sent in to the group as proof-of-concept examples to help the group be aware of both possibilities and tradeoffs. It then presents design possibilities on important topics, providing decisions where the group had consensus and alternatives where multiple strongly differing opinions existed, with a focus on satisfying the high-interest requirements. Finally, the document contains (all or some of) a proposed solution that addresses the high-interest requirements and the design decisions.

The major steps the group took in working towards API recommendations, rather than just the final decisions, are recorded to act as an aid to any future standards-track efforts in understanding the motivations that drove the recommendations. Thus, even if a final standards-track document differs from any API recommendations in this document, the final standard should address the requirements, use cases, and design decisions laid out by this Incubator Group.

3 Deliverables

According to the charter, the group is to produce one deliverable, this document. The charter goes on to state that the document may include:

Requirements
Use cases
Change requests to HTML5 and, as appropriate, other specifications, e.g., capture API, CSS, Audio XG, EMMA, SRGS, VoiceXML 3

The group has developed requirements, some with use cases, and has made progress towards one or more API proposals that are effectively change requests to other existing standard specifications. These subdeliverables follow.

3.1 Prioritized Requirements

The HTML Speech Incubator Group developed and prioritized requirements as described in the Requirements and use cases document. A summary of the results is presented below with requirements listed in priority order, and segmented into those with strong interest, those with moderate interest, and those with mild interest. Each requirement is linked to its description in the requirements document.

3.1.1 Strong Interest

A requirement was classified as having "strong interest" if at least 80% of the group believed it needs to be addressed by any specification developed based on the work of this group. These requirements are:

3.1.2 Moderate Interest

A requirement was classified as having "moderate interest" if less than 80% but at least 50% of the group believed it needs to be addressed by any specification developed based on the work of this group. These requirements are:

3.1.3 Mild Interest

A requirement was classified as having "mild interest" if less than 50% of the group believed it needs to be addressed by any specification developed based on the work of this group. These requirements are:

3.2 New Requirements

While discussing some of the use cases, proposals, design decisions, and possible solutions, a few more requirements were discovered and agreed to. These requirements are:

3.3 Use Cases

Throughout this process the group has developed many use cases covering a variety of scenarios. These use cases were developed as part of the requirements process, through the proposals that were submitted, and in our group discussion. It is important that our proposed solutions support as many of these use cases as possible, as easily as possible.

3.3.1 Voice Web Search

The user can speak a query and get a result.

3.3.2 Speech Command Interface

A Speech Command and Control Shell that allows multiple commands, many of which may take arguments, such as "call [number]", "call [person]", "calculate [math expression]", "play [song]", or "search for [query]".

3.3.3 Domain Specific Grammars Contingent on Earlier Inputs

A use case exists around collecting multiple domain specific inputs sequentially where the later inputs depend on the results of the earlier inputs. For instance, changing which cities are in a grammar of cities in response to the user saying in which state they are located.
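As a non-normative illustration of this use case, the following sketch picks the grammar for the second turn from the result of the first. The grammar URIs, the `CITY_GRAMMARS` table, and the `cityGrammarFor` helper are all hypothetical; nothing in this report defines them.

```javascript
// Hypothetical grammar URIs; the report does not define any of these.
const CITY_GRAMMARS = new Map([
  ["new york", "http://example.com/grammars/cities-ny.grxml"],
  ["california", "http://example.com/grammars/cities-ca.grxml"],
]);
const ALL_CITIES_GRAMMAR = "http://example.com/grammars/cities-all.grxml";

// Choose the grammar for the city turn from the result of the state turn.
// Falls back to a broad grammar when the state is not in the table.
function cityGrammarFor(recognizedState) {
  const key = recognizedState.trim().toLowerCase();
  return CITY_GRAMMARS.get(key) ?? ALL_CITIES_GRAMMAR;
}
```

An application would pass the utterance recognized in the state turn to `cityGrammarFor` and use the returned URI as the grammar for the city turn.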

3.3.4 Continuous Recognition of Open Dialog

This use case is to collect free form spoken input from the user. This might be particularly relevant to an email system, for instance. When dictating an email, the user will continue to utter sentences until they're done composing their email. The application will provide continuous feedback to the user by displaying words within a brief period of the user uttering them. The application continues listening and updating the screen until the user is done. Sophisticated applications will also listen for command words used to add formatting, perform edits, or correct errors.

3.3.5 Domain Specific Grammars Filling Multiple Input Fields

Many web applications incorporate a collection of input fields, generally expressed as forms, with some text boxes to type into and lists to select from, with a "submit" button at the bottom. For example, "find a flight from New York to San Francisco on Monday morning returning Friday afternoon" might fill in a web form with two input elements for origin (place & date), two for destination (place & time), one for mode of transport (flight/bus/train), and a command (find) for the "submit" button. The results of the recognition would end up filling all of these multiple input elements with just one user utterance. This application is valuable because the user just has to initiate speech recognition once to complete the entire screen.
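The following sketch shows one way a single recognition result could populate several fields at once. It is only an illustration: the slot names, the plain object standing in for the form, and the `fillForm` helper are assumptions, not part of any proposal in this report.

```javascript
// Copy each recognized slot into the form field of the same name.
// Slots without a matching field are returned separately, not dropped.
function fillForm(fields, slots) {
  const filled = { ...fields };
  const unmatched = [];
  for (const [name, value] of Object.entries(slots)) {
    if (Object.prototype.hasOwnProperty.call(fields, name)) {
      filled[name] = value;
    } else {
      unmatched.push(name);
    }
  }
  return { filled, unmatched };
}

// Hypothetical semantic interpretation of the flight utterance above.
const slots = {
  originPlace: "New York",
  originDate: "Monday morning",
  destinationPlace: "San Francisco",
  destinationTime: "Friday afternoon",
  mode: "flight",
  command: "find",
};
// A plain object stands in for the form's input elements.
const emptyForm = {
  originPlace: "",
  originDate: "",
  destinationPlace: "",
  destinationTime: "",
  mode: "",
};
const { filled, unmatched } = fillForm(emptyForm, slots);
```

Here `unmatched` contains only `command`, which the application could map to activating the "submit" button rather than to an input field.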

3.3.6 Speech UI present when no visible UI need be present

Some speech applications are oriented around determining the user's intent before gathering any specific input, and hence their first interaction may have no visible input fields whatsoever, or may accept speech input that is far less constrained than the fields on the screen. For example, the user may simply be presented with the text "how may I help you?" (maybe with some speech synthesis or an earcon), and then utter their request, which the application analyzes in order to route the user to an appropriate part of the application. This isn't simply selection from a menu, because the list of options may be huge, and the number of ways each option could be expressed by the user is also huge. In any case, the speech UI (grammar) is very different from whatever input elements may or may not be displayed on the screen. In fact, there may not even be any visible non-speech input elements displayed on the page.

3.3.7 Rerecognition

Some sophisticated applications will re-use the same utterance in two or more recognition turns in what appears to the user as one turn. For example, an application may ask "how may I help you?", to which the user responds "find me a round trip from New York to San Francisco on Monday morning, returning Friday afternoon". An initial recognition against a broad language model may be sufficient to understand that the user wants the "flight search" portion of the app. Rather than make the user repeat themselves, the application will re-use the existing utterance for the flight search recognition.
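The two-pass flow can be sketched as follows. The recognizers are modeled as plain functions over stored audio, which is an assumption made purely for illustration; `recognizeWithReuse` and the `intent` property are hypothetical names.

```javascript
// Run a broad recognition first, then re-use the same stored audio for a
// domain-specific recognition chosen from the first result.
// `broadRecognizer` and the values of `domainRecognizers` are hypothetical
// stand-ins for requests to a speech service.
function recognizeWithReuse(storedAudio, broadRecognizer, domainRecognizers) {
  const routing = broadRecognizer(storedAudio);
  const domainRecognizer = domainRecognizers.get(routing.intent);
  // Same audio, second pass: the user is never asked to repeat the request.
  return domainRecognizer ? domainRecognizer(storedAudio) : routing;
}
```

If no domain recognizer matches the routing result, the sketch simply returns the first-pass result.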

3.3.8 Voice Activity Detection

Automatic detection of speech/non-speech boundaries is needed for a number of valuable user experiences such as "push once to talk" or "hands-free dialog". In push once to talk, the user manually interacts with the app to indicate that the app should start listening. For example, they raise the device to their ear, press a button on the keypad, or touch a part of the screen. When they're done talking, the app automatically performs the speech recognition without the user needing to touch the device again. In hands-free dialog, the user can start and stop talking without any manual input to indicate when the application should be listening. The application and/or browser needs to automatically detect when the user has started talking, so it can initiate speech recognition. This is particularly useful for in-car or 10-foot usage (e.g., living room), or for people with disabilities.

3.3.9 Temporal Structure of Synthesis to Provide Visual Feedback

The application may wish to visually highlight the word or phrase that the application is synthesizing. Or, alternatively, the visual application may wish to coordinate the synthesis with animations of an avatar speaking or with appropriately timed slide transitions and thus need to know where in the reading of the synthesized text the application currently is. In addition, the application may wish to know where in a piece of synthesized text an interruption occurred and use the temporal feedback to tell.

3.3.10 Hello World

The web page when loaded may wish to say a simple phrase of synthesized text such as "hello world".

3.3.11 Speech Translation

The application can act as a translator between two individuals fluent in different languages. The application can listen to one speaker and understand the utterances in one language, can translate the spoken phrases to a different language, and then can speak the translation to the other individual.

3.3.12 Speech Enabled Email Client

The application reads out subjects and contents of email and also listens for commands, for instance, "archive", "reply: ok, let's meet at 2 pm", "forward to bob", "read message". Some commands may relate to VCR like controls of the message being read back, for instance, "pause", "skip forwards", "skip back", or "faster". Some of those controls may include controls related to parts of speech, such as, "repeat last sentence" or "next paragraph".

One other important email scenario is that when an email message is received, a summary notification may be raised that displays a small amount of content (for instance the person the email is from and a couple of words of the subject). It is desirable that a speech API be present and listening for the duration of this notification, allowing a user experience of being able to say "Reply to that" or "Read that email message". Note that this recognition UI could not be contingent on the user clicking a button, as that would defeat much of the benefit of this scenario (being able to reply and control the email without using the keyboard or mouse).

3.3.13 Dialog Systems

This use case covers dialogs that collect multiple pieces of information, in either one turn or sequential turns, in response to frequently synthesized prompts. Examples include ordering a pizza or booking a flight route, complete with the system repeating back the choices the user said. This dialog system may well be represented by a VXML form or application that allows for control of the dialog. The VXML dialog may be fetched using XMLHttpRequest.

3.3.14 Multimodal Interaction

The ability to mix and integrate input from multiple modalities such as by saying "I want to go from here to there" while tapping two points on a touch screen map.

3.3.15 Speech Driving Directions

A direction service that speaks turn-by-turn directions. It accepts hands-free spoken instructions like "navigate to [address]" or "navigate to [business listing]" or "reroute using [road name]". Input from the location of the user may help the service know when to play the next direction. It is possible that the user is not able to see any output, so the service needs to regularly synthesize phrases like "turn left on [road] in [distance]".

3.3.16 Multimodal Video Game

The user combines speech input and output with tactile input and visual output to enable scenarios such as tapping a location on the screen while issuing an in-game command like "Freeze Spell". Speech could be used either to initiate the action or as an inventory-changing system, all while the normal action of the video game is continuing.

3.3.17 Multimodal Search

The user points their cell phone camera at an object, and uses voice to issue some query about that object (such as "what is this", "where can I buy one", or "do these also come in pink").

3.4 Individual Proposals

The following individual proposals were sent in to the group to help drive discussion.

From Google, a speech input API with a modification and a TTS proposal.
From Mozilla, a speech input proposal.
From Microsoft, a speech and TTS proposal.
From Voxeo, a description of Voxeo's Javascript Tropo API.

3.5 Solution Design Agreements and Alternatives

This section attempts to capture the major design decisions the group made. In cases where substantial disagreements existed, the relevant alternatives are presented rather than a decision. Note that text only went into this section if it either represented group consensus or an accurate description of the specific alternative, as appropriate.

3.5.1 General Design Decisions

There are three aspects to the solution which must be addressed: communication with and control of speech services, a script-level API, and markup-level hooks and capabilities.
The script API will be Javascript.
The scripting API is the primary focus, with all key functionality available via scripting. Any HTML markup capabilities, if present, will be based completely on the scripting capabilities.
Notifications from the user agent to the web application should be in the form of Javascript events/callbacks.
For ASR, there must at least be these three logical functions:
1. start speech input and start processing
2. stop speech input and get result
3. cancel (stop speech input and ignore result)
For TTS, there must be at least these two logical functions:
1. play
2. pause
There is agreement that it should be possible to stop playback, but there is not agreement on the need for an explicit stop function.
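The three logical ASR functions can be pictured with the following sketch. It is not a proposed API: the `AsrSession` class, its `feed` method, and the `recognize` callback standing in for the speech service are assumptions. For simplicity the sketch defers all processing to `stop`, whereas a real service would begin processing when input starts.

```javascript
class AsrSession {
  // `recognize` is a hypothetical stand-in for the speech service.
  constructor(recognize) {
    this.recognize = recognize;
    this.chunks = null; // null means no input is being captured
  }

  // 1. start speech input and start processing
  start() {
    if (this.chunks !== null) throw new Error("already started");
    this.chunks = [];
  }

  // Audio captured while the session is active.
  feed(chunk) {
    if (this.chunks === null) throw new Error("not started");
    this.chunks.push(chunk);
  }

  // 2. stop speech input and get result
  stop() {
    if (this.chunks === null) throw new Error("not started");
    const captured = this.chunks;
    this.chunks = null;
    return this.recognize(captured);
  }

  // 3. cancel: stop speech input and ignore result
  cancel() {
    this.chunks = null;
  }
}
```

After `cancel`, the captured input is discarded and a later `stop` fails until `start` is called again.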
It must be possible for a web application to specify the speech engine.
Speech service implementations must be referenceable by URI.
It must be possible to reference ASR grammars by URI.
It must be possible to select the ASR language using language tags.
It must be possible to leave the ASR grammar unspecified. Behavior in this case is not yet defined.
The XML format of SRGS 1.0 is mandatory to support, and it is the only mandated grammar format. Note in particular that this means we do not have any requirement for SLM support or SRGS ABNF support.
For TTS, SSML 1.1 is mandatory to support, as is UTF-8 plain text. These are the only mandated formats.
SISR 1.0 support is mandatory, and it is the only mandated semantic interpretation format.
There must be no technical restriction that would prevent using only TTS or only ASR.
There must be no technical restriction that would prevent implementing only TTS or only ASR. There is *mostly* agreement on this.
There will be a mandatory set of capabilities with stated limitations on interoperability.
For reco results, both the DOM representation of EMMA and the XML text representation must be provided.
For reco results, a simple Javascript representation of a list of results must be provided, with each result containing the recognized utterance, confidence score, and semantic interpretation. Note that this may need to be adjusted based on any decision regarding support for continuous recognition.
For grammar URIs, the "HTTP" and "data" protocol schemes must be supported.
A standard set of common-task grammars must be supported. The details of what those are is TBD.
The API should be able to start speech reco without having to select a microphone, i.e., there must be a notion of a "default" microphone.
There should be a default user interface.
The user agent must notify the user when audio is being captured. Web applications must not be able to override this notification.
It must be possible to customize the user interface to control how recognition start is indicated.
If the HTML standard has an audio capture API, we should be able to use it for ASR. If not, we should not create one, and we will not block waiting for one to be created.
We will collect requirements on audio capture APIs and relay them to relevant groups.
A low-latency endpoint detector must be available. It should be possible for a web app to enable and disable it, although the default setting (enabled/disabled) is TBD. The detector detects both start of speech and end of speech and fires an event in each case.
The API will provide control over which portions of the captured audio are sent to the recognizer.
We expect to have the following six audio/speech events: onaudiostart/onaudioend, onsoundstart/onsoundend, onspeechstart/onspeechend. The onsound* events represent a "probably speech but not sure" condition, while the onspeech* events represent the recognizer being sure there's speech. The former are low latency. An end event can only occur after at least one start event of the same type has occurred. Only the user agent can generate onaudio* events, the energy detector can only generate onsound* events, and the speech service can only generate onspeech* events.
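The ordering and source constraints on these six events can be expressed as a small checker. This is an illustrative sketch only; the `source` labels and the `validateEvents` helper are hypothetical.

```javascript
// Which component may generate each event family (labels are hypothetical).
const ALLOWED_SOURCE = {
  audio: "user-agent",
  sound: "energy-detector",
  speech: "speech-service",
};

// Returns null if the sequence obeys the constraints, otherwise a string
// describing the first violation.
function validateEvents(events) {
  const started = new Set();
  for (const { name, source } of events) {
    const match = /^on(audio|sound|speech)(start|end)$/.exec(name);
    if (!match) return `unknown event ${name}`;
    const [, type, phase] = match;
    if (ALLOWED_SOURCE[type] !== source) return `${source} may not generate ${name}`;
    if (phase === "start") {
      started.add(type);
    } else if (!started.has(type)) {
      // An end event needs at least one earlier start event of the same type.
      return `${name} occurred before any on${type}start`;
    }
  }
  return null;
}
```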
There are 3 classes of codecs: audio to the web-app specified ASR engine, recognition from existing audio (e.g., local file), and audio from the TTS engine. We need to specify a mandatory-to-support codec for each.
It must be possible to specify and use other codecs in addition to those that are mandatory-to-implement.
Support for streaming audio is required -- in particular, that ASR may begin processing before the user has finished speaking.
It must be possible for the recognizer to return a final result before the user is done speaking.
We will require support for http for all communication between the user agent and any selected engine, including chunked http for media streaming, and support negotiation of other protocols (such as WebSockets or whatever RTCWeb/WebRTC comes up with).
Maxresults should be an ASR parameter representing the maximum number of results to return.
The user agent will use the URI for the ASR engine exactly as specified by the web application, including all parameters, and will not modify it to add, remove, or change parameters.
The scripting API communicates its parameter settings by sending them in the body of a POST request as Media Type "multipart". The subtype(s) accepted (e.g., mixed, formdata) are TBD.
If an ASR engine allows parameters to be specified in the URI in addition to in the POST body, when a parameter is specified in both places the one in the body takes precedence. This has the effect of making parameters set in the URI be treated as default values.
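The precedence rule can be sketched as follows, from the point of view of an ASR engine that accepts parameters in both places. The `effectiveParameters` helper and the example URI are hypothetical. Note that the sketch only reads the URI; it does not modify it.

```javascript
// Merge parameters so that values in the POST body override values in the
// URI, which therefore act as defaults. URI values arrive as strings.
function effectiveParameters(serviceUri, bodyParams) {
  const fromUri = Object.fromEntries(new URL(serviceUri).searchParams);
  return { ...fromUri, ...bodyParams };
}

const merged = effectiveParameters(
  "https://asr.example.com/reco?maxresults=5&lang=en-US",
  { maxresults: 3 }
);
```

In this example `maxresults` is taken from the body, while `lang` falls back to the value given in the URI.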
We cannot expect consistency in language support and performance/quality.
We agree that there must be API-level consistency regardless of user agent and engine.
We agree on having the same level of consistency across all four of the following categories:
1. consistency between different UAs using their default engine
2. consistency between different UAs using web app specified engine
3. consistency between different UAs using different web specified engines
4. consistency between default engine and specified engines
The exception is that #4 may have limitations due to privacy issues.
From this point on we will use "service" rather than "engine" because a service may be a proxy for more than one engine.
We will not support selection of service by characteristics.
Add to the list of expected inconsistencies (change from existing wording of interoperability): reco performance including maximum size on parameters, microphone characteristics, semantics and exact values of sensitivity and confidence, time needed to perform ASR/TTS, latencies, endpoint sensitivity and latency, result contents, presence/absence of optional events, recorded waveform.
For continuous recognition, we must support the ability to change grammars and parameters for each chunk/frame/result.
If the user's device is emitting sounds other than those produced by the current HTML page, there is no requirement that the User Agent detect, reduce, or eliminate them.
If a web app specifies a speech service and it is not available, an error is thrown. No automatic fallback to another service or the default service takes place.
The API should provide a way to determine if a service is available before trying to use the service; this applies to the default service as well.
The API must provide a way to query the availability of a specific configuration of a service.
The API must provide a way to ask the user agent for the capabilities of a service. In the case of private information that the user agent may have when the default service is selected, the user agent may choose to answer with "no comment" (or equivalent).
Informed user consent is required for all use of private information. This includes list of languages for ASR and voices for TTS. When such information is requested by the web app or speech service and permission is refused, the API must return "no comment" (or equivalent).
It must be possible for user permission to be granted at the level of specific web apps and/or speech services.
User agents, acting on behalf of the user, may deny the use of specific web apps and/or speech services.
The API will support multiple simultaneous grammars, any combination of allowed grammar formats. It will also support a weight on each grammar.
The API will support multiple simultaneous requests to speech services (same or different, ASR and TTS).
We disagree about whether there needs to be direct API support for a single ASR request and single TTS request that are tied together.
It must be possible to individually control ASR and TTS.
It must be possible for the web app author to get timely information about recognition event timing and about TTS playback timing. It must be possible for the web app author to determine, for any specific UA local time, what the previous TTS mark was and the offset from that mark.
It must be possible for the web app to stop/pause/silence audio output directly at the client/user agent.
When audio corresponding to TTS mark location begins to play, a Javascript event must be fired, and the event must contain the name of the mark and the UA timestamp for when it was played.
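Given the mark events received so far, the "previous mark and offset" query for a UA local time could be answered as in the sketch below. The event shape (`name`, `timestamp` in milliseconds) and the `previousMark` helper are assumptions made for illustration.

```javascript
// Find the most recent mark played at or before `uaTime`, and how far past
// that mark `uaTime` is. Returns null if no mark had been played yet.
function previousMark(markEvents, uaTime) {
  let found = null;
  for (const mark of markEvents) {
    const isCandidate = mark.timestamp <= uaTime;
    if (isCandidate && (found === null || mark.timestamp >= found.timestamp)) {
      found = mark;
    }
  }
  return found && { name: found.name, offsetMs: uaTime - found.timestamp };
}
```

An application could use the returned name and offset to decide which word or phrase to highlight, or to work out where in the synthesized text an interruption occurred.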
It must be possible to specify service-specific parameters in both the URI and the message body. It must be clear in the API that these parameters are service-specific, i.e., not standard.
Every message from UA to speech service should send the UA-local timestamp.
API must have ability to set service-specific parameters using names that clearly identify that they are service-specific, e.g., using an "x-" prefix. Parameter values can be arbitrary Javascript objects.
EMMA already permits app-specific result info, so there is no need to provide other ways for service-specific information to be returned in the result.
The API must support DOM 3 extension events as defined (which basically require vendor prefixes). See http://www.w3.org/TR/2009/WD-DOM-Level-3-Events-20090908/#extending_events-Vendor_Extensions. It must allow the speech service to fire these events.
The protocol must send its current timestamp to the speech service when it sends its first audio data.
It must be possible for the speech service to instruct the UA to fire a vendor-specific event when a specific offset to audio playback start is reached by the UA. What to do if audio is canceled, paused, etc. is TBD.
HTTPS must also be supported.
A web app used over a secure communication channel should be treated the same as any other secured site (e.g., with respect to using a non-secured channel for speech data).
Default speech service implementations are encouraged not to use unsecured network communication when started by a web app in a secure communication channel.
In Javascript, speech reco requests should have an attribute for a sequence of grammars, each of which can have properties, including weight (and possibly language, but that is TBD).
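A request carrying a sequence of weighted grammars might look like the sketch below. The property names `src` and `weight`, the default weight of 1, and the `normalizeGrammars` helper are assumptions; the report leaves these details open.

```javascript
// Validate a sequence of grammars and apply a default weight.
// Property names and the default weight are illustrative assumptions.
function normalizeGrammars(grammars) {
  return grammars.map((grammar) => {
    const { src, weight = 1 } = grammar;
    if (typeof src !== "string" || src === "") {
      throw new TypeError("each grammar needs a src URI");
    }
    if (typeof weight !== "number" || !(weight >= 0)) {
      throw new TypeError("weight must be a non-negative number");
    }
    return { src, weight };
  });
}

const recoRequest = {
  grammars: normalizeGrammars([
    { src: "http://example.com/grammars/cities.grxml", weight: 2 },
    { src: "http://example.com/grammars/commands.grxml" },
  ]),
};
```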
In Javascript, it will be possible to set parameters as dot properties and also via a getParameters method. The browser should also allow service-specific parameters to be set this way.
Bjorn's email on continuous recognition represents our decisions regarding continuous recognition, except that there needs to be a feedback mechanism which could result in the service sending replaces. We may refer to "intermediate" as "partial", but naming changes such as this are TBD.
There will be an API method for sending text input rather than audio. There must also be a parameter to indicate how text matching should be done, including at least "strict" and "fuzzy". Other possible ways could be defined as vendor-specific additions.
It must be possible to do one or more re-recognitions of any request that was flagged, before its first use, as eligible for later re-recognition. This will be indicated in the API by setting a parameter to indicate re-recognition. Any parameter can be changed for a re-recognition, including the speech service.
In the protocol, the client must store the audio for re-recognition. It may be possible for the server to indicate that it also has stored the audio so it doesn't have to be resent.
Once there is a way (defined by another group) to get access to some blob of stored audio, we will support re-recognition of it.
No explicit need for JSON format of EMMA, but we might use it if it existed.
Candidate codecs to consider are Speex, FLAC, and Ogg Vorbis, in addition to plain old mu-law/a-law/linear PCM.
Protocol design should not prevent implementability of low-latency event delivery.
The protocol should allow the client to begin TTS playback before receipt of all of the audio.
We will not require support for video codecs. However, protocol design must not prohibit transmission of codecs that have the same interface requirements as audio codecs.
Every event from speech service to the user agent must include timing information that the UA can convert into a UA-local timestamp. This timing info must be for the occurrence represented by the event, not the event time itself. For example, an end-of-speech event would contain timing for the actual end of speech, not the time when the speech service realizes end of speech occurred or when the event is sent.
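One possible way to satisfy this, shown only as a sketch, is for the service to report event times as offsets into the audio stream, which the UA adds to the local timestamp it recorded when it sent its first audio data. The offset-based encoding, the event shape, and the `toUaLocalTime` helper are assumptions; the report does not mandate a representation.

```javascript
// Assumption: the service reports event times as millisecond offsets into
// the audio stream, and the UA remembers its local clock reading for the
// first audio data it sent.
function toUaLocalTime(audioStartUaTimeMs, eventOffsetMs) {
  return audioStartUaTimeMs + eventOffsetMs;
}

const audioStart = 1000000; // UA-local time of first audio, in ms
const endOfSpeech = { type: "speechend", offsetMs: 4200 };
const receivedAt = 1004650; // UA-local time when the event arrived

// Time at which speech actually ended, in UA-local terms.
const occurredAt = toUaLocalTime(audioStart, endOfSpeech.offsetMs);
// Delay between the occurrence and the UA learning about it.
const detectionLag = receivedAt - occurredAt;
```

The distinction between `occurredAt` and `receivedAt` is the point of the requirement: the event carries the time of the occurrence, not the time it was detected or delivered.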
This group will not define an explicit recording capability at this time. Existing use cases can be satisfied either via a recognizer's recording capability or via protocols defined outside this group.

3.5.2 Speech Service Communication and Control Design Decisions

This is where design decisions regarding control of and communication with remote speech services, including media negotiation and control, will be recorded.

3.5.3 Script API Design Decisions

This is where design decisions regarding the script API capabilities and realization will be recorded.

It must be possible to define at least the following handlers (names TBD):
- onspeechstart (not yet clear precisely what start of speech means)
- onspeechend (not yet clear precisely what end of speech means)
- onerror (one or more handlers for errors)
- a handler for when the recognition result is available
Note: significant work is needed to get interoperability here.
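The handler set above could be realized along the lines of the following sketch. Since handler names are still TBD, `onresult`, the `RecognitionRequest` class, and its `dispatch` method are placeholders rather than proposed names.

```javascript
// Event types an application can handle; "result" is a placeholder name.
const HANDLED_TYPES = ["speechstart", "speechend", "error", "result"];

class RecognitionRequest {
  constructor() {
    // Handlers are plain properties that default to null.
    for (const type of HANDLED_TYPES) this["on" + type] = null;
  }

  // Called by the (hypothetical) user agent side to notify the application.
  // Returns true if a handler was invoked.
  dispatch(type, detail = {}) {
    if (!HANDLED_TYPES.includes(type)) {
      throw new Error(`unknown event type: ${type}`);
    }
    const handler = this["on" + type];
    const hasHandler = typeof handler === "function";
    if (hasHandler) handler({ type, ...detail });
    return hasHandler;
  }
}
```

An application would assign functions to the handlers it cares about, for example `request.onresult = (event) => { ... }`, and leave the rest unset.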

3.5.4 Markup API Design Decisions

This is where design decisions regarding the markup changes and/or enhancements will be recorded.

3.6 Proposed Solutions

The following sections cover proposed solutions that this Incubator Group recommends. The proposed solutions represent the consensus of the group, except where clearly indicated that an impasse was reached.

3.6.1 Protocol Proposal

TBD

3.6.2 Web Application API Proposal