HTML Speech Incubator Group Final Report (Internal Draft)

1 Terminology

The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this specification are to be interpreted as described in [IETF RFC 2119].

2 Overview

This document presents the deliverables of the HTML Speech Incubator Group. First, it presents the requirements developed by the group, ordered by priority of interest of the group members. Next, it briefly describes and points to the major individual proposals sent in to the group as proof-of-concept examples to help the group be aware of both possibilities and tradeoffs. It then presents design possibilities on important topics, providing decisions where the group had consensus and alternatives where multiple strongly differing opinions existed, with a focus on satisfying the high-interest requirements. Finally, the document contains (all or some of) a proposed solution that addresses the high-interest requirements and the design decisions.

The major steps the group took in working towards API recommendations, rather than just the final decisions, are recorded to act as an aid to any future standards-track efforts in understanding the motivations that drove the recommendations. Thus, even if a final standards-track document differs from any API recommendations in this document, the final standard should address the requirements and design decisions laid out by this Incubator Group.

3 Deliverables

According to the charter, the group is to produce one deliverable, this document. It goes on to state that the document may include

Requirements
Use cases
Change requests to HTML5 and, as appropriate, other specifications, e.g., capture API, CSS, Audio XG, EMMA, SRGS, VoiceXML 3

The group has developed requirements, some with use cases, and has made progress towards one or more API proposals that are effectively change requests to other existing standard specifications. These subdeliverables follow.

3.1 Prioritized Requirements

The HTML Speech Incubator Group developed and prioritized requirements as described in the Requirements and use cases document. A summary of the results is presented below with requirements listed in priority order, and segmented into those with strong interest, those with moderate interest, and those with mild interest. Each requirement is linked to its description in the requirements document.

3.1.1 Strong Interest

A requirement was classified as having "strong interest" if at least 80% of the group believed it needs to be addressed by any specification developed based on the work of this group. These requirements are:

3.1.2 Moderate Interest

A requirement was classified as having "moderate interest" if less than 80% but at least 50% of the group believed it needs to be addressed by any specification developed based on the work of this group. These requirements are:

3.1.3 Mild Interest

A requirement was classified as having "mild interest" if less than 50% of the group believed it needs to be addressed by any specification developed based on the work of this group. These requirements are:

3.2 Individual Proposals

The following individual proposals were sent in to the group to help drive discussion.

From Google, a speech input API with a modification and a TTS proposal.
From Mozilla, a speech input proposal.
From Microsoft, a speech and tts proposal.
From Voxeo, a description of Voxeo's Javascript Tropo API.

3.3 Solution Design Agreements and Alternatives

This section attempts to capture the major design decisions the group made. In cases where substantial disagreements existed, the relevant alternatives are presented rather than a decision. Note that text only went into this section if it either represented group consensus or an accurate description of the specific alternative, as appropriate.

3.3.1 General Design Decisions

There are three aspects to the solution which must be addressed: communication with and control of speech services, a script-level API, and markup-level hooks and capabilities.
The script API will be Javascript.
The scripting API is the primary focus, with all key functionality available via scripting. Any HTML markup capabilities, if present, will be based completely on the scripting capabilities.
Notifications from the user agent to the web application should be in the form of Javascript events/callbacks.
For ASR, there must at least be these three logical functions:
1. start speech input and start processing
2. stop speech input and get result
3. cancel (stop speech input and ignore result)
For TTS, there must be at least these two logical functions:
1. play
2. pause
There is agreement that it should be possible to stop playback, but there is not agreement on the need for an explicit stop function.
It must be possible for a web application to specify the speech engine.
Speech service implementations must be referenceable by URI.
It must be possible to reference ASR grammars by URI.
It must be possible to select the ASR language using language tags.
It must be possible to leave the ASR grammar unspecified. Behavior in this case is not yet defined.
The XML format of SRGS 1.0 is mandatory to support, and it is the only mandated grammar format. Note in particular that this means we do not have any requirement for SLM support or SRGS ABNF support.
For TTS, SSML 1.1 is mandatory to support, as is UTF-8 plain text. These are the only mandated formats.
SISR 1.0 support is mandatory, and it is the only mandated semantic interpretation format.
There must be no technical restriction that would prevent using only TTS or only ASR.
There must be no technical restriction that would prevent implementing only TTS or only ASR. There is *mostly* agreement on this.
There will be a mandatory set of capabilities with stated limitations on interoperability.
For reco results, both the DOM representation of EMMA and the XML text representation must be provided.
For reco results, a simple Javascript representation of a list of results must be provided, with each result containing the recognized utterance, confidence score, and semantic interpretation. Note that this may need to be adjusted based on any decision regarding support for continuous recognition.
For grammar URIs, the "HTTP" and "data" protocol schemes must be supported.
A standard set of common-task grammars must be supported. The details of what those are is TBD.
The API should be able to start speech reco without having to select a microphone, i.e., there must be a notion of a "default" microphone.
There should be a default user interface.
The user agent must notify the user when audio is being captured. Web applications must not be able to override this notification.
It must be possible to customize the user interface to control how recognition start is indicated.
If the HTML standard has an audio capture API, we should be able to use it for ASR. If not, we should not create one, and we will not block waiting for one to be created.
We will collect requirements on audio capture APIs and relay them to relevant groups.
A low-latency endpoint detector must be available. It should be possible for a web app to enable and disable it, although the default setting (enabled/disabled) is TBD. The detector detects both start of speech and end of speech and fires an event in each case.
The API will provide control over which portions of the captured audio are sent to the recognizer.
We expect to have the following six audio/speech events, strictly ordered as shown using BNF: (onaudiostart (onsoundstart (onspeechstart onspeechend)? onsoundend)? onaudioend)?, where parentheses represent a sequence and ? represents optionality. Note that every start event must be paired with an end event and that the *speech* events cannot occur without the *sound* events, which cannot occur without the *audio* events. The *sound* events represent a "probably speech but not sure" condition, while the *speech* events represent the recognizer being sure there's speech. The former are low latency.
There are 3 classes of codecs: audio to the web-app specified ASR engine, recognition from existing audio (e.g., local file), and audio from the TTS engine. We need to specify a mandatory-to-support codec for each.
It must be possible to specify and use other codecs in addition to those that are mandatory-to-implement.
Support for streaming audio is required -- in particular, that ASR may begin processing before the user has finished speaking.
It must be possible for the recognizer to return a final result before the user is done speaking.
We will require support for http for all communication between the user agent and any selected engine, including chunked http for media streaming, and support negotiation of other protocols (such as WebSockets or whatever RTCWeb/WebRTC comes up with).
Maxresults should be an ASR parameter representing the maximum number of results to return.
The user agent will use the URI for the ASR engine exactly as specified by the web application, including all parameters, and will not modify it to add, remove, or change parameters.
The scripting API communicates its parameter settings by sending them in the body of a POST request as Media Type "multipart". The subtype(s) accepted (e.g., mixed, formdata) are TBD.
If an ASR engine allows parameters to be specified in the URI in addition to in the POST body, when a parameter is specified in both places the one in the body takes precedence. This has the effect of making parameters set in the URI be treated as default values.

3.3.2 Speech Service Communication and Control Design Decisions

This is where design decisions regarding control of and communication with remote speech services, including media negotiation and control, will be recorded.

3.3.3 Script API Design Decisions

This is where design decisions regarding the script API capabilities and realization will be recorded.

It must be possible to define at least the following handlers (names TBD):
- onspeechstart (not yet clear precisely what start of speech means)
- onspeechend (not yet clear precisely what end of speech means)
- onerror (one or more handlers for errors)
- a handler for when the recognition result is available
Note: significant work is needed to get interoperability here.

3.3.4 Markup API Design Decisions

This is where design decisions regarding the markup changes and/or enhancements will be recorded.

3.4 Proposed Solution

TBD after we make substantial progress on the design decisions.

A References

IETF RFC 2119: RFC 2119: Key words for use in RFCs to Indicate Requirement Levels. Internet Engineering Task Force, 1997. (See http://www.ietf.org/rfc/rfc2119.txt.)

B Glossary

The following glossary provides brief definitions of terms that may not be familiar to readers new to the technology domain of speech processing.

ASR
barge-in
endpointer/endpointing
SLM
TTS

C Topics remaining to be discussed

This section holds a non-exhaustive list of topics the group has yet to discuss. It is for working purposes only and will likely be removed when the report is complete.

How to get audio/microphone/capture access -- discussed 5 May 2011
Which, if any, audio codecs are mandatory to support -- partially discussed 12 May 2011
How TTS fits in (building off media playing versus its own thing)
User preferences, privacy, and consent
What is meant by "start of speech", "end of speech", and endpointing in general? How do transmission delays affect the definitions and what we want in terms of APIs? -- partially discussed 5 May 2011
How will eventing work in Javascript? What about event bubbling?
Do apps need to work consistently across engines, and if so, what does "consistent" mean? If this is about the user experience, what does "consistent user experience" mean? Who is the user -- the webapp author and the end user?
What is the fallback behavior, if any, when capabilities are not supported?
Are ASR and TTS separate APIs? What sharing is there between the two, and how does the linkage for, say, barge-in handling work?
How much of the detail of the API between the user agent and a speech service should be visible to the web application API and/or author?
What is the mechanism to use capabilities outside of the mandated interoperable set, and where (default engine, remote engine, or both) can they be used?
List the areas and/or situations in which interoperability is not guaranteed. An initial list is: reco lang, synth lang, synth voice characteristics, reco performance, synth performance, exact timings
Review CSS3 for relationship to our work, if any; perhaps schedule joint call. See the initial email threads on this topic from Doug Schepers, TV Raman, and Daniel Weck.
What to do about continuous (ongoing) recognition. Is this different from ongoing keyword or hotword identification?
Should we support mulitple simultaneous grammars?
Parameterization of speech recognizers. How does it work?
Which ASR parameters should we have (timeouts, speedvsaccuracy, etc.)? Which are optional?
Do we support simulated reco input (semantic analysis of text)?
Can app author provide feedback on reco result? For example, which n-best result was correct.
API or protocol to speech engines, particularly remote.
How do we talk to a remote speech service, and how much do we specify? -- discussed 19 May 2011
What are the builtin (standard set of) grammars and how can they be parameterized, if possible?
What is the mechanism for authors to include grammars directly within their app HTML documents? Is this inline XML grammars or data URIs, or something else?
Do we support re-recognition?
Do we support recognition from file?
Do we need a recognition context block capability?
How do we specify default recognition?
How do we specify local recognizers?
How do we define "recognizer"? What parameters can an author specify on abilities of a recognizer (TTS also)? What is the boundary between "selecting a recognizer" and "selecting the parameters of a recognizer"?
Do we support audio streaming and how? -- partially discussed 12 May 2011
Should it be possible to have a user interface in the web app (in addition to or instead of in the browser chrome)
What level of customization of the user interface is allowed? One question is whether customization can affect how audio capture is indicated -- there can be security issues if the app can customize that notification. -- partially discussed 5 May 2011
What happens with the *audio*, *sound*, and *speech* events in the case of an error?
Sensitivity and timeout parameters for ASR
Do we want to support (the audio contained in) video streams?
What is the behavior of onspeechend if onsoundend is detected first?
Do we permit repeated parameters in the URI, body, or both? What do they mean?
What subtype(s) of multipart do we support for POST bodies?