Status of this Document
This section describes the status of this document at the
time of its publication. Other documents may supersede this
document. A list of current W3C publications and the latest
revision of this technical report can be found in the W3C technical reports index at
http://www.w3.org/TR/.
This document is the 18 May 2011 draft of the Final Report for
the HTML
Speech Incubator Group. Comments on this document are welcome
and may be sent to public-xg-htmlspeech@w3.org
(archives).
This document was produced according to the HTML Speech
Incubator Group's charter.
Please consult the charter for participation and intellectual
property disclosure requirements.
Publication as a W3C Note does not imply endorsement by the W3C
Membership. This is a draft document and may be updated, replaced
or obsoleted by other documents at any time. It is inappropriate to
cite this document as other than work in progress.
1 Terminology
The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT,
SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this
specification are to be interpreted as described in [IETF RFC 2119].
2 Overview
This document presents the deliverables of the HTML Speech
Incubator Group. First, it presents the requirements developed by
the group, ordered by priority of interest of the group members.
Next, it briefly describes and points to the major individual
proposals sent in to the group as proof-of-concept examples to help
the group be aware of both possibilities and tradeoffs. It then
presents design possibilities on important topics, providing
decisions where the group had consensus and alternatives where
multiple strongly differing opinions existed, with a focus on
satisfying the high-interest requirements. Finally, the document
contains (all or some of) a proposed solution that addresses the
high-interest requirements and the design decisions.
The major steps the group took in working towards API
recommendations, rather than just the final decisions, are recorded
to act as an aid to any future standards-track efforts in
understanding the motivations that drove the recommendations. Thus,
even if a final standards-track document differs from any API
recommendations in this document, the final standard should address
the requirements and design decisions laid out by this Incubator
Group.
3 Deliverables
According to the charter,
the group is to produce one deliverable, this document. It goes on
to state that the document may include
- Requirements
- Use cases
- Change requests to HTML5 and, as appropriate, other
specifications, e.g., capture API, CSS, Audio XG, EMMA, SRGS,
VoiceXML 3
The group has developed requirements, some with use cases, and
has made progress towards one or more API proposals that are
effectively change requests to other existing standard
specifications. These subdeliverables follow.
3.1 Prioritized Requirements
The HTML Speech Incubator Group developed and prioritized
requirements as described in the
Requirements and use cases document. A summary of the results
is presented below with requirements listed in priority order, and
segmented into those with strong interest, those with moderate
interest, and those with mild interest. Each requirement is linked
to its description in the requirements document.
3.1.1 Strong Interest
A requirement was classified as having "strong interest" if at
least 80% of the group believed it needs to be addressed by any
specification developed based on the work of this group. These
requirements are:
3.1.2 Moderate Interest
A requirement was classified as having "moderate interest" if
less than 80% but at least 50% of the group believed it needs to be
addressed by any specification developed based on the work of this
group. These requirements are:
3.1.3 Mild Interest
A requirement was classified as having "mild interest" if less
than 50% of the group believed it needs to be addressed by any
specification developed based on the work of this group. These
requirements are:
3.2 Individual Proposals
The following individual proposals were sent in to the group to
help drive discussion.
3.3 Solution Design Agreements and Alternatives
This section captures the major design decisions the group
made. Where substantial disagreement existed, the relevant
alternatives are presented rather than a decision. Text was included
in this section only if it represented either group consensus or,
for alternatives, an accurate description of the specific
alternative.
3.3.1 General Design Decisions
- There are three aspects to the solution which must be
addressed: communication with and control of speech services, a
script-level API, and markup-level hooks and capabilities.
- The script API will be JavaScript.
- The scripting API is the primary focus, with all key
functionality available via scripting. Any HTML markup
capabilities, if present, will be based completely on the scripting
capabilities.
- Notifications from the user agent to the web application should
be in the form of JavaScript events/callbacks.
- For ASR, there must be at least these three logical functions:
- start speech input and start processing
- stop speech input and get result
- cancel (stop speech input and ignore result)
- For TTS, there must be at least these two logical functions:
- play
- pause
There is agreement that it should be possible to stop playback, but
there is not agreement on the need for an explicit stop
function.
- It must be possible for a web application to specify the speech
engine.
- Speech service implementations must be referenceable by
URI.
- It must be possible to reference ASR grammars by URI.
- It must be possible to select the ASR language using language
tags.
- It must be possible to leave the ASR grammar unspecified.
Behavior in this case is not yet defined.
- The XML format of SRGS 1.0 is mandatory to support, and it is
the only mandated grammar format. Note in particular that this
means we do not have any requirement for SLM support or SRGS ABNF
support.
- For TTS, SSML 1.1 is mandatory to support, as is UTF-8 plain
text. These are the only mandated formats.
- SISR 1.0 support is mandatory, and it is the only mandated
semantic interpretation format.
- There must be no technical restriction that would prevent using
only TTS or only ASR.
- There must be no technical restriction that would prevent
implementing only TTS or only ASR. There is *mostly* agreement on
this.
- There will be a mandatory set of capabilities with stated
limitations on interoperability.
- For reco results, both the DOM representation of EMMA and the
XML text representation must be provided.
- For reco results, a simple JavaScript representation of a list
of results must be provided, with each result containing the
recognized utterance, confidence score, and semantic
interpretation. Note that this may need to be adjusted based on any
decision regarding support for continuous recognition.
- For grammar URIs, the "http" and "data" URI schemes must
be supported.
- A standard set of common-task grammars must be supported. The
details of what those are remain TBD.
- The API should be able to start speech reco without having to
select a microphone, i.e., there must be a notion of a "default"
microphone.
- There should be a default user interface.
- It must not be possible to customize the "system is listening"
(microphone open) notification. This is for security and privacy
reasons.
- It must be possible to customize the user interface to control
how recognition start is indicated.
- If the HTML standard has an audio capture API, we should be
able to use it for ASR. If not, we should not create one, and we
will not block waiting for one to be created.
- We will collect requirements on audio capture APIs and relay
them to relevant groups.
- A low-latency endpoint detector must be available. It should be
possible for a web app to enable and disable it, although the
default setting (enabled/disabled) is TBD. The detector detects
both start of speech and end of speech and fires an event in each
case.
- The API will provide control over which portions of the
captured audio are sent to the recognizer.
- We expect to have the following six audio/speech events,
strictly ordered as shown using BNF: (onaudiostart (onsoundstart
(onspeechstart onspeechend)? onsoundend)? onaudioend)?, where
parentheses represent a sequence and ? represents optionality. Note
that every start event must be paired with an end event and that
the *speech* events cannot occur without the *sound* events, which
cannot occur without the *audio* events. The *sound* events
represent a "probably speech but not sure" condition, while the
*speech* events represent the recognizer being sure there's speech.
The former are low latency.
- There are three classes of codecs: audio sent to the web-app-specified
ASR engine, recognition from existing audio (e.g., a local file), and
audio received from the TTS engine. We need to specify a
mandatory-to-support codec for each.
- It must be possible to specify and use other codecs in addition
to those that are mandatory-to-implement.
- Support for streaming audio is required -- in particular, that
ASR may begin processing before the user has finished
speaking.
- It must be possible for the recognizer to return a final result
before the user is done speaking.
- We will require support for HTTP for all communication between
the user agent and any selected engine, including chunked HTTP for
media streaming, and will support negotiation of other protocols (such
as WebSockets or whatever RTCWeb/WebRTC comes up with).
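The six-event ordering constraint above can be made concrete with a small validator. This is purely an illustrative sketch: the event names come from the draft's BNF, but the function and the enumeration of legal sequences are not part of any proposed API.

```javascript
// The BNF (onaudiostart (onsoundstart (onspeechstart onspeechend)?
// onsoundend)? onaudioend)? expands to exactly four legal sequences,
// since each optional group is either wholly present or wholly absent.
const VALID_SEQUENCES = [
  [],
  ["onaudiostart", "onaudioend"],
  ["onaudiostart", "onsoundstart", "onsoundend", "onaudioend"],
  ["onaudiostart", "onsoundstart", "onspeechstart", "onspeechend",
   "onsoundend", "onaudioend"],
];

// Returns true if the observed event list matches one of the legal
// sequences, i.e., every start event is paired with its end event and
// *speech* events occur only inside *sound* events, which occur only
// inside *audio* events.
function isValidEventOrder(events) {
  return VALID_SEQUENCES.some(
    (seq) => seq.length === events.length &&
             seq.every((name, i) => name === events[i])
  );
}
```

For example, a run in which the *speech* events fire without the enclosing *sound* events, such as ["onaudiostart", "onspeechstart", "onspeechend", "onaudioend"], would be rejected.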
3.3.2 Speech Service Communication and Control Design Decisions
This is where design decisions regarding control of and
communication with remote speech services, including media
negotiation and control, will be recorded.
3.3.3 Script API Design Decisions
This is where design decisions regarding the script API
capabilities and realization will be recorded.
- It must be possible to define at least the following handlers
(names TBD):
- onspeechstart (not yet clear precisely what start of
speech means)
- onspeechend (not yet clear precisely what end of
speech means)
- onerror (one or more handlers for errors)
- a handler for when the recognition result is available
Note: significant work is needed to get interoperability
here.
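The handler set above can be sketched as follows. The SpeechReco class, its property names, and the _fire helper are hypothetical placeholders invented for illustration; only the four handler roles come from the group's decisions, and all names remain TBD.

```javascript
// Hypothetical container for the four agreed handler roles.
// All identifiers are placeholders; naming is not yet decided.
class SpeechReco {
  constructor() {
    this.onspeechstart = null; // start of speech detected (meaning TBD)
    this.onspeechend = null;   // end of speech detected (meaning TBD)
    this.onerror = null;       // one or more error conditions
    this.onresult = null;      // recognition result is available
  }
  // Invoke the registered handler for an event, if any.
  _fire(name, detail) {
    const handler = this["on" + name];
    if (typeof handler === "function") handler(detail);
  }
}

// A web application would then register callbacks, e.g.:
const reco = new SpeechReco();
reco.onresult = (r) => console.log(r.utterance, r.confidence);
reco.onerror = (e) => console.error("recognition failed:", e);
```

Unregistered events are simply dropped, which matches the usual DOM convention that missing event handlers are not an error.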
3.3.4 Markup API Design Decisions
This is where design decisions regarding the markup changes
and/or enhancements will be recorded.
3.4 Proposed Solution
TBD after we make substantial progress on the design
decisions.