Title: Extending SRGS to Support More Powerful and Expressive Grammars

Authors: Paolo Baggia (Loquendo), Jerry Carter (the Minerva Project), Deborah Dahl (Conversational Technologies)

The Speech Recognition Grammar Specification (SRGS) [1], a context-free grammar format for speech- and DTMF-based applications, has been widely adopted by application developers since its publication as a W3C Recommendation in 2004. Both the VoiceXML 2.1 [2] authoring language and the Media Resource Control Protocol (MRCP v2) [3] require support for SRGS. In combination with a semantic tagging system, Semantic Interpretation for Speech Recognition (SISR) [4], SRGS has enabled grammar portability across deployment platforms and enabled a generation of platform-independent grammar authoring tools. In the six years since its publication, technological improvements, authoring experience, and deployment experience have motivated a number of extensions and updates. A few of these are outlined here.

1. Technical Advances: Speech recognition algorithms, and computer technology in general, have made significant advances since the Recommendation was published. Available CPU, RAM, and hard-drive capacity have improved by at least a factor of 16. Consequently, applications are now possible that would not have been feasible when SRGS was designed. For example:

a. Mixing dictation and grammar-based recognition is now commercially feasible within a single utterance. One use case is an application like dictating email, where the user says "subject" (grammar-based) and "about the meeting" (dictation-based) in a single utterance (a possible grammar is sketched after this section).

b. SRGS can currently represent only context-free grammars. This limitation was originally due to a concern that parsing with more powerful grammars might consume excessive computing resources, especially shared network resources. More efficient speech recognizers and cheaper computing resources make this less of a concern. Context-sensitive grammars can represent more complex syntactic constraints than context-free grammars and, perhaps even more important, they can represent syntactic constraints in a more natural and easily authored fashion. Expressing syntactic constraints can improve efficiency by reducing overgeneration.

2. Experience gained from application development: Applications have been developed with SRGS and other context-free formalisms, and this experience motivates a number of use cases.

a. Efficiency: It would be very useful to support boolean constraints on rules to rule out illegal responses. For example: Caller: "I want to fly from Boston to Boston." (Austin mistaken for Boston). A constraint like "not(toCity == fromCity)" could prune the entire search space in which the two cities are recognized as the same (one possible syntax is sketched after this section).

b. Support DTMF and speech in the same document: There are use cases for mixing DTMF and speech in a single input. Use case: the user says "my PIN is" and then enters 1234 as DTMF (a sketch appears after this section).

c. Allow branches of a grammar to be programmatically enabled or disabled: This would reduce the search space during speech recognition and reduce overgeneration. One example is a form-filling application, where recognizing the name of a field enables only the branch for the appropriate input, e.g. "birth date, January 2004, gender, female, ...". Another example is a form-filling application where children's ages are requested only if the user has indicated that they have one or more children (a sketch appears after this section).
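To illustrate use case 1a, the fragment below sketches one way a future SRGS might reference an open dictation model from within a rule. It is only a sketch: the builtin:dictation URI and the assumption that the dictation resource returns its transcript as its rule result are hypothetical, since SRGS 1.0 defines no standard dictation hook.

  <?xml version="1.0" encoding="UTF-8"?>
  <grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
           xml:lang="en-US" root="email_field" mode="voice"
           tag-format="semantics/1.0">
    <rule id="email_field" scope="public">
      <!-- Grammar-based keyword -->
      subject
      <!-- Hypothetical hook to an open dictation model; no such
           URI is defined in SRGS 1.0. -->
      <ruleref uri="builtin:dictation"/>
      <!-- Assumes the dictation resource returns the transcribed
           string as its rule result. -->
      <tag>out.subject = rules.latest();</tag>
    </rule>
  </grammar>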
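For use case 2a, one possible syntax is a constraint element attached to a rule and evaluated over the SISR result. Neither the constraint element nor its expression language exists in SRGS 1.0; the sketch below simply makes the not(toCity == fromCity) example concrete.

  <?xml version="1.0" encoding="UTF-8"?>
  <grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
           xml:lang="en-US" root="flight" mode="voice"
           tag-format="semantics/1.0">
    <rule id="city">
      <one-of>
        <item>Boston <tag>out = "BOS";</tag></item>
        <item>Austin <tag>out = "AUS";</tag></item>
        <item>Houston <tag>out = "HOU";</tag></item>
      </one-of>
    </rule>
    <rule id="flight" scope="public">
      I want to fly from
      <ruleref uri="#city"/> <tag>out.fromCity = rules.latest();</tag>
      to
      <ruleref uri="#city"/> <tag>out.toCity = rules.latest();</tag>
      <!-- Hypothetical extension element: discard every parse in
           which both city slots resolve to the same value. -->
      <constraint test="not(out.toCity == out.fromCity)"/>
    </rule>
  </grammar>

If such a constraint were evaluated during the search rather than after recognition completes, the recognizer could prune the offending hypotheses early.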
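For use case 2b, SRGS 1.0 fixes a single mode ("voice" or "dtmf") for an entire grammar document. The sketch below assumes a hypothetical rule-level mode attribute that overrides the document mode for one rule.

  <?xml version="1.0" encoding="UTF-8"?>
  <grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
           xml:lang="en-US" root="pin" mode="voice"
           tag-format="semantics/1.0">
    <rule id="pin" scope="public">
      my PIN is
      <ruleref uri="#digits"/>
      <tag>out.pin = rules.latest();</tag>
    </rule>
    <!-- Hypothetical rule-level mode override: this rule matches
         DTMF key presses inside an otherwise voice-mode grammar.
         With no tags, its SISR result defaults to the matched
         token string. -->
    <rule id="digits" mode="dtmf">
      <item repeat="4">
        <one-of>
          <item>0</item> <item>1</item> <item>2</item> <item>3</item>
          <item>4</item> <item>5</item> <item>6</item> <item>7</item>
          <item>8</item> <item>9</item>
        </one-of>
      </item>
    </rule>
  </grammar>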
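And for use case 2c, the sketch below assumes a hypothetical enabled attribute that the platform (for example, a VoiceXML interpreter) could toggle at runtime; SRGS 1.0 itself offers no per-branch activation, so today an author must swap whole grammar documents or maintain multiple variants.

  <?xml version="1.0" encoding="UTF-8"?>
  <grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
           xml:lang="en-US" root="form" mode="voice"
           tag-format="semantics/1.0">
    <!-- #birth_date, #gender, and #childrens_ages are ordinary
         rules, omitted here for brevity. -->
    <rule id="form" scope="public">
      <one-of>
        <item><ruleref uri="#birth_date"/></item>
        <item><ruleref uri="#gender"/></item>
        <!-- Hypothetical attribute: this branch is excluded from
             the search space until the platform sets
             enabled="true", e.g. once the caller reports having
             one or more children. -->
        <item enabled="false"><ruleref uri="#childrens_ages"/></item>
      </one-of>
    </rule>
  </grammar>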
3. More Advanced Natural Language Support: more powerful natural language capabilities would enable other advanced applications. A few examples are:

a. Nomatch reduction: In many applications there are areas of a grammar where variation occurs without semantic consequences for the overall interpretation. For instance, "I would like to fly to Houston" can be expressed in many semantically equivalent ways, such as "I gotta get to Houston". It would be desirable to weight an error in the carrier phrase differently from an error in the semantically critical portion, here the city name (a sketch appears after point 4).

b. Enhanced semantics: the SRGS grammar might provide results to be passed to higher-level classification or semantic analysis modules that extend the current analysis.

c. Enhancements for keyboard input: normalization issues (punctuation, whitespace, capitalization) can be very important here, as can the handling of small typos; otherwise trivial errors would limit the ability of SRGS and SISR to parse the input and build a semantic representation.

d. Support for partial results: ASR techniques allow the generation of partial results while the recognition process is still active. The generation of partial results from an SRGS and SISR grammar should be matched by extensions to the MRCP protocol, and possibly to VoiceXML, to take full advantage of them. This would be useful, for example, in a multimodal application where the user can see incremental results being displayed: the user could cancel a partial result with an obvious error even before he or she has finished speaking.

4. Standard Advances: SRGS would benefit from extensions that better align it with advances in standardization.

a. Internationalization is an ongoing process, and SRGS would benefit from better syntax for language identification (e.g. the IANA Language Subtag Registry), use of Internationalized Resource Identifiers, and support for XML 1.1 [5] when needed.

b. Pronunciation Lexicon: SRGS should use standards where they exist (e.g. PLS 1.0 [6]). SRGS will require extensions to enable better control of syntactic categories, semantic distinctions, and regional pronunciations of locations and proper names.

c. A Standard Semantic Result Format should be available to SRGS authors: the EMMA 1.0 specification [7], for instance, allows the semantic meaning of input from various modalities such as speech, touch, and keyboard entry to be represented in a common format. As SRGS promises wider penetration into mobile devices such as smartphones and tablets, a standard result format would greatly facilitate the integration of separate input channels. One use case is tablet entry where the user speaks and touches the screen in concert: "I need to drive from here to here." (An example result appears after this section.)
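Returning to use case 3a, SRGS 1.0 already lets authors weight alternatives inside <one-of>; what is missing is a way to state that a misrecognition confined to the carrier phrase matters less than one in the semantically critical slot. The accuracy attribute below is a hypothetical illustration of such a control, not an existing feature.

  <?xml version="1.0" encoding="UTF-8"?>
  <grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
           xml:lang="en-US" root="travel" mode="voice"
           tag-format="semantics/1.0">
    <rule id="travel" scope="public">
      <!-- Hypothetical attribute: an error confined to this
           carrier phrase should not, by itself, cause a nomatch. -->
      <item accuracy="low">
        <one-of>
          <item weight="1.0">I would like to fly to</item>
          <item weight="0.4">I gotta get to</item>
        </one-of>
      </item>
      <!-- The semantically critical slot stays strict. -->
      <item accuracy="high"><ruleref uri="#city"/></item>
      <tag>out.toCity = rules.latest();</tag>
    </rule>
    <rule id="city">
      <one-of>
        <item>Houston</item> <item>Boston</item> <item>Austin</item>
      </one-of>
    </rule>
  </grammar>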
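For the multimodal use case in 4c, an EMMA 1.0 result fusing the spoken utterance with two touch points might look like the following. The emma:* annotations are taken from the EMMA 1.0 vocabulary; the application payload (origin, destination, and their coordinates) and its namespace are hypothetical markup invented for this example.

  <emma:emma version="1.0"
             xmlns:emma="http://www.w3.org/2003/04/emma"
             xmlns="http://example.com/nav">
    <!-- One interpretation combining speech and touch;
         emma:medium and emma:mode hold space-separated lists of
         the contributing media and modes. -->
    <emma:interpretation id="route1"
                         emma:medium="acoustic tactile"
                         emma:mode="voice touch"
                         emma:confidence="0.87"
                         emma:tokens="I need to drive from here to here">
      <origin><lat>39.95</lat><lon>-75.16</lon></origin>
      <destination><lat>40.44</lat><lon>-79.99</lon></destination>
    </emma:interpretation>
  </emma:emma>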
Conclusions: The Speech Recognition Grammar Specification has become the leading format for grammar interoperability and satisfies many authoring needs. It is not surprising, however, that technical advancement has led to grander ambitions than can easily be realized with SRGS 1.0. There will undoubtedly be scenarios where SRGS is inappropriate, but we believe that these and other extensions to SRGS will address many of those use cases.

References:
[1] SRGS 1.0, http://www.w3.org/TR/speech-grammar/
[2] VoiceXML 2.1, http://www.w3.org/TR/voicexml21/
[3] MRCP v2, http://tools.ietf.org/html/draft-ietf-speechsc-mrcpv2-20
[4] SISR 1.0, http://www.w3.org/TR/semantic-interpretation/
[5] XML 1.1, http://www.w3.org/TR/xml11/
[6] PLS 1.0, http://www.w3.org/TR/pronunciation-lexicon/
[7] EMMA 1.0, http://www.w3.org/TR/emma/