Title: Extending SRGS to Support More Powerful and Expressive Grammars

Authors: Paolo Baggia (Loquendo), Jerry Carter (the Minerva Project), Deborah Dahl (Conversational Technologies)

The Speech Recognition Grammar Specification (SRGS) [1], a context-free grammar format for speech- and DTMF-based applications, has been widely adopted by application developers since its publication as a W3C Recommendation in 2004. Both the VoiceXML 2.1 [2] authoring language and the Media Resource Control Protocol (MRCP v2) [3] require support for SRGS. In combination with a semantic tagging system, Semantic Interpretation for Speech Recognition (SISR) [4], SRGS has enabled grammar portability across deployment platforms and enabled a generation of platform-independent grammar authoring tools. In the six years since its publication, technological improvements, authoring experience, and deployment experience have motivated a number of extensions and updates. A few of these are outlined here.

1. Technical Advances: Speech recognition algorithms, and computer technology in general, have made significant advances since the Recommendation was published. Available CPU, RAM, and hard-drive capacity have improved by at least a factor of 16. Consequently, applications are now possible that would not have been feasible when SRGS was designed. For example:

a. Mixing dictation and grammar-based recognition is now commercially feasible within a single utterance. One use case is an application like dictating email, where the user says "subject" (grammar-based) and "about the meeting" (dictation-based) in a single utterance (a possible grammar is sketched after this section).

b. SRGS can currently represent only context-free grammars. This limitation was originally due to a concern that parsing with more powerful grammars might consume excessive computing resources, especially shared network resources. More efficient speech recognizers and cheaper computing resources make this less of a concern. Context-sensitive grammars can represent more complex syntactic constraints than context-free grammars and, perhaps even more important, they can represent syntactic constraints in a more natural and easily authored fashion. Expressing syntactic constraints can improve efficiency by reducing overgeneration.

2. Experience gained from application development: Applications have been developed with SRGS and other context-free formalisms, and this experience motivates a number of use cases.

a. Efficiency: It would be very useful to support boolean constraints on rules to rule out illegal responses. For example: Caller: "I want to fly from Boston to Boston." (Austin mistaken for Boston). A constraint like "not(toCity == fromCity)" could prune the entire search space in which the two cities are recognized as the same (one possible syntax is sketched after this section).

b. Support DTMF and speech in the same document: There are use cases for mixing DTMF and speech in a single input. Use case: the user says "my PIN is" and then enters 1234 as DTMF (a sketch appears after this section).

c. Allow branches of a grammar to be programmatically enabled or disabled: This would reduce the search space during speech recognition and reduce overgeneration. One example is a form-filling application, where recognizing the name of a field enables only the branch for the appropriate input, e.g. "birth date, January 2004, gender, female, ...". Another example is a form-filling application where children's ages are requested only if the user has indicated that they have one or more children (a sketch appears after this section).
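To illustrate use case 1a, the fragment below sketches one way a future SRGS might reference an open dictation model from within a rule. It is only a sketch: the builtin:dictation URI and the assumption that the dictation resource returns its transcript as its rule result are hypothetical, since SRGS 1.0 defines no standard dictation hook.

  <?xml version="1.0" encoding="UTF-8"?>
  <grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
           xml:lang="en-US" root="email_field" mode="voice"
           tag-format="semantics/1.0">
    <rule id="email_field" scope="public">
      <!-- Grammar-based keyword -->
      subject
      <!-- Hypothetical hook to an open dictation model; no such
           URI is defined in SRGS 1.0. -->
      <ruleref uri="builtin:dictation"/>
      <!-- Assumes the dictation resource returns the transcribed
           string as its rule result. -->
      <tag>out.subject = rules.latest();</tag>
    </rule>
  </grammar>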
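For use case 2a, one possible syntax is a constraint element attached to a rule and evaluated over the SISR result. Neither the constraint element nor its expression language exists in SRGS 1.0; the sketch below simply makes the not(toCity == fromCity) example concrete.

  <?xml version="1.0" encoding="UTF-8"?>
  <grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
           xml:lang="en-US" root="flight" mode="voice"
           tag-format="semantics/1.0">
    <rule id="city">
      <one-of>
        <item>Boston <tag>out = "BOS";</tag></item>
        <item>Austin <tag>out = "AUS";</tag></item>
        <item>Houston <tag>out = "HOU";</tag></item>
      </one-of>
    </rule>
    <rule id="flight" scope="public">
      I want to fly from
      <ruleref uri="#city"/> <tag>out.fromCity = rules.latest();</tag>
      to
      <ruleref uri="#city"/> <tag>out.toCity = rules.latest();</tag>
      <!-- Hypothetical extension element: discard every parse in
           which both city slots resolve to the same value. -->
      <constraint test="not(out.toCity == out.fromCity)"/>
    </rule>
  </grammar>

If such a constraint were evaluated during the search rather than after recognition completes, the recognizer could prune the offending hypotheses early.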
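For use case 2b, SRGS 1.0 fixes a single mode ("voice" or "dtmf") for an entire grammar document. The sketch below assumes a hypothetical rule-level mode attribute that overrides the document mode for one rule.

  <?xml version="1.0" encoding="UTF-8"?>
  <grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
           xml:lang="en-US" root="pin" mode="voice"
           tag-format="semantics/1.0">
    <rule id="pin" scope="public">
      my PIN is
      <ruleref uri="#digits"/>
      <tag>out.pin = rules.latest();</tag>
    </rule>
    <!-- Hypothetical rule-level mode override: this rule matches
         DTMF key presses inside an otherwise voice-mode grammar.
         With no tags, its SISR result defaults to the matched
         token string. -->
    <rule id="digits" mode="dtmf">
      <item repeat="4">
        <one-of>
          <item>0</item> <item>1</item> <item>2</item> <item>3</item>
          <item>4</item> <item>5</item> <item>6</item> <item>7</item>
          <item>8</item> <item>9</item>
        </one-of>
      </item>
    </rule>
  </grammar>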
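And for use case 2c, the sketch below assumes a hypothetical enabled attribute that the platform (for example, a VoiceXML interpreter) could toggle at runtime; SRGS 1.0 itself offers no per-branch activation, so today an author must swap whole grammar documents or maintain multiple variants.

  <?xml version="1.0" encoding="UTF-8"?>
  <grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
           xml:lang="en-US" root="form" mode="voice"
           tag-format="semantics/1.0">
    <!-- #birth_date, #gender, and #childrens_ages are ordinary
         rules, omitted here for brevity. -->
    <rule id="form" scope="public">
      <one-of>
        <item><ruleref uri="#birth_date"/></item>
        <item><ruleref uri="#gender"/></item>
        <!-- Hypothetical attribute: this branch is excluded from
             the search space until the platform sets
             enabled="true", e.g. once the caller reports having
             one or more children. -->
        <item enabled="false"><ruleref uri="#childrens_ages"/></item>
      </one-of>
    </rule>
  </grammar>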
3. More Advanced Natural Language Support: more powerful natural language capabilities would enable other advanced applications. A few examples are:

a. Nomatch reduction: In many applications there are areas of a grammar where variation occurs without semantic consequences for the overall interpretation. For instance, "I would like to fly to Houston" can be expressed in many semantically equivalent ways, such as "I gotta get to Houston". It would be desirable to weight an error in the carrier phrase differently from an error in the semantically critical portion, here the city name (a sketch appears after point 4).

b. Enhanced semantics: the SRGS grammar might provide results to be passed to higher-level classification or semantic analysis modules that extend the current analysis.

c. Enhancements for keyboard input: normalization issues (punctuation, whitespace, capitalization) can be very important here, as can the handling of small typos; otherwise trivial errors would limit the ability of SRGS and SISR to parse the input and build a semantic representation.

d. Support for partial results: ASR techniques allow the generation of partial results while the recognition process is still active. The generation of partial results from an SRGS and SISR grammar should be matched by extensions to the MRCP protocol, and possibly to VoiceXML, to take full advantage of them. This would be useful, for example, in a multimodal application where the user can see incremental results being displayed: the user could cancel a partial result with an obvious error even before he or she has finished speaking.

4. Standard Advances: SRGS would benefit from extensions that better align it with advances in standardization.

a. Internationalization is an ongoing process, and SRGS would benefit from better syntax for language identification (e.g. the IANA Language Subtag Registry), use of Internationalized Resource Identifiers, and support for XML 1.1 [5] when needed.

b. Pronunciation Lexicon: SRGS should use standards where they exist (e.g. PLS 1.0 [6]). SRGS will require extensions to enable better control of syntactic categories, semantic distinctions, and regional pronunciations of locations and proper names.

c. A Standard Semantic Result Format should be available to SRGS authors: the EMMA 1.0 specification [7], for instance, allows the semantic meaning of input from various modalities such as speech, touch, and keyboard entry to be represented in a common format. As SRGS promises wider penetration into mobile devices such as smartphones and tablets, a standard result format would greatly facilitate the integration of separate input channels. One use case is tablet entry where the user speaks and touches the screen in concert: "I need to drive from here to here." (An example result appears after this section.)
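Returning to use case 3a, SRGS 1.0 already lets authors weight alternatives inside <one-of>; what is missing is a way to state that a misrecognition confined to the carrier phrase matters less than one in the semantically critical slot. The accuracy attribute below is a hypothetical illustration of such a control, not an existing feature.

  <?xml version="1.0" encoding="UTF-8"?>
  <grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
           xml:lang="en-US" root="travel" mode="voice"
           tag-format="semantics/1.0">
    <rule id="travel" scope="public">
      <!-- Hypothetical attribute: an error confined to this
           carrier phrase should not, by itself, cause a nomatch. -->
      <item accuracy="low">
        <one-of>
          <item weight="1.0">I would like to fly to</item>
          <item weight="0.4">I gotta get to</item>
        </one-of>
      </item>
      <!-- The semantically critical slot stays strict. -->
      <item accuracy="high"><ruleref uri="#city"/></item>
      <tag>out.toCity = rules.latest();</tag>
    </rule>
    <rule id="city">
      <one-of>
        <item>Houston</item> <item>Boston</item> <item>Austin</item>
      </one-of>
    </rule>
  </grammar>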
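For the multimodal use case in 4c, an EMMA 1.0 result fusing the spoken utterance with two touch points might look like the following. The emma:* annotations are taken from the EMMA 1.0 vocabulary; the application payload (origin, destination, and their coordinates) and its namespace are hypothetical markup invented for this example.

  <emma:emma version="1.0"
             xmlns:emma="http://www.w3.org/2003/04/emma"
             xmlns="http://example.com/nav">
    <!-- One interpretation combining speech and touch;
         emma:medium and emma:mode hold space-separated lists of
         the contributing media and modes. -->
    <emma:interpretation id="route1"
                         emma:medium="acoustic tactile"
                         emma:mode="voice touch"
                         emma:confidence="0.87"
                         emma:tokens="I need to drive from here to here">
      <origin><lat>39.95</lat><lon>-75.16</lon></origin>
      <destination><lat>40.44</lat><lon>-79.99</lon></destination>
    </emma:interpretation>
  </emma:emma>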
Conclusions: The Speech Recognition Grammar Specification has become the leading format for grammar interoperability and satisfies many authoring needs. It is not surprising, however, that technical advancement has led to grander ambitions than can easily be realized with SRGS 1.0. There will undoubtedly be scenarios where SRGS is inappropriate, but we believe that these and other extensions to SRGS will address many of those use cases.

References:
[1] SRGS 1.0, http://www.w3.org/TR/speech-grammar/
[2] VoiceXML 2.1, http://www.w3.org/TR/voicexml21/
[3] MRCP v2, http://tools.ietf.org/html/draft-ietf-speechsc-mrcpv2-20
[4] SISR 1.0, http://www.w3.org/TR/semantic-interpretation/
[5] XML 1.1, http://www.w3.org/TR/xml11/
[6] PLS 1.0, http://www.w3.org/TR/pronunciation-lexicon/
[7] EMMA 1.0, http://www.w3.org/TR/emma/