Workshop on Conversational Applications
— Summary —

18-19 June 2010

Hosted by Openstream, Somerset, NJ, US


On June 18th and 19th, 2010, W3C (the World Wide Web Consortium) held a Workshop on "Conversational Applications — Use Cases and Requirements for New Models of Human Language to Support Mobile Conversational Systems".

The minutes of the workshop are available on the W3C Web server.

The goal of the workshop was to understand the limitations of the current W3C language model in order to develop a more comprehensive model. The plan for the workshop was to collect and analyze use cases and prioritize requirements that ultimately would be used to identify improvements to the model of human language currently supported by W3C standards.

Openstream graciously hosted the workshop in Somerset, New Jersey, providing us with fabulous facilities conveniently located in the hotel, an incredible amount and variety of food, excellent Internet access, and plentiful power. In short, Openstream provided the perfect arrangements for this workshop.

The workshop had attendees from Openstream, Conversational Technologies, Voxeo, IBM, Cambridge Mobile, Redstart Systems, Loquendo, Nuance, NICT, AT&T, Verizon Wireless and W3C.

The first day was spent on brief presentations of the attendees' position papers, organized into topic sessions, along with discussion.

At the end of the first day the presenters were asked to write up answers to the following:

  1. Describe a situation that demonstrates the issue.
  2. Describe your implementation.
  3. Why were you not able to use only existing standards to accomplish this?
  4. What might you suggest could be standardized?

During the second day, we broke into smaller groups to work on extracting detailed use cases based on the answers to questions 1 and 3.

After combining similar use cases, we then did a rough straw poll to determine the approximate level of group interest in each use case. The use cases, roughly in order from most group interest to least, were:

  1. Dynamic, on-the-fly activation, deactivation, or combination of any constrained or unconstrained recognition constraints (SRGS (Speech Recognition Grammar Specification) grammars or SLM (Statistical Language Model) grammars). Some applications combine open-ended and restricted language in intelligent conversation, so we need a mechanism to specify how to combine any recognition constraints. Moreover, we need to be able to dynamically weight those recognition constraints based on context.
  2. Applications need to be sensitive to certain (arbitrary, dynamically extracted) features of the user, e.g. gender, age, etc. Example: adjust voice, phrasing, etc. based on those features. Current limitation: the current VXML infrastructure only allows words, interpretations, and confidence. We need a place to put this information so that it gets transmitted to the application.
  3. Syntactic Formalism. Today an author cannot create a syntactic grammar for comprehensive NL (natural language) because the formalism lacks feature inheritance, POS (part of speech) terminals, concord, inversion, etc. A new formalism should be created.
  4. Semantic representation of dialogue state that can include any kind of data (e.g. history slot conditions, user models, expectation of next system actions). The problem is that the current VXML 2.0 specification does not support a container of dialogue states that holds multiple hypotheses of the dialogue state.
  5. Shared Syntactic Grammars (for simultaneously running applications): combine recognition constraints when multiple applications are active simultaneously, and transfer focus between them.
  6. Some dialogue systems contain discourse and WSD (word-sense disambiguation) information that could be used to improve spoken rendering. Example: "record" (noun vs. verb). We need a mechanism to convey this information between those components without having to modify the categorization of either the dialogue system or the synthesis system.
  7. R&D Agility: As we do research, we develop new algorithms that need new information, and we would like to experiment with them before standardization. We need a reliable mechanism in VXML to carry this information. Examples: add location information, new DSR (distributed speech recognition) signal features. We would like a standard way to set vendor-specific recognition result information that is required to be passed to the application.
  8. Users need a way to resolve conflicting commands, as well as a way to organize, share, remember, and prioritize commands. One solution might be to have user configuration for conflict resolution. Today users can't find commands, adjust them, organize them, or share them.
  9. EMMA Extension: Multi-source input and corresponding confidence. In multimodal applications and more advanced applications, input might come from a variety of simultaneous sources such as text, speech, GPS (global positioning system), world knowledge, user profile, etc. For instance, I might say: "I want to go to Denver" and the application can know from GPS where I am. Perhaps each concept or slot could have multiple input sources with corresponding values and confidence scores.
  10. Focus change: users need a way to tell the device how to control focus. When using a mouse the focus is clear, but not necessarily so when using speech commands. There's no standard way to do this.
  11. Users are afraid to make mistakes using speech: users need a way to undo both actions and text events.
  12. Interactions between lexicons and grammars don't include additional information such as POS (part of speech), grammatical, or other information. Example: it would be nice to annotate a name with the region (location) to influence pronunciation.
  13. EMMA Extension for Richer Semantic Representation. We want to be able to represent the semantics of complex NL. Example: "give me all the toppings except onions", or "I want to leave this afternoon or tomorrow morning to arrive before noon". In current standards we can represent attribute-value pairs and there is hierarchy in EMMA, but there is no way to specify modifiers and quantifiers between slots.
  14. Phoneme sets: an author can't create a component (application, lexicon, ASR engine, TTS engine, etc.) which is assured to be interoperable with other components in terms of the phoneme set. The author should be able to use and specify a pre-defined standard phoneme set.
  15. Problem Solving: we want to be able to build applications that solve complex problems, like help desk problem solving. The call control logic of such an application cannot today be efficiently described as a state machine; therefore, available standards (VXML, SCXML) are insufficient to implement these applications, which may require, e.g., a probabilistic rule engine or a task agent system.
  16. Morphology engine: today, there is no engine component or formalism for morphology. This is required to create appropriate replies and provide a higher level of abstraction for developers and systems. Therefore a new formalism and engine component for morphology should be created.
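
As an illustration of use case 9, the idea that each slot could carry several input sources with their own values and confidence scores can be sketched as a small data structure. This is only a sketch under stated assumptions: it is not EMMA syntax, the names `SourceValue` and `Slot` are hypothetical, and the values and confidence scores are invented for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SourceValue:
    """One candidate value for a slot, tagged with where it came from."""
    source: str        # hypothetical source label, e.g. "speech" or "gps"
    value: str
    confidence: float  # assumed to lie between 0.0 and 1.0


@dataclass
class Slot:
    """A concept or slot that keeps every candidate instead of only one."""
    name: str
    candidates: List[SourceValue] = field(default_factory=list)

    def add(self, source: str, value: str, confidence: float) -> None:
        self.candidates.append(SourceValue(source, value, confidence))

    def best(self) -> Optional[SourceValue]:
        """Return the highest-confidence candidate, or None if empty."""
        if not self.candidates:
            return None
        return max(self.candidates, key=lambda c: c.confidence)


# "I want to go to Denver": the destination comes from speech,
# while the origin is filled from other sources (values are made up).
destination = Slot("destination")
destination.add("speech", "Denver", 0.82)

origin = Slot("origin")
origin.add("gps", "Somerset, NJ", 0.95)
origin.add("user_profile", "Newark, NJ", 0.40)
```

Keeping all candidates, rather than collapsing to a single value early, would let an application decide later how to weigh one source against another.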

As a group, we then brainstormed, for each of the top 5 use cases, on new standards or extensions to existing ones that could address it.

The use cases presented above will next be sent to the W3C Voice Browser and Multimodal Interaction Working Groups, where they will be reviewed and the groups will make recommendations for changes to existing specifications and/or suggestions for new specifications.

The Call for Participation, the Logistics, the Presentation Guideline, the Agenda and the minutes are also available on the W3C Web server.

Daniel C. Burnett, Deborah Dahl, Kazuyuki Ashimura and James A. Larson, Workshop Organizing Committee

$Id: summary.html,v 1.40 2010/06/30 20:11:46 ashimura Exp $