Workshop on Conversational Applications
— Summary —
18-19 June 2010
Hosted by Openstream, Somerset, NJ, US
On June 18th and 19th, 2010, W3C (the World Wide Web Consortium) held
a Workshop on "Conversational Applications — Use Cases and
Requirements for New Models of Human Language to Support Mobile
Conversational Systems".
The minutes of the workshop are available on the W3C Web server:
http://www.w3.org/2010/02/convapps/minutes.html
The goal of the workshop was to understand the limitations of the
current W3C language model in order to develop a more comprehensive
model. The plan for the workshop was to collect and analyze use cases
and prioritize requirements that ultimately would be used to identify
improvements to the model of human language currently supported by W3C
standards.
Openstream graciously hosted the workshop in Somerset, New Jersey,
providing us with fabulous facilities conveniently located in the
hotel, with an incredible amount and array of food, excellent Internet
access, plentiful power, etc. In short, Openstream provided the
perfect arrangements for this workshop.
The workshop had attendees from Openstream, Conversational
Technologies, Voxeo, IBM, Cambridge Mobile, Redstart Systems,
Loquendo, Nuance, NICT, AT&T, Verizon Wireless and W3C.
The first day was spent on brief presentations of the attendees'
position papers, along with discussion. The presentation topic
sessions were:
- Lexical and Morphological Standards
- Grammars, Semantics, and Natural Language processing
- Architectures and Dialog System Integration
At the end of the first day the presenters were asked to write up
answers to the following:
1. Describe a situation that demonstrates the issue.
2. Describe your implementation.
3. Why were you not able to use only existing standards to accomplish this?
4. What might you suggest could be standardized?
During the second day, we broke into smaller groups to work on
extracting detailed use cases based on the answers to questions 1 and
3.
After combining similar use cases, we then did a rough straw poll to
determine the approximate level of group interest in each use case.
The use cases, roughly in order from most group interest to least,
were:
- Dynamic, on-the-fly activation, deactivation, or combination of any
constrained and unconstrained recognition grammars (SRGS (Speech
Recognition Grammar Specification) or SLM (Statistical Language
Model) grammars) or other recognition constraints. There are
applications that combine open-ended and restricted language in
intelligent conversation, so we need a mechanism to specify how to
combine any recognition constraint. Moreover, we need to be able to
dynamically weight those recognition constraints based on context.
(See the sketch after this list.)
- Applications need to be sensitive to certain (arbitrary,
dynamically extracted) features, e.g. gender, age, etc. Example:
adjust voice, phrasing, etc. based on those features. Current
limitation: the VXML infrastructure only exposes
words/interpretations/confidence in the recognition result. We need a
place to put this information so that it gets transmitted to the
application. (See the sketch after this list.)
- Syntactic Formalism. Today an author cannot create a syntactic
grammar for comprehensive NL (natural language) because the formalism
lacks feature inheritance, POS (part of speech) terminals, concord,
inversion, etc. A new formalism should be created. (See the sketch
after this list.)
- Semantic representation of dialogue state that can include any kind
of data (e.g. history, slot conditions, user models, expectations of
the next system actions). The problem is that the current VXML 2.0
specification does not support a container that can hold multiple
hypotheses of the dialogue state.
- Shared Syntactic Grammars (for simultaneously running
applications): combine recognition constraints when multiple
applications are active simultaneously and transfer focus.
- Some dialogue systems contain discourse and WSD (word-sense
disambiguation) information that could be used to improve spoken
rendering. Example: "record" (noun vs. verb). We need a mechanism to
convey this information between components without having to modify
the categorization of either the dialogue system or the synthesis
system.
- R&D Agility: As we do research we develop new algorithms that need
new information, and we would like to experiment with them before
standardization. We need a reliable mechanism in VXML to carry this
information. Examples: adding location information or new DSR
(distributed speech recognition) signal features. We would like a
standard way to attach vendor-specific recognition result information
that is guaranteed to be passed through to the application.
- Need a way for users to resolve conflicting commands, and a way to
organize, share, remember, and prioritize commands. One solution
might be user configuration for conflict resolution. Today users
can't find commands, adjust them, organize them, or share them.
- EMMA Extension: Multi-source input and corresponding confidence.
In multimodal and more advanced applications, input might come from a
variety of simultaneous sources such as text, speech, GPS (global
positioning system), world knowledge, user profile, etc. For
instance, I might say "I want to go to Denver" and the application
can know from GPS where I am. Perhaps each concept or slot could have
a set of multiple input sources with corresponding values and
confidence scores. (See the sketch after this list.)
- Focus change - users need a way to tell the device how to control
focus. When using a mouse the focus is clear, but not necessarily so
when using speech commands. There's no standard way to do this.
- Users are afraid to make mistakes using speech: users need a way to
undo both actions and text events.
- Interactions between lexicons and grammars don't include additional
information such as POS (part of speech), grammatical features, or
other annotations. Example: it would be nice to annotate a name with
its region (location) to influence pronunciation. (See the sketch
after this list.)
- EMMA Extension for Richer Semantic Representation. We want to be
able to represent the semantics of complex NL. Examples: "give me all
the toppings except onions", or "I want to leave this afternoon or
tomorrow morning to arrive before noon". Current standards can
represent attribute-value pairs, and EMMA supports hierarchy, but
there is no way to specify modifiers and quantifiers that relate
slots. (See the sketch after this list.)
- Phoneme sets: an author can't create a component (app, lexicon, ASR
engine, TTS engine, etc.) that is assured to be interoperable with
other components in terms of the phoneme set. The author should be
able to use and specify a pre-defined standard phoneme set.
- Problem Solving: we want to be able to build applications that
solve complex problems, like help desk problem solving. The call
control logic of such an application cannot today be efficiently
described as a state machine; therefore, available standards (VXML,
SCXML) are insufficient to implement these applications, which may
instead require, e.g., a probabilistic rule engine or a task agent
system. (See the sketch after this list.)
- Morphology engine: today, there is no engine component or formalism
for morphology. This is required to create appropriate replies and
provide a higher level of abstraction for developers and systems.
Therefore a new formalism and engine component for morphology should
be created.
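To make several of these use cases more concrete, a few informal
sketches follow. They illustrate what the current specifications can
and cannot express; any element names, URIs, media types, and values
noted as hypothetical are assumptions for illustration, not anything
defined by the standards.

For the first use case, a VoiceXML 2.1 field can already activate a
constrained SRGS grammar and an externally referenced SLM side by
side, each with a static weight, roughly as sketched below (the
grammar URIs and the SLM media type are hypothetical). What the use
case asks for is a standard way to activate, deactivate, and
re-weight such constraints dynamically as the conversational context
changes.

  <?xml version="1.0" encoding="UTF-8"?>
  <vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
    <form id="travel">
      <field name="request">
        <!-- Constrained SRGS grammar for in-domain commands -->
        <grammar src="commands.grxml"
                 type="application/srgs+xml" weight="1.2"/>
        <!-- Open-ended SLM; the media type is a hypothetical
             vendor type, since SLM formats are not standardized -->
        <grammar src="openended.slm"
                 type="application/x-vendor-slm" weight="0.6"/>
        <prompt>How can I help you?</prompt>
        <filled>
          <log>Heard: <value expr="request"/></log>
        </filled>
      </field>
    </form>
  </vxml>

The weights above are fixed at authoring time; there is no standard
hook for adjusting them, or the set of active grammars, from dialogue
context at run time.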
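For the speaker-feature use case, EMMA 1.0 already provides emma:info
as a container for application- and vendor-specific metadata, so
recognizer-derived features could ride along with an interpretation
roughly as below (the vnd namespace and its element and attribute
names are hypothetical). The gap raised at the workshop is that
nothing requires the VoiceXML layer to pass such annotations through
to the application.

  <emma:emma version="1.0"
             xmlns:emma="http://www.w3.org/2003/04/emma"
             xmlns:vnd="http://example.com/2010/recognizer">
    <emma:interpretation id="int1"
                         emma:medium="acoustic" emma:mode="voice"
                         emma:confidence="0.82"
                         emma:tokens="book a table for two">
      <reservation><party-size>2</party-size></reservation>
      <!-- Hypothetical vendor-specific speaker features -->
      <emma:info>
        <vnd:speaker-features gender="female" age-range="25-40"
                              confidence="0.7"/>
      </emma:info>
    </emma:interpretation>
  </emma:emma>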
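For the syntactic-formalism use case, agreement (concord) in an SRGS
grammar has to be spelled out by enumerating the variants, as in the
fragment below; there is no feature inheritance or POS-aware
mechanism that would let the author state the constraint once.

  <grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
           mode="voice" xml:lang="en-US" root="query">
    <rule id="query">
      <one-of>
        <!-- Singular and plural variants are written out separately
             because SRGS cannot express number agreement between
             subject and verb -->
        <item>when does this flight leave</item>
        <item>when do these flights leave</item>
      </one-of>
    </rule>
  </grammar>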
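The multi-source input use case can be pictured with EMMA 1.0's
emma:group container, which collects related interpretations, each
carrying its own mode and confidence (the "gps" mode value and the
payload elements below are hypothetical). What the use case asks for
beyond this is a standard way to attach several sources, values, and
confidences to a single concept or slot.

  <emma:emma version="1.0"
             xmlns:emma="http://www.w3.org/2003/04/emma">
    <emma:group id="trip-request">
      <!-- Spoken input -->
      <emma:interpretation id="speech1"
                           emma:medium="acoustic" emma:mode="voice"
                           emma:confidence="0.90"
                           emma:tokens="I want to go to Denver">
        <destination>Denver</destination>
      </emma:interpretation>
      <!-- Device position; the mode value and payload elements
           are hypothetical, not defined by EMMA 1.0 -->
      <emma:interpretation id="gps1" emma:mode="gps"
                           emma:confidence="0.95">
        <origin><lat>40.50</lat><lon>-74.45</lon></origin>
      </emma:interpretation>
    </emma:group>
  </emma:emma>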
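For the lexicon/grammar interaction use case, a PLS (Pronunciation
Lexicon Specification) 1.0 entry looks roughly like the sketch below.
A lexeme can list alternative pronunciations, but there is no
standard place to record which region each one belongs to, or to
carry POS or other grammatical features over into a grammar.

  <lexicon version="1.0"
           xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
           alphabet="ipa" xml:lang="en-US">
    <lexeme>
      <grapheme>Houston</grapheme>
      <!-- the Texas city -->
      <phoneme>ˈhjuːstən</phoneme>
      <!-- the street in New York City -->
      <phoneme>ˈhaʊstən</phoneme>
      <!-- no standard attribute exists to annotate the region
           that selects between these pronunciations -->
    </lexeme>
  </lexicon>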
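For the richer-semantic-representation use case, the payload of an
emma:interpretation is application-specific XML, so an utterance like
"give me all the toppings except onions" can only be forced into an
ad-hoc structure such as the hypothetical one below; no standard
vocabulary exists for the quantifier ("all") or the exception
("except onions") that relate the slots.

  <emma:emma version="1.0"
             xmlns:emma="http://www.w3.org/2003/04/emma">
    <emma:interpretation id="order1"
        emma:tokens="give me all the toppings except onions">
      <!-- Hypothetical application-specific encoding; the
           quantifier and exclusion semantics live in names that
           no standard defines or constrains -->
      <toppings quantifier="all">
        <exclude>onions</exclude>
      </toppings>
    </emma:interpretation>
  </emma:emma>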
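Finally, for the problem-solving use case, the difficulty can be seen
by imagining a help-desk flow written directly in SCXML: each
symptom, and each combination of remedies already attempted, needs
its own hand-written state and transitions, which is exactly what
does not scale. The state names and events below are hypothetical.

  <scxml xmlns="http://www.w3.org/2005/07/scxml" version="1.0"
         initial="collect_symptom">
    <state id="collect_symptom">
      <transition event="symptom.no_dialtone" target="check_cable"/>
      <transition event="symptom.static" target="check_handset"/>
      <!-- one hand-written transition per symptom, plus further
           states for every combination of remedies already tried -->
    </state>
    <state id="check_cable">
      <transition event="remedy.failed" target="collect_symptom"/>
      <transition event="remedy.ok" target="done"/>
    </state>
    <state id="check_handset">
      <transition event="remedy.failed" target="collect_symptom"/>
      <transition event="remedy.ok" target="done"/>
    </state>
    <final id="done"/>
  </scxml>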
For the top five use cases, we then brainstormed as a group on new
standards, or extensions to existing ones, that could address each
use case.
The use cases presented above will next be sent to the W3C Voice
Browser and Multimodal Interaction Working Groups, where they will be
reviewed and the groups will make recommendations for changes to
existing specifications and/or suggestions for new specifications.
The Call for Participation, the Logistics, the Presentation
Guideline, the Agenda, and the minutes are also available on the W3C
Web server.
Daniel C. Burnett,
Deborah Dahl,
Kazuyuki Ashimura
and
James A. Larson,
Workshop Organizing Committee