W3C

Workshop on Conversational Applications
Minutes
18-19 Jun 2010

Conversational Applications Workshop

Attendees

Present - Day1
Paolo Baggia, Loquendo
Rahul Akolkar, IBM
Kurt Fuqua, Cambridge Mobile
Simona Gandrabur, Nuance
Matt Womer, W3C
David Nahamoo, IBM
Siva Yelamanchili, Openstream
Raj Tumuluri, Openstream
Kim Patch, Redstart Systems
Chiori Hori, NICT
Teruhisa Misu, NICT
Venkat Gaddam, Verizon Wireless
Balaji Narayana, Openstream
Kaz Ashimura, W3C
Dan Burnett, Voxeo
Debbie Dahl, W3C Invited Expert (Conversational Technologies)
Michael Johnston, AT&T
Present - Day2
Paolo Baggia, Loquendo
Kurt Fuqua, Cambridge Mobile
Simona Gandrabur, Nuance
Matt Womer, W3C
Raj Tumuluri, Openstream
Kim Patch, Redstart Systems
Chiori Hori, NICT
Teruhisa Misu, NICT
Kaz Ashimura, W3C
Dan Burnett, Voxeo
Debbie Dahl, W3C Invited Expert (Conversational Technologies)
Regrets
Jim Larson
Chair
Debbie Dahl, Dan Burnett
Scribe
Kaz, Rahul, Kim, Debbie, Paolo, Matt, Simona

Contents

Day1:

Day2:


Session1: Welcome

Moderator:
Debbie
Scribe:
Kaz

Welcome and logistics — Raj Tumuluri, Openstream

Welcome from the W3C — Kazuyuki Ashimura, W3C

[ Slides ]

Kaz asked attendees whether it's OK to (1) take pictures for the workshop home page, (2) record the discussion for minute-taking and (3) copy presentations to his USB drive, and got approval.

Introduction to the workshop — Debbie Dahl, W3C Invited Expert

[ Slides ]

Debbie gives a talk on the workshop and asks attendees to introduce themselves.

Everybody introduces themselves starting with Debbie.

Session2: Introduction

Moderator:
Dan
Scribe:
Rahul

dan: this session includes introductory papers, kurt will talk about an overview of conversational systems

Overview of conversational applications — Kurt Fuqua, Cambridge Mobile

[ Slides (TBD) ]

kurt: conversational systems are bi-directional
... for example, recognition and synthesis (for example, question and answer)
... both have common vocabulary, and therefore, may be useful to have a common lexicon
... for example, a number of transformations may be done from a question to an answer
... interrogative -> declarative, append 'no' as vocative, change tense, change plurality
... simple transformations at the parse tree level, complex at string level
... implies that conversation requires a parse tree (syntactic or morphologic)
... need systematic morphological transformations (grammatical feature tags)
... bi-directional and reversible, possible to compose and decompose
... morphology is a shared resource
... several levels of processing (syntactic, morphologic, phonologic / prosodic)
... levels should be definable, programmable and reversible
... Standardization issues: Phonemes vs. Allophones, Standard phoneme set, Standard parts of speech, Standard grammatical tags

The User Context: Aligning System and User Behavior — Kim Patch, Redstart Systems

[ Paper, Slides ]


kim: disabled users as speech testbed
... started using speech reco 16 years ago and found it frustrating in many ways
... the community of disabled users can provide valuable insights into what's needed
... desktop speech interface study: 45% abandonment rate after one year
... complaints: frustration, vocal fatigue, couldn't use in group setting
... same issues for disabled and able users
... need consistent focus and feedback, let users anticipate a problem and solve it, discovery, adjustment, organization and sharing (afterthought: correcting vs. changing)

Session3: Lexical and morphological standards

Moderator:
Paolo
Scribe:
Kim

Information Transfer from Dialogue Response Generation to Speech Synthesis — Dan Burnett, Voxeo (on behalf of Paul Bagshaw, France Telecom)

[ Paper, Slides ]


His use case is this: the dialogue system has information that should be available during speech synthesis.

There are many things you can give the synthesizer -- but some things it
can't figure out on its own -- speech acts, focalization, semantic rules.

There is currently no way to pass that information. There is a role
attribute in SSML, but there is no standard interpretation of what the
role attribute values are.

There is a structure required for the values, but no expressed
interpretation of what they mean.
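
[ For reference: SSML 1.1 provides a role attribute on the token element, but its QName values have no standardized interpretation, which is the gap described here. A minimal illustrative sketch; the "dlg" namespace and its category names are invented for the example. ]

  <?xml version="1.0" encoding="UTF-8"?>
  <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
         xmlns:dlg="http://www.example.org/dialog-categories"
         xml:lang="en-US">
    <!-- the role values below are hypothetical dialogue-system categories -->
    Your flight leaves at
    <token role="dlg:departureTime">seven thirty</token>
    from gate
    <token role="dlg:gateNumber">B12</token>.
  </speak>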


Dialog system, speech generator. Each has its own internal
categorizations.

Each of these has its own classification that is appropriate for the
task it does. The dialogue system is doing certain things, the speech
generation system is doing certain things. Each thinks about the
content it has in its own way.

It must be possible to provide a map between these rather than forcing
either to change.

His request is to not attempt to normalize categorizations or tag
sets. Instead provide a map.

It doesn't need to know whether it's a noun, verb or adjective,
because from its perspective they are all the same.

Discussion:
------------

Kurt: where would you put the attribute?

Dan: that is something the synthesizer already knows how to use -- a
  particular engine knows what it cares about. A dialog system knows
  what it cares about. What's important is that the categorization for the
  dialogue system needs to be made known in a standard. The TTS system
  may sit out. There needs to be a mapping language. You then have the
  complete path from the dialogue system to the speech generation
  system without forcing either one to change the way it thinks about
  the information it has to process.


David:  the map is not bidirectional in this case

Dan: I believe the way he presents it, it is only one direction, because
  it's for speech generation. You could conceive of this concept being
  used in the other direction as well -- maybe. But that's beyond my
  ability to speculate.


Jerry: is the attribute assigned per utterance, or at a finer
  granularity?


Dan: I think you do it at the token level

Paolo: the role attribute was tied to the lexicon -- could it play other
  roles in SSML, e.g. for prosody?

Dan: there is no requirement that it be tied to lexicon

Dan: perhaps it needs to be expanded -- if it can be applied to a
  larger section and not just the token, it could have a larger
  use.

Conversational Lexical Standards — Kurt Fuqua, Cambridge Mobile

[ Paper, Slides (TBD) ]

need for morphology

current systems don't allow us to relate those words even though
there's an obvious relationship between the two

we need a higher level of abstraction -- a way that we can tokenize
words

much more powerful for applications because they can work on a higher
more abstract level perhaps closer to a semantic level

group that those tags mostly for European

plurality shows up in the morphology of some languages and not
other languages.

The great thing about using tags is we let the engine render it,
however that language expresses it

Concord:
----------
if we do that we can automatically get it defined in the parse tree

three components on the lexical level for morphology

a morphology engine carries out / implements the morphology

a formalism is the grammar that represents the morphology

we need to think in terms of handling each of these

Lexicon contents:
------------------
Parts of Speech -- noun versus verb and usage

regular or irregular, how does it form the plural -- sometimes there are
multiple ways of forming plurality

it's complicated -- I don't want to oversimplify it but we can do a
lot in that area

Parts of Speech:
-----------------

very important for computational reasons that we have no more than 15
Parts of Speech
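
[ For reference: PLS 1.0 already allows a role attribute on a lexeme (QName values), commonly used to separate homographs by part of speech, but there is no standard tag set behind those values. A hedged sketch; the "pos" namespace and its values are invented. ]

  <?xml version="1.0" encoding="UTF-8"?>
  <lexicon version="1.0" alphabet="ipa" xml:lang="en-US"
           xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
           xmlns:pos="http://www.example.org/part-of-speech">
    <!-- noun vs. verb readings of "record"; the pos:* tags are hypothetical -->
    <lexeme role="pos:noun">
      <grapheme>record</grapheme>
      <phoneme>ˈrɛkɚd</phoneme>
    </lexeme>
    <lexeme role="pos:verb">
      <grapheme>record</grapheme>
      <phoneme>rɪˈkɔɹd</phoneme>
    </lexeme>
  </lexicon>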

Grammatical features:
---------------------

P: do the languages include Asian languages or Indian languages?

Kurt: it does include many languages. There are
  complications. Prepositions exist in some languages; others are
  postpositional. These terms are broader than the label may represent --
  there are more complications.

Grammatical feature tags
-----------------------

Japanese is very different in terms of plurality and specificity

we need to have feature sets that are appropriate

we need to represent the values of number in different languages --
English has plural or not plural; Arabic has three values because there's the
dual.

Similarly transitivity is expressed differently in different
languages.  There's a lot we can do with that

Phonemes
---------

Google -- mixed

The British English phoneme set is different from the American set; it's a mess,
we need standards

Slappy 

Formalisms

They have to be reversible. It is possible to create reversible
formalisms that are computationally feasible; we could consider that

summary slide: we need to standardize

Discussion:
------------

Simona: about the need for morphological tags -- there are semantic tags,
  but do I still need the morphology tags? I would say no

Kurt: internally in the lexicon you're going to need to know that this
  is a verb and Parts of Speech for other reasons.

David: could you elaborate more on the internationalization? You are
  stating that somehow we can create a way of working across languages
  more easily if these constructs are there. I'm not sure that I see the
  ROI that much. How much help does it provide -- what sort of problem
  does it address?

Kurt: mobile application developers are our primary target. These are
  small teams of people. They don't have a lot to invest in porting
  to 12 languages. How are these folks going to create mobile
  applications that work across the space? The ROI issue is: if you can
  create this in such a way that makes it easy to port these
  applications, now you can have thousands of application developers
  porting to other languages, where otherwise you only have a handful --
  maybe nothing actually gets ported to 20 languages.

Raj: 

Kurt: if the morphology is written once -- a company like Loquendo
  or Nuance writes it for French -- the application developer doesn't have
  to redo it; they use the shared resource. We just have to agree on what that
  standard is. It minimizes the work because it enables reusability at
  the morphology and prosodic level. Even semantic. If we can make these
  components reusable, I think the ROI is there and we can make
  this economically feasible, where currently it's not for most small
  teams.

David: I'm not questioning the value, but in application development most
  time is spent, at least in the speech part of it, on writing the
  grammar, designing the dialogue and a lot of those things. I'm just
  trying to see how these lower-level important key connections would
  map themselves into the higher level. I'm not questioning whether it
  can be done, just trying to push to see how it can be done.

Paolo: is it detailed enough, or do you need more -- to pronounce
  certain words the way you want them pronounced -- is there a gap?
  That is why we were ambiguous at the time.

Kurt: that's why we need the phonemic level. If you don't have phonemes
  represented you can't write the rules; you can only represent things
  at the phoneme level.

Paolo: I suspect they would say we need more

Kurt: internally you change the phonemes to allophones before rendering;
  if not, you have to reverse it and figure out what the phonemes are

Kurt: there's no way in the world that developers can work on the
  allophonic level -- way too confusing

David: it should never show up at the higher level -- the question is why
  it shows up at the higher level. The assumption is that speech
  recognition has the role of taking the speech and giving you a sequence
  of words. Now you have to work on the meaning -- all kinds of
  other things. Why do you need the phonemes?

Kurt: someone has to write the rules for prosody and phonology
  because they interact with morphology.

  I'm analyzing prosody as well -- you need to figure out where the
  focus is, what the inflection is -- you need to be able to get these
  tags so that you can process them and know what someone said.

David: we have always said that it's a small bit of information. I'm not
  saying there's no value to it, but we've treated it as so little
  information -- in the past we have ignored it; in the future, well.

Paolo: what are the use cases behind this description? It would be nice at that
  level to have just semantics

Kurt: if you create a lexicon and a vendor's speech engine, they should
  be able to work together. Different languages -- Swedish. We need to
  agree on what those phonemes are.

David: phonemic output -- especially when you talk about what the
  technology is used for. If you decode it as a given word, you thereby remove
  the value of the human brain, which can actually look at the phonetic
  sequence and say "I understand this word" where the speech recognizer
  can't. That's a scenario that's useful for very limited vocabulary.

Debbie: on the standards process, I had a list of general
  categories of use cases: things that are possible to do now; things
  that are very difficult to do now; things that can be done now but not
  in a standard way; things that can be done now but are not very
  efficient -- taking larger space, e.g. no more than 15 parts of speech
  because of space efficiency. I wanted to bring that out as another
  type of use case.

It's very interesting to distinguish between what the parts are, for
  example emotion -- the names of the emotions -- there are many different
  vocabularies describing them. People can agree there needs to be a
  vocabulary, but they have to think about different ways to
  accommodate different sets of emotions. It might be easy to agree
  that part of speech itself is useful.

It is useful to distinguish between needing a feature and agreeing on the
  values of that feature

Paolo: agree. The point is who wants to do that.

David: a comment on what you said -- standardizing features but not
  being able to standardize the values. How has the W3C up to now
  dealt with this issue?

Debbie: lots of different ways. For example, a feature like emotion
  vocabulary = a URL that points to someone else's vocabulary.

If you can agree on a few features you put those in and say it's
  extensible -- you can add your own. Other things have been tried.

Paolo: Registry -- needs a community behind the registry

Dan: IETF, for example media types. They allow a new media type -- there is a
  specified process that you go through in order to get a new media type
  registered. Whether it's a text format, MPEG format etc., there is a way to
  get it defined. It's planned extensibility -- planned, somewhat
  extensible. It's new for W3C, but it's something we are doing or will
  be doing.

Debbie: it's a good way you can make progress without putting off the
  whole problem. It doesn't solve the complete problem but at least it
  moves you forward.

David: this is a real issue. Because as a developer you want to be
  able to try everything. You need extensibility to the ultimate. Then for
  deployment some kind of standard form would be valuable.

Kaz: that kind of extension -- a registry -- was strongly discussed for SSML 1.1.

Dan: W3C does not like the x-* style because it is not
  standard. Usually that means "put whatever you want, there's no
  standard for this", which is useful: you want to use the standard,
  but you also want to use a format that is proprietary -- okay
  because you're using it within your own organization anyway. It provides a
  way for that. But the registry is the next level. Here's this thing,
  I want other people to know about it so they can use it.


Dan: the use case, representing Paul:

  there is information that the dialogue system has that could be used to
  produce better spoken output, which the speech synthesizer
  cannot determine on its own using current text standards. And there
  should be a way for that information to be passed on.

Kurt: why couldn't you specifically tag that -- it doesn't matter what it
  means as long as you tag it for prosody

Dan:  just a use case -- no solution
  There is knowledge being lost

  it seems to me that that way of passing information requires
  customization -- that information will be interpreted differently
  by every TTS engine. And it will actually require either the
  dialogue system or the TTS engine to understand the other one. So
  this format -- we can use the role attribute, but what information goes
  in there? If it's something generated by the dialogue system then
  the TTS engine needs to understand it.

  There is no way to do that using the system -- whatever parses SSML
  or whatever parses.

  There has to be a contract established between the two about what the
  meaning of that is -- that will require some customization on the
  part of one or the other of those components.

David: I think we are ignoring the role of the technologist. As a TTS
  person -- it's a fine statement that you're making, but how much of
  the internal information of the dialogue system?

Dan: talking about role, there is no way for this information to be
  passed that does not require an explicit contract to be established
  between those two components -- the dialogue system and the TTS
  system -- which will require one or the other of them to make changes
  in what it does.

David's question -- there is a use case that I want to make sure gets
  captured. From the paper:

  Discourse information and information related to word
  disambiguation.

David: some thinking about the problem -- there's this one that
  produces lots of information and there's this one that depending on
  the design could use all of it or some of it -- we don't know. That
  seems an obvious statement -- if this group decides there is a way
  of making it available -- it looks like an extensible methodology so
  put it there and if I'm an engine person I use it or don't use
  it. It's not clear to me that it requires both sides to agree. I
  guess the dialogue side can describe.

Dan: the TTS system cares about the form of the word, the dialogue
  system cares about the function. So if the dialogue system says I
  have all this great information about the function of this word, the
  TTS system will look at that and say that's nice, but all I care
  about is the form of the word, not the function. So I want you to
  give me the form of the word. By doing that you are requiring the
  dialogue system, which only cares about the function of the word, to
  care about the form. Or you enhance the TTS system so it understands
  how to map function into form. The third option is what he proposes, which
  is a standardized way to create a mapping between function and form
  -- that doesn't exist today.

David: that is the conclusion I was arriving at -- that's the piece
  that's missing.

Dan: for maximum interoperability among components you can't require
  any one to understand the other's format.

Raj: which one is he seeking -- standards, interpretation and a
  standard way, a way to pass that information on -- what is the
  mechanism that needs to be standardized. Which aspect?

Dan: there may already be a mechanism for passing the information.

Raj: with prosody information tags you could already today make it
  understood.

Dan: what does the system capture?

Raj: prosody

Dan: the key is whether you're going to require the dialogue system to
  generate that, or whether there is a way for the dialogue system, or
  the person using it, to create a second mapping that does that for
  you, so the dialogue system does not have to be responsible for the
  prosody information for the TTS engine.

David: it's a little more complicated. The dialogue system has all these functional
  tags as output. They are speaking different languages, so you need a map. But
  you might want to argue that an engine needs to be there which takes the
  input and maps it to the output. Standardize at the interfaces. Right
  now it's happening -- each vendor is doing a little bit of that
  mapping right now.

Dan: it may not be simple -- it may be code that needs to do this
  mapping, because of the nature of the problem. Paul thinks we won't agree on how
  to draw those boundaries, because linguists don't agree. The issue
  is the interface.

Session4: Grammar, Semantics and Natural language processing

Moderator:
Rahul
Scribe:
Debbie

Conversational Syntax Requirements — Kurt Fuqua, Cambridge Mobile

[ Paper, Slides (TBD) ]

kurt: want to address mobile market
... different requirements, multiple apps, should have only one ASR to service multiple apps
... want info from one app to be available to other apps, e.g., antecedents for reference resolution
... use case: routing speech input to the right app?
... if user is talking to device
... traditional approach is to explicitly say what app you want, users won't accept this, so there should be a shared environment with a shared syntactic grammar
... need comprehensive grammar to do syntactic analysis, then semantic routing to identify which app is being used
... need to decouple grammar and vocabulary so that grammar can be scalable
... CFG's are not adequate for NL, no concord
... need agreement, features and inheritance
... need grammar, dynamic vocabulary, ... (see slides)

Extending SRGS to support more powerful and expressive grammars — Paolo Baggia, Loquendo / Debbie Dahl, Conversational Technologies / Jerry Carter, the Minerva Project

[ Paper, Slides ]

paolo: mixing dictation and grammar, context-sensitive grammars
... use cases -- context-sensitive grammars, long-distance dependencies, boolean constraints on semantics, enable and disable branches of a grammar, mix DTMF and speech in the same input (see the sketch below)
... more natural language kinds of things, e.g. partial results, pass results on to other levels of analysis
... integration with other standards, IANA language registry, EMMA, metadata like age, gender, emotion, SNR
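
[ For reference on the DTMF-plus-speech point: today a VoiceXML field can activate a voice grammar and a DTMF grammar side by side, but each SRGS grammar carries a single mode, so one input cannot interleave the two -- which is the extension being requested. A hedged sketch; the grammar URIs are invented. ]

  <field name="account">
    <!-- two separate grammars, one per mode; a single input cannot mix them -->
    <grammar mode="voice" src="account-voice.grxml" type="application/srgs+xml"/>
    <grammar mode="dtmf" src="account-dtmf.grxml" type="application/srgs+xml"/>
    <prompt>Say or key in your account number.</prompt>
  </field>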

kurt: what about ISO639-3 language codes?

paolo: IANA does include updated language codes
... will check

Conversational Semantic Standards — Kurt Fuqua, Cambridge Mobile

[ Paper, Slides (TBD) ]

kurt: conversational semantic standards
... need for routing to support unified front end to all applications
... should be language-independent
... need semantic standard, rich, language-independent, inferenceable
...closed-class complete
... open-class extensible
... verbal valency

debbie: seems to be a specific solution

michael: there's a lot of syntax in this semantic representation

kurt: (gives Interling example)
... rich semantics enables proper representation of NL, reduces complexity, allows language-independence

dan: need use cases, not sure what they are
... from syntax paper, I have multiple speech apps running concurrently on a device

kurt: must be reversible

david: need to specify "use case for what", e.g. want a developer to be able to develop apps in multiple languages or port applications
... automatic would be great, but simplifying is good, too

simona: value of abstracting semantics from rendering, e.g. for multimodal rendering, but this use case is very tied to language
... will propose more abstract semantics that's not tied to language

dan: hard to get a standard adopted that covers everything

kurt: no way to implement English auxiliary system in a generalized way

dan: this ties back to multiple speech apps running concurrently use case

paolo: the point is what do you want to do with SRGS, it wasn't meant to be a general grammar formalism, but just a way to constrain recognizer

kurt: now we want to get a parse tree, so why shouldn't we just have one grammar

paolo: SRGS has a logical parse structure that could be exposed
... we should abstract from the solution, because there might be multiple solutions

simona: as soon as you're going after rich semantics you're looking at natural language applications, no causal linking between SRGS and another grammar formalism

kurt: could come up with falsely constrained recognition

paolo: actual speech isn't like a grammar of formal English

david: whatever you introduce as a formal construct you should have a statistical layer to make the system robust
... whenever we introduce complexity, you have to ask the question of the ROI
... you have to show the value, we all agree that we can make things more complex

michael: layers we're talking about here are collapsed in ngram models. you're getting syntax, semantics, and pragmatics all at once from the annotator

dan: thinks paolo/debbie's use case is to use technology improvements

paolo: no standards for SLM

debbie: we used to think that some technologies were impossible that now might be reasonable because of changes in computational efficiency

kim: support for low tech examples, support for user choice is low-tech but extremely useful
... MIT Common Sense database, e.g. "bark" near a tree probably not talking about a dog, common-sense facts, some studies improved speech with common sense but improved perception of speech was even more
... multiple application use case -- clients with CP would like to control entire device with speech, people would like to control device with speech as well as applications

david: should consider anything that can improve conversational applications, not just technology
... can use other things, like common sense, technology,

simona: would this workshop define standard commands for doing things

paolo: ETSI did that for defining simple commands

david: those standards only happen over a long period of time
... as behavior is adopted by people over time

paolo: the accessibility group could promote guidelines for the speech interface

simona: running multiple apps would benefit from standard commands

kim: could come up with a default and allow people to change it
... different users have different requirements

dan: paolo's second use case is to update SRGS to integrate better with new specifications

david: too many use cases, what is missing in the language to support use cases?
... we need to list what is missing to support these things
... should be able to point to an implementation and say where the limitations are

dan: e.g. Chinese people added "tone" element in SSML

michael: use case for interspersing DTMF, could make that more general by saying that you could intersperse multiple modalities

Session4 (continued): Grammar, Semantics and Natural language processing

Moderator:
Michael
Scribe:
Paolo

The Future of Advanced Dialogue Applications — Simona Gandrabur, Nuance

[ Paper, Slides, ]

Simona: Challenge to find Use Cases
... Motivation behind: a shift in applications away from the traditional
... directed-prompt, single-slot style, which creates frustration
... beyond call routing applications
... the future is to address complex problems and successfully integrate other modalities
... Examples: What's my first meeting with Peter?
... negotiation: I want to go to Montreal sometime tomorrow and be back at 6pm
... Problem solving (slide 3)
... dynamic call flow, multi-turn decision logic.
... Deep NLU (slide 4)
... ASR: SRGS insufficient, need generic language models
... I don't want traditional NLU
... rich semantic representation beyond slot-value pairs
... example: quantifiers
... example: relations / frames (tomorrow, but before .. after ..)
... standard ontologies / knowledge-bases
... e.g. Apple has standard, commonly available knowledge bases
... Multimodality (slide 5)
... EMMA: can't wait for it to be a requirement
... abstract semantics: modality-independent, work with common concepts
... e.g. don't care about the modality, I just need a concept
... Abstract functionality: modality-independent; can we define a limited common I/O?
... Multi-source input: speech/text/touch/history (conversation, application, user profile, ...)
... multiple sources of confidence to be combined
... Complexity (slide 6)
... Maximise re-use: ontologies, VXML, grammars
... R&D agility (slide 7)
... easy to convey vendor-specific information
... need to experiment and hack if not present in standard

<claps>

Dan: Mention competition with SmartPhones

Simona: you touch, speech is not needed

Michael: If the function of speech is to select, you can type a few characters and restrict

simona: voice and language have more competition

kurt: you express a desire for an ontology -- do you have something in mind?

simona: in another startup, we built on top of WordNet
... we should have these resources generally available

Extension of SRGS and SISR — Raj Tumuluri, Openstream

[ Paper ]

raj: no legal clearance to speak on details
... describe the application context. Broker solutions
... companies merging and re-merging.
... Normal grammars were not adequate; we need portable SLMs across vendors
... Second requirement: programmatically assign weights, but the spec is deliberately ambiguous
... we need a programmatic way to assign weights
... It is very difficult for us to manage that
... especially in mobile, where speed is important
... There could be other ways, but weights would be fine

<ddahl> the paper that paolo and I gave talked about assigning weights to regions, but I think Raj is talking about assigning weights to specific words as well
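
[ For reference: SRGS 1.0 already lets an author attach static weights to alternatives in the XML form; the request here is for a standard, programmatic way to set such weights (including on individual words) at run time. A minimal sketch of today's static mechanism, with invented grammar content. ]

  <grammar version="1.0" xml:lang="en-US" root="fund"
           xmlns="http://www.w3.org/2001/06/grammar">
    <rule id="fund" scope="public">
      <one-of>
        <!-- weights are fixed at authoring time; there is no standard run-time API -->
        <item weight="2.0">growth fund</item>
        <item weight="1.0">income fund</item>
        <item weight="0.5">money market fund</item>
      </one-of>
    </rule>
  </grammar>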

raj: Reqs: enabling and disabling grammars
... based on the context, disable certain grammars to increase accuracy
... Reqs: intermediate results would be very nice, to start showing results while they come in.
... Last point: personalization
... It could be the choice of the personal vs. the global address book. This is personalization.
... Or "have it louder".
... Acoustic profile: based on that we will change the parameters.
... We looked around; it is difficult to move from vendor to vendor.
... Combine speech with DTMF; it can be another media stream, not just combining DTMF
... Give application developer perspectives to help apply the technology in different contexts.

<claps>

David: are the features on multiple devices or not?

raj: multiple engines

Dan: You give volume as an example
... In VoiceXML 3.0 ...

raj: It is not here today.

dan: Got it.

kurt: Multiple applications, how you deal input?

raj: We identify all the applications (news reader, etc.)
... Depending on what applications are active, there is a policy for what to keep active
... It will activate a grammar, but we might be wrong
... Promotion of grammars, pluggability -- these days those are configurations
... There are things you can keep active ...

kurt: can you extend the number of applications?

raj: You can, but it is not guaranteed.

kurt: Sharing across apps

raj: No

Use Cases and Requirements for New Models of Human Language to Support Mobile Conversational Systems — Chiori Hori, NICT

[ Paper, Slides ]

Chiori: NICT was born in 2006 from the ATR speech group
... 25 years of speech translation (9 speech to speech)
... Target is the travel domain
... Collect speech data, annotate tags,
... Goals of SDS
... accept spontaneous speech and mimic dialog behaviour
... Corpus-based dialog management
... Ours is a general, statistical approach
... Scenario WFST
... Origin-destination FST
... System navigates and asks confirmations and questions
... Spoken Language Understanding side
... Very simple word-classes for the words Kyoto and Tokyo, and origin/destination
... Construct human-human dialogs and use them to construct the WFST
... combine the two WFSTs together
... for spontaneous speech we can combine two: one for speech input and then speech to concept
... see the context, the slot filling
... Problems:
... 1. We need context-sensitive ASR, change the SLM from time to time
... 2. Separation of ASR and NLU
... add weights, label semantic annotations on the ASR result, weight per state
... 3. SLU using WFST

<claps>

michael: Not just a precompiled model, but dynamic change of the model

chiori: Yes everything has state

simona: can go in the VXML

chiori: you will use SLM

michael: related to that, with existing SRGS, when a particular entity is relevant
... as it stands you can have rules that can change

paolo: syntactically you can, but the engine might not support it

simona: agrees

paolo: to simona: there is a need for the higher VXML dialog to communicate a state to change the SLM

simona: yes

kaz: is it implemented?

<kaz> [ actually, I asked more specifically about whether this is not only a research prototype but an actual product/service or not ]

chiori: there are implementations on iPhone for the state-based system; we want to go to spontaneous speech
... Should any SLM be incorporated in the WFST?

simona: can it explode?

chiori: It can be handled; it recognizes 2,000,000 words
... robust, but users can break it by shifting topics

paolo: Chiori suggests there is a need for a standard SLM language to be able to exchange them

michael: <nods>
... Portable SLMs; another piece is another approach to dialog
... it might be difficult to agree, but we might think about a correct interface among the different approaches

simona: targeted to slm

michael: either one

david: you have an idea?

michael: No but it is interesting to discuss
... the problem of topic shift; a lot of weight goes to responding to the question, but more is possible.

chiori: In our case we use back-off, unigram

paolo: n-gram on dialog states?

chiori: yes, we need a history of the dialog to model it
... The user can input anything, but the system never asks the same question
... simple but effective strategy

david: what are the use cases?

kim: I think raj was talking about control; people get frustrated and want to use the commands, whether application-specific or not
... it would be nice for users to have control over the grammar

michael: I can see that; you support many apps and the system keeps answering in the wrong context
... you can switch it off

dan: turn off navigation when you are close to home; then on the highway you need it

kurt: try to track what the user knows, to avoid repeating.

michael: it is an interesting idea for people to give commands to activate or deactivate

david: the problem solver; the question is: what aspect cannot be implemented with existing standards?

simona: the complexity of the state; you miss all the possible states and strategies
... for the description of the states, ..., all the combinations SCXML can put in parallel
... but it can be more complex and not representable in a state chart

dan: you can implement your strategy

simona: I'll use Java

david: do we need a new language or extension of some?

simona: Trivial VXML to play and recognize, then all the interface is on the server

david: for anything that is not directed dialog, VXML becomes just an interface

simona: my point was that I don't think we know how ...
... can I use SCXML? No I can't

david: My problem is that I cannot fill the gap between the standard and the implementation

dan: We are aware of it in W3C. Even V3 makes state-based dialog easier, and also the transition controller
... the only language that we have is SCXML. We might not require its use.
... You might list your information in VXML, but it is still outside.
simona: Why should I split it?

dan: If it can help for high-level understanding, allowing transition controllers to do things in a generic way, people might use it.
... Whether it will be sophisticated I don't know.
... About the gap: there is a big gap to drive state, there is a big gap for rule-based. What I hear is that there is a desire
... not to have to manually manage individual states.

david: states?

dan: Practically speaking, a state is where the grammars are the same.

david: coming back to Simona's problem, I see the following paths. We benefit from separating the complexity of dialog and understanding.
... I think of problem solving; I can't even think about a directed-dialog implementation of figuring out the problem solving.
... IT help desk: no context, go through the whole exercise of extracting info from people who have no idea.
... Today this is a huge exercise to turn into standards.

dan: VoiceXML 3 always starts from the dialog flow presentation framework. It is the view. Not complex problem solving.
... It is there for speaking and listening. There might be more ways.
... That separation might be appropriate: dialog (back and forth), application is generative (world knowledge)
... The last is where we have no standard at all.
david: There is a third piece: in a given node, I want to be able to address multiple slots at the same time.

paolo: sort of -- it can return several slots and fill different fields

david: each intention has more information and intent. Is it allowed?

simona: the same question I put in meaning representation: independent slots with predicates.
... this is not possible today

michael: EMMA is only the container

dan: You can represent words, not support

debbie: for a multi-intent utterance, I see EMMA as very nestable. You could represent them in EMMA, e.g. as a tree structure.
... The problem is how to unpack it in VXML
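
[ For reference: one possible EMMA 1.0 rendering of a two-intent utterance such as "hold my calls while I am at the restaurant" is an emma:group holding one interpretation per intent; the app:* elements are invented, since EMMA only supplies the container, as noted above. ]

  <emma:emma version="1.0"
             xmlns:emma="http://www.w3.org/2003/04/emma"
             xmlns:app="http://www.example.org/app">
    <emma:group id="grp1" emma:medium="acoustic" emma:mode="voice">
      <emma:interpretation id="int1" emma:tokens="hold my calls">
        <app:holdCalls/>
      </emma:interpretation>
      <emma:interpretation id="int2" emma:tokens="while I am at the restaurant">
        <app:location value="restaurant"/>
      </emma:interpretation>
    </emma:group>
  </emma:emma>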

michael: there is a problem of correlating data

david: that is the dialog part
... My experience is that when we try it, we jump out of VXML

michael: There is a hierarchy of levels: the bottom level is a dialog where at any point I can give one piece of information,
... second, you specify several at once; the next level (multi-intent) is a series of separate intentions

simona: the question of complex semantics goes beyond multi-intent

dan: in the current standards we are mixing several things
... what standards offer on input: how the system can know what people say, currently with CFG grammars,
... if you can represent a complex grammar and then map results to meaning in ECMAScript, here there is
... an interesting question on what happens next, which entity decides what happens next
... It is likely not your VXML code. It has to extract the meaning and process it; that is submitted to a server
... to process and say something else. Very little computation is done in VXML.

david: this is dialog

michael: One issue is the capabilities of the dialog system; another is having more sophistication within the way you represent semantic intent
... disjunctions (no sushi today); another one is the ability to maintain multiple hypotheses from
... recognition and understanding (if they said X or Y, and then it must be Y)
... Can we create a system able to go back?

david: I'd like to restate the challenge: when we build directed dialog we enforce simplicity and clarity and we find ways
... to do repairs. When you start to allow multiple slots, then the process becomes very complex.
... It becomes a difficulty of application design, but I still think that when we come to multiple intents/slots, the standard
... grammar we have is not sufficient. At least in our experience: how grammars combine, how to assemble things together, ...

simona: I'm not sure that it is easily handled, and I'm not sure we can represent the output.

dan: This is EMMA. Can an author effectively describe the language model to do the capturing of the semantic information?

simona: is there a powerful way to represent it?

david: This is our experience in last few years applications.

dan: These two questions are coming down to the use cases.

simona: let's take a complex sentence:
... e.g. " Can I leave today in the afternoon or tomorrow morning, and arrive before 3 pm?"

michael: For the part of your grammar that recognizes city names you can use SRGS; for more you go to an SLM; then for multi-intent, who knows?
... For dealing with utterances with complex semantics there are no good models
simona: there are three levels

dan: Is it not representable in EMMA?

michael: that is how to represent it in XML, not EMMA.
... maybe how broke SRGS will become.

simona: yes

debbie: EMMA might provide more guidelines on how to represent.

dan: If it is XML it will be transported.

michael: There is no construct to represent ambiguities

dan: We will need to come back to this and find the use cases

<kaz> david: we need to clarify: scenario -> implementation -> issue

david: I was not aware of what to expect. Also the presenters themselves should clarify use cases because they know about their idea best

dan: proposal to find more time to think on use cases

david: four questions each presenter should address
... 1. Use scenarios
... 2. This is my implementation
... 3. What standards I used or where they failed (limitations of standards)
... 4. Recommendations on extending standards
... Suggest that in Session 6 we take one scenario and do 1/2/3/4

<kaz> [ afternoon break ]

Session5: Architectures and Dialog System Integration

Moderator:
David
Scribe:
Matt

Conversational Architecture Requirements — Kurt Fuqua, Cambridge Mobile

[ Paper, Slides (TBD) ]

kurt: When I talk about architecture here, I'm not referring to whether it's in the cloud or on device, etc.
... Single sentence means a sentence that applies to multiple applications
... Developers won't be able to develop this level of applications, so semantics are required

Each of the levels of engines passes information to the next. The inferencing level passes a meaning representation. Syntactic to morphologic would be a parse tree. And morphologic to prosodic would require some transfer too.

David: What do you mean by multiple applications?

Are they just there or are they spawned?

kurt: I mean they are concurrently active. Probably an agent listening to the entire conversation, which then invokes the right one. The agent would have to listen to everything though and convey the semantics to the next step.

Simona: The agent would become huge, having to track everything plus history, etc., though

kurt: I don't know how else to do it.

Simona: How about each app keeps track of its own needs?

David: We'll come back to this during the question section.

Beyond the Form Interpretation Algorithm — Rahul Akolkar, IBM

[ Paper, Slides ]

Rahul is next with "Beyond the Form Interpretation Algorithm"

rahul: Browser based, markup driven multimodal applications reusing existing Web developer paradigms.

VoiceXML 2.1 had the FIA, but any complicated control of the dialog probably required dynamic generation on the server side and not in VoiceXML.

VoiceXML 3 requires richer dialog management within the browser itself, in combination with other modalities, and to try to align the voice and visual webs.

rahul: Some simple use cases:

Visit form items in reverse document order -- FIA can't handle it

XHTML + VoiceXML

using just native VoiceXML 3.0 constructs and eventing.

rahul: The basis for the richer dialog management is recursive MVC or as we sometimes call it DFP.

We have transition controllers, which exist at multiple levels of dialog granularity and provide localized control over transitions.
We use DOM events as the glue. This gives us well defined integrations as well as application level integrations.

[[example of a TC written in SCXML within a v3 form using DOM events to transition to another state ]]

[[ compound document example that mixes XHTML and VoiceXML using DOM events as glue ]]
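
[ For readers unfamiliar with SCXML, a minimal sketch of a transition controller of this kind (not the example from the slides); state names and event names are invented for illustration. ]

  <scxml xmlns="http://www.w3.org/2005/07/scxml" version="1.0"
         initial="ask_destination">
    <state id="ask_destination">
      <!-- move on when the destination field reports it has been filled -->
      <transition event="filled.destination" target="ask_date"/>
    </state>
    <state id="ask_date">
      <!-- the controller, not the FIA, decides the order; it can go back -->
      <transition event="user.goback" target="ask_destination"/>
      <transition event="filled.date" target="done"/>
    </state>
    <final id="done"/>
  </scxml>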

Discussion

David: Opening for questions now.

Michael: This multiple application sentence bits can be looked at in a few ways: you're interacting with many applications, or you're interacting with just one.

kurt: Nothing in my presentation was about this agent, I was just talking about how I might implement it based on a question.
... You don't know necessarily what you're going to be facing as the 'agent' in this scenario, you are going to be loading things dynamically.

Michael: That's one way you could do it, or each application could have its own grammar that's for that app, and then there's a meta level that handles data exchange with each application.

kurt: The complexity comes from tracking the conversation. We have to keep track of what was said. These rules are language specific.
... The only way I know how this would work would be a rich lexicon.

Simona: So what if we had a requirement on each application that it have a mode where it doesn't act on input, but returns a score about how relevant parts of the utterance are. Then there'd be agent-based AI programming that decides which bits go where.
... All applications are running all the time, with each one maintaining its own history.

David: You've suggested a solution, but I want to clear up the problem.
... It looks like each application has business logic, some data, a grammar/dialog/interface for the upper layer/agent, then there's a global application layer on top.
... The fact that the application layer goes and calls the interface on the telephone @@.
... It's similar to call routing.
... Looking at it that way, where is the problem? Why can't you do it with current standards?

kurt: Each of the apps are written by different companies.

David: But what about them is written by a different company?

kurt: Yes.

David: Then this is the tricky part. If I'm at the top, I almost have to import the grammars of each.

Simona: No! If you have to do that you're doomed. We don't have a standard currently that has a "read this input, tell me what you can handle".

David: I understand, but there are complexities. If we can abstract it out, the problem is that "grammars exist, you're having difficulties seeing inside and then you want a layer on top that has to make them work in unison" -- that's the problem.

burn: The assumption is that the higher level agent can do this for any set of apps.

kurt: They're all written for English say, and now there are three parts that have to be aggregated.

David: Kurt's case is not dynamic, his is twelve apps that he knows about.

Kurt: We have to aggregate the vocabularies of each app at load time.

The grammars should be derived from the same grammar. Say, a master grammar of English that each sub-grammar is derived from.

(Also a dynamic lexicon at load time too.)

Then there's the dynamic routing.

I'm assuming one recognizer running, not 12.

Kim: Example?

kurt: "Hold my calls while I am at the restaurant?"

burn: What do you envision having to be done to add "hold my calls, except for my wife"?

matt: Was it clear to everyone that the example means "Hold my calls" goes to the phone and "at the restaurant" talks to say, the GPS app?

kim: Yes.

Michael: This seems like a very difficult/impossible problem to solve.

David: You almost need to have something like SOA that describes to the outside world information about its behavior.
... Now once you have this information available, can you imagine being able to snap all of this together?
... Without the first part, the knowledge about the application, this can't be done. After that, to make it dynamic, to reconfigure itself, that's a very hard problem.

Raj: Let's make it less nebulous. Let's use a limited example.
... What we do is each app defines a top level grammar, not everything. That grammar gets promoted to the top level at different times.
... Otherwise it's NP hard.

Kurt: No, it's not, we are doing it.

David: I don't know how you touch that for dialogs.
... I think Raj is saying you know what you're starting with, and that's fine.

burn: If the agent is written with a mechanism to say "the agent will only talk to these 12 apps" that's different.

Raj: So, make it generic. Say, you add another app, it could be plugged in.

kurt: Dan asked if I'm talking about a generic framework for any app. I am saying yes, as long as the grammar is a subset of the master grammar and the @@ is a subset of the lexicon.

David: How do you do that? That says you know the union.

burn: He has stated that his universe is all of English.

kurt: No, but a large subset, that's fairly comprehensive.

David: I want to talk about two scenarios: what it is and how complex.
... Let's set difficulty aside and clearly define the problem.

kurt: The problem isn't 12 vs 200 node grammar, but whether the grammar is reversible and @@. Right now SRGS isn't sufficient as it doesn't allow me to do inversion, it's got hardcoded ?? in it, etc.

ddahl: In the scenario, let's talk about this: everything is configured, etc, an utterance comes in, what is given to the apps to do their job?

kurt: all register themselves at startup, then there's a merged grammar that has been dynamically configured, but other than that it's just an ordinary grammar.
... Once the parse is done, I've got a parse tree.

ddahl: Before any app works, we already have a parse tree.

paolo: How general is the grammar?

kurt: I don't want an over generative grammar. I want tags that let me say things like: I only want a transitive verb, or a noun, etc. That makes the universe much smaller and a much more tractable problem.

ddahl: So, once you have a parse tree, what happens?

kurt: You take the parse tree, transform it step by step and convert it into Interling.

ddahl: This happens before the apps are working?

kurt: Correct.
... I have a comprehensive grammar for English, then a translation grammar that translates the parse tree to Interling. Then I have semantics for determining routing.

burn: Where do the semantics for routing come from?

Kurt: The apps tell me.

Simona: Based on score?

Kurt: No, logical.

David: I'm back a few steps. Someone says something. The ASR comes up with the "top end text".
... Just to do that, are you saying any intelligence is going to be applied at this level to the recognizer as to what the scope is? 2 or 3 apps active, or do you just have a huge grammar or language model that you've built and the recognizer is on its own to give you the text?

kurt: It's not huge, it's kind of small actually. The great thing is that we can hone our grammar because we can have tags and the formalism for this means it's not overgenerative.
... If we have a formalism that allows us to write comprehensive grammar. I need a start mechanism, tags, etc.
... The only reason I am not using SRGS, is because I don't have those things.

David: If we're talking text, what would your issues be?

kurt: If I did it, I couldn't run against SRGS.

burn: He doesn't expect any restraints. Effectively you "or" those grammars together.

David: For the reco part, the union of the grammars are put in there and the ASR goes to work and gives you 1 or the n-best.
... The reco isn't the issue.

kurt: I can do the unionization, I just need a better formalism.

Michael: On one hand we're talking about multiple apps, but on another we're talking about more complex speech, handling @@ and ellipses, etc.
... I've never seen that done well within a single domain, much less over more than one.
... Being able to do that, then also adding being able to add a dynamic app in, I can't imagine it's done.
... For instance Siri, it fuses many apps. Tries to juggle between a few apps. There's a little anaphora. Imagine that app, if all the underlying things are under that.

David: Siri is in VoiceXML?

Simona: No, no no...
... Siri builds all its own building blocks. Making it then work for any arbitrary speech app.

kurt: But we're doing that now, we're forcing people to use SRGS, we can improve it.

burn: I can see for combinations, that we have multiple grammars or-ed together.
... But your example was overlapping bits of grammars. Neither of the or-ed grammars would match.

kurt: Exactly, that's my point. That's why you can't leave it up to the apps themselves to write their own grammars.
... Yet, they've got all the functionality to implement it.

Michael: In terms of standardization, I'd like to see some existence proofs to better understand how they might interoperate. Those individual pieces are not there yet.

David: That's why I jumped on Siri, that's the lowest denominator of the things we've discussed. Setting aside the quality, from the standard point of view, that's a good example to figure out what standards they could have used, what was missing etc.
... Anything else takes it to the next level of complexity.

David: If I'm designing an application that has to do with time, and someone else is doing an app that has something to do with time, then how do you resolve that conflict?

kurt: Those are all post-parse problems, those are complex, but I'd like us to have a formalism that allows us to parse.

Simona: Another example: deep NLU cannot be done only by parsing your text. You need prior probabilities, data, profile, in order to choose amongst all the possible parses, which may be full or partial, to choose which is appropriate given context, etc.
... I can't make the decision purely on the parse tree.
... I'll parse it, match it with my data and then decide.

kurt: But you wouldn't parse it?

Simona: Yes, but even the parse could be biased by the data.

David: I think the area of the problem you're proposing is real, and going to become more and more real as deployment happens.
... The problem is real, and it has many layers of complexity.
... Based on this feedback, maybe you could make it more concrete in different layers.

matt: I think he's suggesting that we start with the parse formalism, right?

kurt: Yes.

David: But the pushback then is how does it fit in the bigger picture.

burn: You've said it's or-ing them together, and then you've said it's not just that.

kurt: If they're both derivative of the master grammar, then it works.

Michael: You might want it such that the lexicons can be unioned together.

kurt: We could do it with SRGS, but it doesn't handle dynamic stuff and isn't scalable.

David: Is this statement that SRGS isn't dynamic clear?

burn: Dynamic and scalable, I don't know what those words mean within the context of SRGS.

Michael: If we want to dig into it, we should separate between grammatical categories for example. SRGS is a formalism for writing rules and expressing things about sequences. There's nothing in there dictating rule names.
... It is the case that the things Kurt would want to do is capture agreement on those. The only way to do that in SRGS is to extend the atomic names of rules. @@

kurt: 15 rules.

Michael: If you wanted to distinguish between say, singular and plural nouns, then you'd have to have a grammar with n singular nouns listed and then n plural nouns listed, etc. I don't think you'll ever get anywhere trying to define the categories in a universal way.
... That would involve, even in the world of rule based grammars, to get people to agree and accept.
... I think we could define in SRGS a mechanism beyond atomic categories into more complex categories.
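
[ To make the point concrete, this is roughly what agreement looks like in plain SRGS today: the categories are baked into separately named rules; grammar content invented for illustration. ]

  <grammar version="1.0" xml:lang="en-US" root="np"
           xmlns="http://www.w3.org/2001/06/grammar">
    <rule id="np" scope="public">
      <one-of>
        <!-- agreement is enforced only by duplicating the noun rules -->
        <item>this <ruleref uri="#noun_sg"/></item>
        <item>these <ruleref uri="#noun_pl"/></item>
      </one-of>
    </rule>
    <rule id="noun_sg">
      <one-of><item>flight</item><item>ticket</item></one-of>
    </rule>
    <rule id="noun_pl">
      <one-of><item>flights</item><item>tickets</item></one-of>
    </rule>
  </grammar>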

David: I think Kurt's problem is real, but I think most of us feel the problem is complex and difficult. I would ask that you figure out how to remove many layers of complexity.
... I would need it to be a lot more specific in order to see the real problem we are talking about.
... Capture a simple example and address it tomorrow in a track.

kurt: I'm glad to do that and I'll have it ready for tomorrow.
... To be clear, we're talking about extensions to the semantic formalism.

David: Yes, but make sure it doesn't bring in telling application developers to conform for instance. Make a lot of assumptions, you are Siri for instance, putting together four applications, what has to be done, what extensions, what new languages have to be introduced?

burn: Better, just two.

David: Regarding Rahul, I don't see the scenario.

What is the new scenario that would require a change?

Rahul: Primarily in the way VoiceXML does two things:
... Its approach to the order in which presentation elements manifest themselves (at any granularity). What is the component that is controlling that layer?
... The second part is the cases where you want to use some of the "events" out of the speech modality in other modalities in meaningful ways; then you need an eventing model that other modalities can listen to.

burn: Why do we need a component?

David: Isn't the answer the browser?

rahul: At the programming model level, how do you describe it? Dialog manager is this nebulous term. There is some level of intelligence being used for what information you want to collect. There's no reason that logic should have to reside on the server.
... The use case is developing richer applications without having to rely on the server side.

burn: That's a use case: "applications with richer, more flexible selection of presentation units within the language itself"

David: But I keep hearing from you a division of role between server and client.

rahul: This is about enabling some intelligence to reside on the client, rather than within the FIA or the server.

David: Is it available today? Or in 3.0?

rahul: The target is 3.0, not available today.

David: So there is a need for the logic on the client and 3.0 will have it. Is there anything else that this would require?

rahul: There is no magic here. There's a lot of inherent complexity in what the user is saying, and someone has to figure out what that means, and how you fill slots based on the utterance you receive. Most of that isn't covered in my paper.
... Assuming you could take what the user said and translate it into a given number of slots, then I can provide you with greater flexibility without going to the server.

David: So you don't see any specific short coming of the list of items that were put up this morning?

rahul: yes, in a way they are orthogonal.
... SISR for instance -- I assume the semantic interpretation is just done.

David: Since you are moving some aspect to the client, then what are the dependencies?

rahul: If all you are using VoiceXML for is a thin shim to play a prompt and retrieve a response, then with these improvements you will be able to generate a more sophisticated VoiceXML dialog.

ddahl: I think your proposal is saying that the language that needs to change is VoiceXML, and that the changes are already on the path to support what you want to do, then the interesting question is if there is something else that VoiceXML needs that isn't already on the path?

matt: I think we're not actually supporting some of the use cases that rahul mentioned. I don't think what is already planned adequately serves the use case of bringing the development model in line so that common web developers will be able to develop voice web apps.

ddahl: You're saying that what we've described should be a template for how to describe a use case. You've tried to implement things, the flexible dialogs you had in mind, and you had to do something else using server logic, etc. That's exactly the kind of thing we're saying motivates changes to the standards.

rahul: Yes, that's part of the difficulty again: once you head over to the server side and you don't know what it's doing, it's even more difficult to compose multiple pieces together. Who knows what's in there? It's a black box.

rahul: If you have two VoiceXML documents authored using TCs, then you could envision making sense of them and being able to combine them.

burn: You listed two use cases, one of which we worked on in detail.
... One of the use cases was specifically about combining XHTML and VoiceXML. That use case I understand. Is that a sufficient description of a use case for people here?
... We want to make VoiceXML documents that can interact with XHTML documents -- that's the use case at a minimum.

David: And that integration is at an event level.

burn: That's just one possible way to do it and what we are pursuing.

rahul: And my proposal is that that is the right way to do it.
... For example, the XHTML + Voice pizza example. As you say things while ordering a pizza, they appear on the screen, etc.
... There was magic involved in X+V; there was no real contract between the XHTML page and the voice bits.
... A VoiceXML 3.0 presentation element can be activated by receiving an event. And in the other direction, when a form field is filled it could now generate a completion event that could be listened for by any of its ancestors, which can take whatever action they need to take.
... Could be any markup, doesn't have to be XHTML, could be SVG or something.
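
( For illustration only: a rough sketch of the contract Rahul describes. The VoiceXML 2.0 elements below are real; the idea of a field-filled event that bubbles to the host markup is the proposal under discussion, and the event name in the comment is invented for this example. )

[[
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="pizza">
    <field name="topping">
      <prompt>Which topping would you like?</prompt>
      <grammar src="toppings.grxml" type="application/srgs+xml"/>
      <filled>
        <!-- today: the value is handled here, inside the voice dialog -->
        <!-- proposed: filling the field would also raise an event
             (e.g. a hypothetical "topping.filled") that ancestor markup,
             whether XHTML, SVG or something else, could listen for -->
      </filled>
    </field>
  </form>
</vxml>
]]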

David: Both of these ideas have been in the community for a few years and are being pursued as we speak.
... So that's good, but is there any badness?

rahul: Apart from the time it's taking? No.

David: Not minimizing your paper, but I think your stuff is in good hands right now.

[ short break ]

Session6: Planning session for the topics for Day2

Moderator:
Dan
Scribe:
Kaz

dan: A question: is anybody not sure what the 4 template questions mean?
... if anybody is not sure, we can walk through one example
... let's pick one example
... e.g. the one from Paolo and Debbie
... or the one from Simona

( take Simona's presentation as an example )

( Dan puts "Sample use case template for each presenter to use" as the title)

David: 1. Describe a use scenario:
... The problem of the IT Help Desk.
... Person says "My laptop is broken. Help me fix it."
... 2. Describe your implementation:

raj: points out something

debbie: interoperability is out of scope here

david: use scenario doesn't mean "user interaction"
... use scenario for each person's use case

simona: 2. Describe your implementation: Interfacing with a rule-based diagnostic engine.
... possible diagnosis
... we don't describe the (diagnosis) logic within VoiceXML
... that's one of the problems

dan: put the following:
... 2. Describe your implementation:
... A rule-based diagnostic engine is used to do conversational logic, with VoiceXML used only as a means of playing audio to the caller and returning a recognized result.

simona: fine

dan: and put the following:
... 3. Why were you not able to use only existing standards to accomplish this?

simona: optimizing strategy

david: we couldn't see a direct methodology
... "it's not broken"
... the listener may say "maybe the problem is something else"
... it's an issue of dialog management
... the question is about the semantic interpretation as well

simona: the system pinpoints the cause based on the dialog exchange

david: too many situations

michael: interaction of dialog logic
... what is the underlying problem?
... the first thing to do is to collect symptoms
... e.g., I need to ask another question to clarify the situation

simona: yes, need to ask further questions to pinpoint the problem
... we cannot do that with the FSM which VoiceXML currently provides

dan: put the following:
... 3. Why were you not able to use only existing standards to accomplish this?
... This is an issue of dialog management.
... We could not determine how to encode in VoiceXML the reasoning needed to work towards identifying and aligning symptoms with a solution. The best approaches to this are data-driven and probabilistic in nature.

( dan asks if the above is ok )

simona: ok

( dan moves forward to item 4 )

dan: put the following:
... 4. What might you suggest could be standardized?
... I don't know
... there are lots of non-commercial systems in research institutes etc.

david: the problem is...
... I have to iterate questions to clarify the problem

dan: ok

( go to the written-up description )

dan: there are many customer dependent approaches

david: we needed to put complicated knowledge into the system

dan: starts to add some description to item 3 (= the issues)

michael: there are a lot of technologies involved even in the kitchen-sink problem

david: complexity grows
... even this case is beyond multi-intent multi-slot

dan: (updates the description a bit)
... In addition, it was not clear how to map within VoiceXML from the output of the natural language piece (e.g., SLM + robust parsing + statistical semantic models) to the reasoning engine. We could not represent any aspect of the state other than just the specific slots filled during recognition

david: we could not do those things

dan: not clear how to map this within VoiceXML

david: not clear how to implement with existing standards

michael: it's not coming from just one place

dan: is this (=the description on the screen) sufficient as an example?

michael: one question is:
... would we get a benefit from having an extension to the standard to make it easier to maintain the set of hypotheses?
... not of recognition hypotheses
... but hypotheses as to what is underlying...

dan: will send out the written-up text

( Dan will send the written-up text to Kaz, and Kaz will distribute it to the attendees )

Sample use case template for each presenter to use:

1. Describe a situation that demonstrates the issue:

The problem of the IT Help Desk. Person says "My laptop is broken. Help me fix it."

2. Describe your implementation:

A rule-based diagnostic engine is used to do conversational logic, with VoiceXML used only as a means of playing audio to the caller and returning a recognized result (a minimal sketch of this thin VoiceXML layer follows the template).

3. Why were you not able to use only existing standards to accomplish this?

This is an issue of dialog management. We could not determine how to encode in VoiceXML the reasoning needed to work towards identifying and aligning symptoms with a solution. The best approaches to this are data-driven and probabilistic in nature. In addition, it was not clear how to map, using existing standards, from the output of the natural language piece (SLM + robust parsing + statistical semantic models) to the reasoning engine. We could not represent any aspect of the state other than just the specific slots filled during recognition.

4. What might you suggest could be standardized?

I don't know.
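
( For reference, a minimal sketch of the "thin shim" pattern from item 2: VoiceXML only plays a prompt, collects one recognition result, and submits it to a server-side diagnostic engine. The grammar file and URL below are placeholders, not from any presenter's system. )

[[
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="ask">
    <field name="symptom">
      <prompt>Please describe the problem with your laptop.</prompt>
      <grammar src="symptoms.grxml" type="application/srgs+xml"/>
      <filled>
        <!-- all diagnostic reasoning happens server-side;
             VoiceXML only ferries the recognized result back and forth -->
        <submit next="http://example.com/diagnose" namelist="symptom"/>
      </filled>
    </field>
  </form>
</vxml>
]]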

dan: gives instructions for the exercise
... please prepare for the exercise with your own ideas

(Dan asks attendees for their own use cases)

paolo: Cover SLM inside the standards

rahul: Browser-based dialog management
... Interaction with other modalities in the browser

kurt: Syntactic formalism
... Feature inheritance and concord
... Start/resume mechanism
... Non-increment mechanism

simona: Problem solving
... Richer semantic representation
... Abstract semantics
... R&D agility

david: Integrated conversational interface to multiple applications

raj: Programmatically assigning weights, enabling/disabling parts of grammars

kim: Speech commands are hard to remember
... It's difficult to get users to correct speech commands
... It frustrates people when the computer gets things wrong that they would not get wrong
... People who can't use their hands need to be able to use speech for all input

chiori: Manipulation of multiple ASR and spoken language understanding hypotheses based on dialog context, for dialog management

michael: Grammars or models that support parsing and understanding of inputs distributed over multiple modes

debbie: Being able to say something that includes a grammar-recognized phrase as well as SLM
... Being able to take into account characteristics of the user such as age, gender and emotional state in making dialog decisions

paolo: Pass features of different kinds to TTS, ASR where needed

kurt: Reversible morphology engine and formalism
... Grammatical tags

debbie: Dictation

Rough list of use cases to discuss on Day2:

(dan explains what to do tomorrow)

dan: not how you solve it, but why you need to create new standards
... e.g., Chinese participants wanted to add a <tone> element to SSML
... but it would not have been appropriate to just create a new <tone> element
... people tend to solve their problems in their own ways...

debbie: e.g., there are many possible solutions when I'm tired :)

dan: use case is "You're tired"

david: go to bed could be a solution

[ Day1 ends ]

( Reception + Dinner !!! )

Session7: Breakout sessions

Breakout discussions by the following teams

Team A:
Paolo, Kurt, Dan (Spokesperson=Paolo, Kurt)
Team B:
Debbie, Raj, Kim, Matt (Spokesperson=Debbie, Raj, Kim)
Team C:
Simona, Chiori, Teruhisa, Kaz (Spokesperson=Simona, Teruhisa)

Session8: Reports from breakout sessions

Moderator:
Kurt
Scribe:
Simona/Paolo

Presentation of Use Cases

UC1:

[ UC1 photo ]

Applications need to be sensitive to certain (arbitrary, dynamically
extracted) features, such as gender, age, etc. Example: adjust voice,
phrasing, etc. based on those features. Current limitation: the current
VXML infrastructure only treats words/interpretations/confidence as
standard requirements. We need a place to put this info somewhere so
that it gets transmitted.
UC2:

[ UC2 photo ]

The interaction between lexicon and grammar doesn't include additional info
such as POS, grammatical features, or other annotations. Example: I'd like to
annotate a name with the region (location) to influence pronunciation.
UC3:

[ UC3 photo ]

A dialogue system contains discourse and WSD information that could be used to
improve rendering. Example: "record" (noun vs. verb). Need a mechanism
to convey the information between those components without having to
modify the categorization of either the dialogue or the synthesis
system.
UC4:

[ The upper part of the following photo is UC4. ]
[ UC4/UC5 photo ]

Phoneme sets. An author can't create a component (app, lexicon, ASR
engine, TTS engine, etc.) which is assured to be interoperable with
other components in terms of the phoneme set. The author should be
able to use and specify a pre-defined standard phoneme set.
UC5:

[ The lower part of the following photo is UC5. ]
[ UC4/UC5 photo ]

Syntactic Formalism. Today an author cannot create a syntactic
grammar for comprehensive NL because the formalism lacks feature
inheritance, POS terminals, concord, inversion, etc.

A new formalism should be created.
UC6:

[ The upper part of the following photo is UC6. ]
[ UC6/UC7 photo ]

Morphology engine

Kim: what would it allow authors to do?  Kurt: you could use a higher-level
abstraction to create applications or engines.
UC7:

[ The lower part of the following photo is UC7. ]
[ UC6/UC7 photo ]

Shared Syntactic Grammars (for simultaneously running applications)
(please fill in UC description)
UC8:

[ UC8 photo ]

(from Kim) Focus change: users need a way to tell the device how to
control focus (speech commands don't automatically reorient the mouse).

There's no standard way to do this.
UC9:

[ UC9 photo ]

(from Kim) Users are afraid of speech: users need a way to undo both
actions and text events.
UC10:

[ UC10 photo ]

Mix constrained and unconstrained recognition.

Debbie: this can't be done today within SRGS
UC11:

[ UC11 photo ]

Combine reco constraints when multiple apps are active simultaneously
and transfer focus.

Kurt: What are reco constraints?

Raj: A more general name for "grammar". Raj proposed a mechanism to
  solve this by exporting top-level reco grammars.

Simona's remark: the solution Raj proposed is not necessarily the only
  solution for that problem.
UC12:

[ UC12 photo ]

Kim: need for resolving conflicting commands, as well. Need a way to
  organize and share commands, and for the user to have a way to
  remember and prioritize commands. So, one solution might be to have
  user configuration for conflict resolution. Today users can't find,
  adjust, organize, or share commands.
UC13:

[ UC13 photo ]

Enabling/disabling grammars/reco constraints. These should be able to be
context specific (activated/deactivated/weighted), where context means
anything (including user profile, history, etc.).
UC14:

[ UC14 photo ]

Dynamic on-the-fly activation, deactivation, or combination of any SRGS
or SLM grammars.

There are applications that in intelligent conversation combine
open-ended and restricted language, so we need a mechanism to specify
how to combine any recognition constraints. Moreover, we need to be able
to dynamically weight those recognition constraints based on context.

Proposed mechanism: specify weights in the range 0-1 at each recognition
step.
UC15:

[ UC15 photo ]

EMMA Extension or Richer Semantic Representation

We want to be able to represent the semantics of complex NL. Examples:
"give me all those toppings except onions", or "I want to leave this
afternoon or tomorrow morning to arrive before noon".

In current standards we can represent attribute-value pairs and there is
hierarchy in EMMA, but there is no way to specify modifiers or
quantifiers between slots.

Proposal: look at existing formalisms such as RDF and maybe extend
EMMA.
UC16:

[ UC16 photo ]

R&D Agility

As we do research we develop new algorithms that need new information,
and we would like to experiment with them before the standardization
process.

We need a reliable mechanism in VXML to carry this information.
Examples: adding location information, new DSR signal features.

Today's solution is to add values to parameters; if we had a standard
way to set vendor-specific values we would be happy.
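
( Illustration of today's workaround: VoiceXML 2.0 already permits platform-specific <property> names alongside the standard ones, so experimental information is typically passed through vendor-prefixed properties. The property name below is made up for the example. )

[[
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- hypothetical vendor-prefixed property carrying experimental settings -->
  <property name="com.example.dsr.extrafeatures" value="true"/>
  <form id="main">
    <field name="city">
      <prompt>Which city?</prompt>
      <grammar src="cities.grxml" type="application/srgs+xml"/>
    </field>
  </form>
</vxml>
]]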
UC17:

[ UC17 photo ]

Problem Solving

We want to be able to build applications that solve complex problems,
like IT problem solving.

The call control logic of such applications cannot today be efficiently
described as a state machine; therefore the available standards (VXML,
SCXML) are insufficient to implement these applications, which instead
call for, e.g., a probabilistic rule engine or a task-agent system.

I'm not sure this can be efficiently put into a standard, but I want to
point out that there is another category of applications not handled by
the standards.
UC18:

[ UC18 photo ]

EMMA Extension: Multi-source input and corresponding confidence

In multimodal and more advanced applications, input might come from a
variety of simultaneous sources such as text, speech, GPS, world
knowledge, user profile, etc. For instance, I might say "I want to go
to Denver" and the application can know from GPS where I am. I'd like
each concept or slot to be able to have multiple input sources with
corresponding values and confidence scores.
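
( A rough, abridged EMMA 1.0 sketch of the scenario: the spoken destination and the GPS-derived origin each carry their own mode, source and confidence. The mode and source values and the city names are illustrative; what the use case asks for beyond this is a standard way to attach several sources, values and confidences to a single slot. )

[[
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:group id="trip">
    <!-- spoken input -->
    <emma:interpretation id="dest" emma:medium="acoustic" emma:mode="voice"
        emma:source="urn:example:asr" emma:confidence="0.8">
      <to>Denver</to>
    </emma:interpretation>
    <!-- location derived from GPS (annotation values illustrative) -->
    <emma:interpretation id="origin" emma:mode="gps"
        emma:source="urn:example:gps" emma:confidence="0.95">
      <from>Boulder</from>
    </emma:interpretation>
  </emma:group>
</emma:emma>
]]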
UC19:

[ UC19 photos ]

Semantic representation of dialogue state that can include any kind
of data (e.g. history, slot conditions, user models, expectation of the
next system actions).

The problem is that the current VXML 2.0 specification does not support
a container of dialog state that contains multiple hypotheses of
dialogue state.

The simplest example is:

Sys1: HMIHY

User1: Kyoto

ASR: Kyoto (0.7) or Kobe (0.3)
SLU: the city could fill either the "from" or the "to" slot, so 4 dialog states

Multiple dialogue states should be returned/represented in VXML.

Sys2: can generate a prompt ...

Discussion

Combine Use Cases:

Kurt: Propose to combine (7) and (11) called 7+11

** Decision: Merge 7+11

Raj: Combine (10) and (13) and (14)

Simona: Thinks that (14) is a superset of the other two

Debbie: Prefer to keep them split

** Decision: Aggregated 10-13-14

======================================

TODO for the next session:
Vote with 10 points

Session9: Prioritize use cases & identify potential new languages and language extensions

Moderator:
Dan
Scribe:
Kaz/Debbie

dan: (shows the following result of the priority/interest vote)

Use case number | Tally
10/13/14        | 18
1               | 12
5               | 12
19              | 9
7/11            | 8
3               | 7
16              | 7
12              | 6
18              | 6
8               | 5
9               | 5
2               | 4
15              | 4
4               | 3
17              | 3
6               | 2

dan: the natural breaking point should be the top three?
... the results are sorted by tally
... next we'll discuss the top topics

UC10/13/14 (18 points): Dynamic on-the-fly activation, deactivation, or combination of any SRGS or SLM grammars

[ UC10, UC13, UC14 photos ]

debbie: putting comments from Simona directly into the editor

paolo: draws a picture about grammars and SLMs

[ Paolo's picture ]

question: how to dynamically update SLM?

raj: let's think about simple case with grammar

paolo: we have a set of grammars
... I want to reach deep into the grammars

dan: there is a naming problem
... an example is from city to city
... there is a referencing issue

simona: naming conflict could be solved

raj: it's about categorization

paolo: the point is compiling/optimizing all the related grammars is difficult

dan: weights are available in SRGS, but how do you attach rules to weights?

( Dan goes to the flipchart and draws an example rule )

TRIP:= from CITY to CITY
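
( A minimal SRGS 1.0 sketch of the flipchart rule, with weights on the city alternatives. In SRGS these weights are fixed at authoring time, which is exactly the gap under discussion: there is no standard way to change them per recognition step or to attach context rules to them. The city names are just examples. )

[[
<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
         root="trip" xml:lang="en-US">
  <rule id="trip" scope="public">
    from <ruleref uri="#city"/> to <ruleref uri="#city"/>
  </rule>
  <rule id="city">
    <one-of>
      <!-- static weights; UC10/13/14 asks for a way to vary these dynamically -->
      <item weight="0.7">Kyoto</item>
      <item weight="0.3">Kobe</item>
    </one-of>
  </rule>
</grammar>
]]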

( Debbie adds clarification to the proposal on "mechanism to specify weights in range 0-1..." )

dan: this mechanism should be available not only to engine developers and application authors but also to users

paolo: even if the same weights are given, if different engines, e.g., Loquendo and Nuance, are used, the results are different

debbie: a couple of issues were mentioned

( Debbie reads the updated notes )

( Debbie adds "nonterminals could be the points of control" under Proposed mechanism..." )

( also adds "or at least enable and disable" to the end of Proposed mechanism to....")

--------------
Debbie's note
--------------
UC10: Mix constrained and unconstrained recognition.

UC13: Enabling/disabling grammars/reco constraints. These should be able to be 
context specific (activated/deactivated/weighted), where context means 
anything (including user profile, history, etc.).

UC14: Dynamic on-the-fly activation, deactivation, or combination of any SRGS 
or SLM grammars

There are applications that in intelligent conversation combine open-ended and 
restricted language, so we need a mechanism to specify how to combine any 
recognition constraint. Moreover, we need to be able to dynamically weight 
those recognition constraints based on context.

Proposed languages and language extensions:

- Proposed mechanism to specify weights in range 0-1 in each
  recognition step dynamically, or at least enable and disable

  Interesting issues with compiled grammars that have been optimized:
  - need ways to specify in the grammar which weights might change, so
    that the compiler knows what might change
  - need ways to tell the grammar dynamically what the weights are
  - nonterminals could be the points of control

- Allow nonterminals in SRGS to refer to SLM's or dictation rules (see the sketch after this note)

- Allow an SLM to refer to a grammar, e.g. class-based SLM's; this requires
  a standard for SLM's

possible extensions: SRGS, and VoiceXML (this might also imply changes
  to protocols like MRCP)

possible new languages: SLM
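
( As a rough sketch of the "mix constrained and unconstrained recognition" gap: the closest SRGS 1.0 offers today is the special GARBAGE rule, which consumes arbitrary speech but returns nothing useful; the note above instead asks for nonterminals that can refer to a real SLM or dictation model. The vocabulary below is illustrative. )

[[
<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
         root="request" xml:lang="en-US">
  <rule id="request" scope="public">
    i want to fly to
    <ruleref uri="#city"/>
    <!-- today: GARBAGE swallows the open-ended tail without recognizing it;
         the proposal would let this nonterminal point at an SLM or dictation rule -->
    <ruleref special="GARBAGE"/>
  </rule>
  <rule id="city">
    <one-of>
      <item>Denver</item>
      <item>Boston</item>
    </one-of>
  </rule>
</grammar>
]]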

UC1 (12 points): Applications need to be sensitive to certain (arbitrary, dynamically extracted) features, such as gender, age, etc.

[ UC1 photo ]

paolo: one possibility is extending EMMA to have a place for this info

simona: in a standardized way

dan: we'd rather make things for the author
... do we need to have a way to ask for specific return results?

paolo: do we want to add anything to SRGS/SISR about this?

debbie: maybe VoiceXML?

( Debbie adds the following note )

Possible change to SRGS to say "please collect some specific metadata, e.g., gender, age"

dan: to use this info within VoiceXML, we would need to require that the entire EMMA result be available in VoiceXML
... applications' last result is EMMA
... need to take care not to pollute namespaces

--------------
Debbie's note
--------------
UC1:
Applications need to be sensitive to certain (arbitrary, dynamically 
extracted) features, such as gender, age, etc. Example: adjust voice, 
phrasing, etc. based on those features.

Current limitations: the current VXML infrastructure only treats
words/interpretations/confidence as standard requirements. We need a
place to put this info somewhere so that it gets transmitted.

Possible extensions:

- Extend EMMA to have a place to put this information in a standardized
  way (see the sketch after this note)
- Do we need to have a way to ask for specific return results?
  (VoiceXML) We have never done this before; we just get everything
- Possible change to SRGS to say "please collect some specific metadata,
  e.g. gender, age"
- To use this information within VoiceXML we would need to require that
  the entire EMMA result be available in VoiceXML
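
( One possible shape for the "place to put this information": EMMA 1.0 already defines an emma:info container for application- and vendor-specific annotations, so a sketch, using a made-up application namespace and made-up feature names, might look like the following. Whether and how to standardize the feature names is the open question. )

[[
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"
           xmlns:app="http://example.com/speaker-features">
  <emma:interpretation id="int1" emma:medium="acoustic" emma:mode="voice"
      emma:confidence="0.8">
    <destination>Kyoto</destination>
    <!-- dynamically extracted speaker features (illustrative names) -->
    <emma:info>
      <app:gender>female</app:gender>
      <app:age-range>30-40</app:age-range>
    </emma:info>
  </emma:interpretation>
</emma:emma>
]]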

UC5 (12 points): Syntactic Formalism

[ UC5 photo ]

debbie: clarification about semantic grammar vs. syntactic grammar?

kurt: not an extension to SRGS but a new mechanism for inheritance of categories
... part-of-speech in both the lexicon and the grammar should be correlated
... here I don't mean any semantics, just syntactic information

( Debbie adds the following )

Proposal: new language

kurt: I don't see any model which includes feature inheritance capability, etc.

debbie: proposing an XML?

dan: not necessarily

debbie: maybe CSS is an example of non-XML specs?

( extension for SRGS vs. completely new language? )

paolo: reducing the search space has been important for recognition
... I'm a bit concerned

--------------
Debbie's note
--------------
UC5:
Syntactic Formalism. Today an author cannot create a syntactic grammar for 
comprehensive NL because the formalism lacks feature inheritance, POS 
terminals, concord, inversion, etc.

Proposal:
a new language that can capture syntactic natural language requirements; could 
look at the NL literature; anything in XML?

[ short break ]

Session9: Prioritize use cases & identify potential new languages and language extensions (contd.)

Moderator:
Dan
Scribe:
Debbie

UC 19 (9 points): Semantic representation of dialogue state that can include any kind of data.

[ UC19 photos ]

--------------
Debbie's note
--------------
UC19: Semantic representation of dialogue state that can include any kind of data 
(e.g. history, slot conditions, user models, expectation of next system
actions).

The problem is that the current VXML 2.0 specification does not support a 
container of dialog state that contains multiple hypotheses of dialog state.

Most simple example is:
[[
Sys1: HMIHY

User1: kyoto

ASR: Kyoto (0.7) or Kobe (0.3)
SLU: the city could fill either the "from" or the "to" slot, so 4 dialog states
Sys2: Can generate a prompt ...
]]

Proposal: Multiple dialogue states should be returned/represented in
  VXML.

  How to represent states? EMMA? (see the sketch after this note)

New language: an XML data structure for representing one or more
  alternative dialog states, not necessarily linked to any other spec;
  alternatively, VoiceXML could ask what to say or ask next, and the
  server would consult the new data structure.
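
( For the HMIHY example above, the ASR/SLU ambiguity itself is already representable as an EMMA 1.0 one-of; what is missing is a standard container for the resulting alternative dialog states, which is what this proposal asks for. A sketch of the part that exists today: )

[[
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:one-of id="city" emma:medium="acoustic" emma:mode="voice">
    <emma:interpretation id="h1" emma:confidence="0.7"><city>Kyoto</city></emma:interpretation>
    <emma:interpretation id="h2" emma:confidence="0.3"><city>Kobe</city></emma:interpretation>
  </emma:one-of>
</emma:emma>
<!-- The four resulting dialog-state hypotheses (Kyoto-as-from, Kyoto-as-to,
     Kobe-as-from, Kobe-as-to) have no standard representation today; that
     container is what UC19 proposes. -->
]]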

UC 7/11 (8 points): Combine reco constraints when multiple apps are active simultaneously and transfer focus.

UC19
(click to enlarge)


UC19-2
(click to enlarge)

--------------
Debbie's note
--------------
UC 7/11

UC11:
Combine reco constraints when multiple apps are active simultaneously and 
transfer focus.

UC7: Shared Syntactic Grammars (for simultaneously running applications) 

Proposals:

authoring:

runtime:
- we need a method for an arbiter to query an application about what
  its capabilities are
- the arbiter also allows the user to discover, share, etc. the
  capabilities of the applications
- aggregate grammars from several applications into a common grammar
- ensure that grammars that share the same syntactic formalism can be
  unified

[End of minutes]


The Call for Participation, the Logistics, the Presentation Guideline and the Agenda are also available on the W3C Web server.


James A. Larson, Deborah Dahl, Daniel C. Burnett and Kazuyuki Ashimura, Workshop Organizing Committee

$Id: minutes.html,v 1.47 2010/06/30 19:24:46 ashimura Exp $