Workshop on W3C's Multimodal Architecture and Interfaces — Minutes

16-17 November, 2007

Keio University Shonan Fujisawa Campus, Fujisawa, Japan



  1. Toshihiko Yamakami (ACCESS)
  2. Deborah Dahl (Conversational Technologies)
  3. Florian Metze (Deutsche Telekom Laboratories)
  4. Ingmar Kliche (T-Systems)
  5. Mitsuru Shioya (IBM)
  6. Seiji Abe (IBM)
  7. Daisuke Tomoda (IBM)
  8. Ali Choumane (INRIA)
  9. Satoru Takagi (KDDI R&D Laboratories)
  10. Eiji Utsunomiya (KDDI R&D Laboratories)
  11. Masahiro Araki (Kyoto Institute of Technology)
  12. Sriganesh Madhvanath (Hewlett-Packard Labs India)
  13. Ellis K 'Skip' Cave (Intervoice Inc.)
  14. Masao Goho (Microsoft Windows Division)
  15. V. Raj Tumuluri (Openstream Inc)
  16. Saagar Kaul (Openstream Inc)
  17. Andreas Bovens (Opera Software)
  18. Tsuneo Nitta (Toyohashi University of Technology)
  19. Kunio Ohno (Polytechnic University)
  20. Kristiina Jokinen (University of Tampere)
  21. Kazuyuki Ashimura (W3C)
  22. Karl Dubost (W3C)
  23. Felix Sasaki (W3C)
  24. Doug Schepers (W3C)
  25. Michael Smith (W3C)


Friday, 16 November 2007

Session 1. Introduction

Session 2. Framework for developing Multimodal Interaction Web applications, part I

Session 3. Framework for developing Multimodal Interaction Web applications, part II

Session 4. Multimodal applications on mobile devices

Saturday, 17 November 2007

Session 5. Panel discussion on gesture and handwriting input modality

Session 6. Advanced requirements for the future MMI Architecture

Session 7. Review issues raised during the workshop

Session 8. Conclusions and next steps

Friday, 16 November 2007

Session 1. Introduction

Kazuyuki Ashimura
Felix Sasaki

Welcome to the workshop by Deborah Dahl

Deborah introduces the workshop

Self introduction of participants

Logistics information by Kaz

Host welcome and logistics by Kazuyuki Ashimura

[ Agenda | Logistics ]

Kaz mentions the workshop schedule and logistics.

Introduction to W3C & MMIWG by Kazuyuki Ashimura

[ Slides ]

Kaz presents http://www.w3.org/2007/Talks/1116-w3c-mmi-ka/

W3C's Multimodal Architecture and Interfaces by Deborah Dahl

[ Slides ]

Debbie presents http://www.w3.org/2007/08/mmi-arch/slides/Overview.pdf

fsasaki: combining various modalities, you could create a timeline of various time offsets, but I didn't see these in the list of events
... so I'm wondering if that's embedded in EMMA, or... ?

Debbie: such information might be available in EMMA. Another idea is to have that information within the interaction manager
... but we do not have that yet

Doug: have you thought about a distributed DOM?

Raj: architecture supports that kind of information in principle

Debbie: but we do not support directly a distributed DOM

Doug: who will implement EMMA or this architecture? Browsers, special clients?

Raj: depends on the place of the modality components

debbie: some components combine input and output. There is no constraint that a component is only input or output

araki-san: Is there a fusion component in the current version of the architecture?

araki-san: If a component is nested, fusion gets difficult

debbie: insides of a nested component are outside of the MMI architecture

Ingmar: there is not necessarily a user interface
... components could be somewhat abstract

InkML: Digital Ink specification at W3C by Kazuyuki Ashimura

[ Slides ]

Kaz presents http://www.w3.org/2007/Talks/1116-ink-ka/

Ali: in some applications you will need more information than the points, e.g. also time information

SriG: time is one possible channel

doug: what is the use of the tilt of a pen?

kaz: if we use a kind of brush, the tilt is very important

Shioya-san: can InkML help in retrieving how the person who is handwriting feels?

SriG: InkML is independent of such applications
... emotion or form filling can be done with a basic annotation element in InkML
... or you can create a different language and embed InkML in that language

Ingmar: there is an incubator at W3C working on emotions, you might look into that

Shioya-san: it's a complicated topic

Debbie: InkML is like a WAV file for audio. It is low-level representation
... e.g. hand-writing recognition is a different level than InkML
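
The channel discussion above (points, time, pen tilt) can be illustrated with a minimal Python sketch. It assumes a simplified InkML-style trace in which points are separated by commas and the channel values inside a point by whitespace, with explicit values only. The function name parse_trace, the sample data, and the choice of channel names (X, Y, T for time, OTx for tilt) are illustrative assumptions, not something presented at the workshop.

```python
from typing import Dict, List, Sequence


def parse_trace(trace: str, channels: Sequence[str]) -> List[Dict[str, float]]:
    """Split a simplified InkML-style trace into one dict per sample point.

    Assumption: explicit values only, points separated by commas and
    channel values within a point separated by whitespace.
    """
    points: List[Dict[str, float]] = []
    for raw_point in trace.split(","):
        values = raw_point.split()
        if not values:
            continue  # tolerate a trailing comma
        if len(values) != len(channels):
            raise ValueError(
                f"expected {len(channels)} values per point, got {len(values)}"
            )
        points.append({name: float(v) for name, v in zip(channels, values)})
    return points


# Hypothetical trace carrying position, time and tilt for each sample.
samples = parse_trace("10 0 0 45, 9 14 20 44, 8 28 40 42", ["X", "Y", "T", "OTx"])
```

Like a WAV file, the result is still raw data: nothing here recognizes handwriting or emotion, which would be layered on top.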

Session 2. Framework for developing Multimodal Interaction Web applications, part I

Ingmar Kliche
Mike Smith

"SCXML, Multimodal Dialogue Systems and MMI Architecture", Graham Wilcock & Kristiina Jokinen

[ Paper / Slides by Jokinen / Slides by Wilcock ]

[Kristiina Jokinen presenting]


[video demo of speech-recognition in noisy environment]

[Kristiina walking us through slides]

MUMS - MUltiModal navigation System

Input Fusion (T. Hurtig)

Interact system / Jaspis (Turunen et al.)

agent-based system
... agent selection based on particular heuristics

Adaptive Agent Selection

"estimate evaluator" selects best available agent

Conclusions [slide]

includes "Tactile systems seem to benefit from speech recognition as a value-added feature"

Raj: can you elaborate on details of users?

KJ: Interesting thing: Expectations biased user evaluation of the system ...
... if they thought they were talking to a speech system that only had this pen ...
... their reaction was bad ...
... but if they were working with a pen with speech as an added feature ...
... their reactions were more positive ...
... but in both cases it was exactly the same system ...

Raj: for TTS over network, [latency is very annoying to users]

KJ: yes, that is a problem with mobile applications [of this technology]
... one other thing we found is that when we divided users into age groups ...
... interesting to note that young, under age 20 users
... and age 55+ users ...
... the young users really had positive expectations ...
... but the middle-aged users had low expectations ...

MSh: But you also considered skill level of users?

KJ: Younger users are more prone to different types of gadgets ...
... so they [seem to have] already collected experience
... need to consider adaptation of the system -- interaction strategies -- when the users [have trouble with the system]

Ali: About gesture interpretation: What is the technique to select objects [on screens that are small and crowded with such objects]

KJ: If you have some good suggestions I would be happy to get them :)

[Ali provides some examples]

KJ: Have to take into account that the user can actually be wrong

Ali: I think this is actually a problem with all such systems

kaz: and we can talk about those kinds of issues, related to disambiguation using multiple modalities, tomorrow morning at Session 6 :)

Ingmar: Is your system actually event-based?

KJ: yes, it is basically event-based and asynchronous

Ingmar: Looks like you have split up [the managers in your architecture] into smaller parts that communicate with each other [across the managers]

KJ: there are abstract devices that these managers define
... each manager checks if the shared information storage is in one of the states [that one of its agents expects]

Ingmar: any declarative languages for this system?

KJ: Java based...

Ingmar: How many of those components are application-specific?
... have you looked to see if the workflow through the system can be generalized?

KJ: though some of the library components could be shared, most parts depend on each application

Ingmar: Is there a reason why you are not using EMMA?

KJ: when the first version of the system was built, we did not know about EMMA yet :)

Ingmar: If you could look into that and provide information back to the group, would be much appreciated

Multimodal Framework Proposal, Skip Cave

[ Paper / Slides ]

slide: Paradigm-breaking examples

Raj: The components per-se do not have to be executing scripts ...
... so, there is already asynchronicity present

slide: Questions: how does an Interaction Manager handle modifications to an ongoing process, parallel processes?
... Possible New Lifecycle Interaction Modes
... [discussion of possible solution to deal with problems of modify events and parallel events]

<kaz> so "modify" means some kind of "modifier" like [SHIFT] or [Ctrl]

<kaz> and "parallel" is kind of superimpose

slide: a benefit of this proposal is that it does not require developers to write asynchronous event handlers on modality components

Raj: You can use a privilege-based system to control access to, e.g., speakers

Skip: You don't always want to add that kind of overhead, especially for simple systems.

DD: The modify proposal reminds me of DOM modification
... wondering if that might be too fine-grained
... might be able to get away with something a bit more abstract
... so that interaction manager doesn't have to know how to handle DOM modification
... If we wanted to implement that modify proposal, we would need to think about what the API would look like

Florian: If the first "done" comes back too quickly, then you have a problem ...

Ingmar: maybe it makes sense to place limits on how multiple starts [and such] are handled
... [discusses other specific restrictions that would make this more implementable]

<kaz> ... whether to allow parallel/simultaneous/multiple invoking or not

<kaz> ... sharing/mixing media control

Skip: mixing and replication are relatively simple
... keep in mind the media component really has only one input stream and one output stream

<kaz> maybe Skip is suggesting sub Interaction Manager named "media stream controller"

Focused discussion: What kind of features should a Multimodal framework have?

DD: I note multiple mentions of fusion ... something we have not discussed in the group for a while [but maybe should]
... I also heard some interest in timing, at the level of the Interaction Manager

Ingmar: Do you have a specific use-case for this that illustrates what you actually mean by "timing"?

[we note that Felix brought up this issue this morning]

Ingmar: This is covered in EMMA, actually

Felix: Does EMMA define identity of two sequences or other temporal relations, etc.?

DD: Basics are in EMMA, but it's not really defined.

shepazu: but it probably should not really be defined in EMMA

DD: kaz pointed out that EMMA has concept of relative timestamps

kaz: lattice functionality might be available for that purpose

DD: We have requirements from accessibility people related to timeouts

Raj: [to Felix] the information you would need is available from EMMA

Raj: I am looking for relations like these: http://en.wikipedia.org/wiki/Allen's_Interval_Algebra are they all available in EMMA?
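
As an illustration of the temporal relations asked about here, the following Python sketch classifies two time intervals into one of the thirteen relations of Allen's interval algebra. It assumes each input (for example a speech utterance and a pen gesture) comes with start and end timestamps on a common clock, of the kind EMMA annotations could supply; the function name allen_relation and the millisecond values are hypothetical.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (start, end) with start < end


def allen_relation(a: Interval, b: Interval) -> str:
    """Return the Allen interval relation that holds from a to b."""
    a_start, a_end = a
    b_start, b_end = b
    if a_start >= a_end or b_start >= b_end:
        raise ValueError("intervals must have start < end")
    # Disjoint or touching intervals.
    if a_end < b_start:
        return "before"
    if b_end < a_start:
        return "after"
    if a_end == b_start:
        return "meets"
    if b_end == a_start:
        return "met-by"
    # From here on the intervals share more than a single instant.
    if a_start == b_start and a_end == b_end:
        return "equals"
    if a_start == b_start:
        return "starts" if a_end < b_end else "started-by"
    if a_end == b_end:
        return "finishes" if a_start > b_start else "finished-by"
    if b_start < a_start and a_end < b_end:
        return "during"
    if a_start < b_start and b_end < a_end:
        return "contains"
    return "overlaps" if a_start < b_start else "overlapped-by"


# Hypothetical timestamps in milliseconds: a pen gesture made while speaking.
speech = (0.0, 1500.0)
gesture = (800.0, 1200.0)
relation = allen_relation(gesture, speech)
```

With start and end times available per input, all thirteen relations can be derived outside the annotation format itself, which is one way to read the suggestion that they need not be defined in EMMA.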

shepazu: mentions case related to SMIL and slow-down

Skip: VoiceXML has features related to speed-up and slow-down

shepazu: the rate at which most real-world users of screen readers have their readers set is so fast as to be incomprehensible [to a mere mortal such as I]

DD: another use case for parallel audio might be simultaneous translation
... and may have, e.g., 4, 5 different users listening to different translations

KJ: about timing, I was wondering about error management, and about multiple users, wondering about collaborative situations where users can affect what other users receive

Ingmar: Looking at the current spec, it is handling those cases.
... About the Fusion module, is this something that should be handled inherently in the architecture?
... as a generic [part] of the framework

Raj: error handling is a big problem from the authoring perspective

Ingmar: Isn't this the point of common-sense guidelines, and don't we have a doc for that already?

DD: Do we expect the Interaction Manager to do too much?

KJ: Thank you for saying that.

Raj: <sync> tag, for example, is very useful

<kaz> topics: fusion of stream media, timing control, error handling, multiple users, accessibility, IM too much?, higher level language than SCXML?


Session 3. Framework for developing Multimodal Interaction Web applications, part II

Raj Tumuluri
Kazuyuki Ashimura

Proposal of a Hierarchical Architecture for Multimodal Interactive Systems, Masahiro Araki

[ Paper / Slides ]

masa: joint work with Prof. Nitta
... using Multimodal Toolkit Galatea
... will mention some issues on W3C's MMI Architecture
... tackling another MMI Architecture for Japanese based standardization
... viewpoint is rather research based while W3C's one is implementation based
... Galatea Toolkit includes: ASR/TTS/Face generation
... (note TTS and Face generation are managed by sub manager)
... comparison to W3C's MMI Architecture
... Modality Component is too big for life-like agent...

masa: consider several examples:
... e.g. 1 lip sync functionality using W3C's MMI Architecture
... e.g. 2 back channeling
... in both cases too many connections between MC and IM would be required
... another question is: fragile modality fusion and fission
... how to combine inputs???
... e.g. speech and tactile info
... e.g. 2 speech and SVG
... contents planning is needed
... third question is: how to deal with user model?
... changing dialog strategy based on user experience
... history of user experience should be stored
... but where to store???
... how about going back to previous "multimodal framework"?
... which has smaller MC
... in addition, should have separate transition: task, interaction, fusion/fission
... started with analysis of use cases
... then clarified some requirements

masa: and now would like to propose 6 layered construction
... now planning to release: trial standard and reference implementation
... explains event/data sequence of the 6-layered model
... L1. I/O module
... L2. Modality component: I/F data is EMMA
... L3. Modality fusion: output is EMMA/Modality fission
... L4. Inner task control: error handling, default (simple) subdialogue, FIA, updating slot
... L5. Task control: overall control
... L6. Application

masa: but do we need some specific language for each layer?
... don't think so
... summary: propose this "Convention over configuration" approach

kris: what is the difference between L4 and L5?

masa: client side vs. server side

raj: there is a focused discussion session, and we can talk about the details there

ingmar: what if you have small and many MCs?

masa: actually this architecture doesn't care about uni/multi modalities

skip: what do you mean by "W3C's MC is too large"?

masa: please see this lip sync app
... there is no way for the MCs to interact with each other directly
... they need to connect through the IM

skip: so you mean TTS should be able to directly talk with other modality?

masa: exactly

skip: same issue I pointed out :)
... so you suggest lower level information connection?

masa: lower connection implemented using layer 2
... can be handled by lower level e.g. devices

kaz: where should device specific information be handled? layer 1?

masa: could be handled by upper layer? (sorry missed...)

ingmar: looking at the 6-layer picture
... L4-L6 correspond to RF?
... why not L3-L6?

raj: thinks your model is better, e.g. it mentions user model/device model

masa: user model is actually vertical layer, though

ingmar: what is the relationship between layers and languages?

raj: not necessarily explicit languages

debbie: any other questions?

srig: very useful for research :)

W3C Technologies as Consumers of Multimodal Interfaces, Doug Schepers

[ Paper / Slides ]

doug: SVG, WebAPI, CDF contact
... SVG: not only desktop but also mobile
... Web API: DOM, script and dynamic content, auxiliary tech e.g. Ajax, Selectors
... CDF: mixture of (X)HTML, SVG, MathML, etc.
... and WICD
... who are the consumers of MMI???
... 1. implementors
... 2. integration
... 3. client-side functionality
... some questions
... 1. is Ink API or ML?
... set of channels expressed in pointer event?
... 2. SVG as drawing app?
... pressure info as well
... 3. VoiceXML handles DOM?
... integration of Voice and SVG?
... 4. DCCI: geolocation, battery life, etc.
... 5. Multiple touch: specifying line using 2 simultaneous touches etc.
... and multiple users
... challenging from DOM perspective

<karl> http://www.youtube.com/watch?v=0awjPUkBXOU -> hacker using the Wii in a funny way.

doug: how to liaise?
... direct liaison or inspiration/use cases evaluation?
... SVG as data/event source

<karl> http://developer.apple.com/documentation/AppleApplications/Reference/SafariWebContent/HandlingEvents/chapter_8_section_2.html -> one finger event on the iphone

doug: a user's specification, like drawing a circle, could be stored as an event
... distributed DOM
... intersection???
... what's the common infrastructure?
... DOM integration
... implementor focus
... how to encourage users/implementors?

karl: what doug talked about is very important.
... We have to capitalize on the hacking trends, on the Web community developing already cool stuff and extend if necessary.
... It will be a lot easier to work with than on a separate thread.

doug: it's important to understand what is happening within W3C
... e.g. InkML
... compatibility of specifications etc.
... liaison is important
... for implementors

debbie: on one hand we have PC world, and on the other hand we have Voice world
... e.g. VoiceXML 2.X doesn't handle events
... the interaction between these different worlds is important/interesting/problematic
... from the Voice viewpoint, VoiceXML is a good candidate modality
... while GUI modality is a bit difficult to handle

Raj: W3C originally considered desktop modality...
... we have not been considering multiple devices/modalities
... we had to handle some of the specs as a "black box"...
... e.g. DOM

doug: CDF has two approaches: reference and embedded
... in case of reference, "blackbox" approach should work
... security is another difficult topic

skip: Voice Browser WG is considering event handling
... e.g. record, play, recognize

doug: DOM3 events have several essential commands

mike: we should remember the importance of involving various web browser vendors.
... maybe starting with plug-in might be a good idea


Session 4. Multimodal applications on mobile devices

Sriganesh Madhvanath
Doug Schepers

Multimodality in Mobile Force Applications, Raj Tumuluri

[ Paper / Slides (TBD) ]

raj: we've been waiting for a state structure... SCXML looks useful
... we want to standardize these things so they work in the real world
... we need to identify the challenges we face

raj: devices are getting more advanced, but we need browser functionality to catch up
... will make slides available later
... our presentation includes a field study done by our engineers
... regarding the functionality of the UIs
... major challenge is disparate functionality between devices
... need embedded TTS and ASR client on device
... devices and applications must be generalized to different tasks, challenge in aggregating functionality
... having functionality about device state (connectivity, battery life) would be valuable for back end processing
... we have unsophisticated users ("blue collars") who need to be able to communicate naturally ("I want that")
... our study really captures how necessary this is
... mixed mode is necessary in field (even if it's not accessible)
... mixed-initiative input method lets you fill out multiple fields at once (like an address, all in one breath)
... also combining voice and pointer events in a single interaction
... this saves a lot of time... 8% efficiency increase
... we're XML-centric
... mobile devices don't have XML events and other key technologies
... can't send events, so need other ways to solve the problems
... Access and Opera don't let me pass events to local TTS/ASR engines...
... one of our guiding principles is the ability to nest components in a recursive way

[references interaction manager as example of component aggregation]

raj: one of the biggest challenges is assuming that all the components boot up in a consistent order, with no drop-outs
... this is important, for example, for modules for different languages (English and Spanish)... so we have to query what the current device capabilities are
... we use a kind of registry to show what components are available, and we pass along our capabilities to the application
... for example, if the user has pressed the mute button, we don't want to send the sound down the line
... the problem with the encapsulation model is that it's mediated through an Interaction Manager, and we need to have 2 modes (for offline vs. online behavior)
... at short notice
... while maintaining compliance to MMI architecture
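
The registry idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class ComponentRegistry, its methods, and the component and capability names are hypothetical and are not taken from the presented system or from the MMI architecture documents.

```python
from typing import Dict, Iterable, Set


class ComponentRegistry:
    """Tracks which modality components are currently available."""

    def __init__(self) -> None:
        self._capabilities: Dict[str, Set[str]] = {}

    def register(self, name: str, capabilities: Iterable[str]) -> None:
        """Record a component once it has booted, in whatever order."""
        self._capabilities[name] = set(capabilities)

    def unregister(self, name: str) -> None:
        """Remove a component that dropped out or was disabled."""
        self._capabilities.pop(name, None)

    def available(self, capability: str) -> bool:
        """True if at least one registered component offers the capability."""
        return any(capability in caps for caps in self._capabilities.values())

    def snapshot(self) -> Dict[str, Set[str]]:
        """Copy of the current state, to pass along to the application."""
        return {name: set(caps) for name, caps in self._capabilities.items()}


registry = ComponentRegistry()
registry.register("tts-en", {"audio-out", "lang:en"})
registry.register("asr-es", {"audio-in", "lang:es"})

# The user presses the mute button: audio output is no longer offered,
# so the application can avoid sending sound down the line.
registry.unregister("tts-en")
send_audio = registry.available("audio-out")
```

Querying the registry at the time of use, rather than assuming a fixed boot order, is what lets the application cope with components that start late or drop out.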

[shows demos]

scribe: need simple-to-use interfaces
... might need to send down specific file formats depending on the device capabilities...
... different behaviors based on battery levels and roaming status
... we want to benefit from the collective wisdom to come up with the next generation of the spec

KJ: how important do you see the field study for this work?

raj: absolutely essential to have an iterative approach
... lots of time is wasted training new users
... so as you're filling out the fields, it tells you how to fill it out

Challenges in Mobile Multimodal Application Architecture, Toshihiko Yamakami

[ Paper / Slides ]

TY: Access is a small browser company based in Japan
... also do VoiceXML browser

[talks about history of Mobile MM]

TY: OMA will terminate work item without Tech specs... only architectural work done

[chart: top mobile subscribers (China, US, Russia, India, Brazil, Japan), and PC vs. Mobile users in Japan (tight race)]

TY: Landscape: bandwidth, CPU, memory all increased, mobile Linux emerged
... improving every day
... needs "spiral evolution" among services, content, and end-users
... need more capabilities and ease of authoring
... obstacles in mobile use:
... encapsulation, multimodal contexts, and content authoring
... generalization and superficial design cause problems, environment changes faster than standards can catch up
... need to customize to user and context of the use case
... need to fill gap between general framework and real-world uses
... need to solve authoring problems... intuitive interfaces, better testing, solve the problem of async processing between multiple entities
... no conclusion, no silver bullet... industry will like to see mobile multimodal apps

Raj: customers want more, but industry not delivering...

TY: yes, but it's also a matter of education... and mobiles are getting more powerful

raj: how is content authoring relevant to MMI... content-authoring vs. application-authoring

TY: same thing to me

Kaz: content as service

TY: browser is enabler

raj: are you still working with IBM on VoiceXML?

TY: well, the product is here

raj: browser vendors are not forthcoming about future capabilities

TY: we focus on narrow markets

Felix: you say encapsulation is a hindrance... does the audience think encapsulation is vital to architecture?

Skip: when you have a distributed environment, it's necessary... not so if it's all on same device

Raj: and encapsulation is also important for distributed authoring

<kaz> topics: encapsulation, authoring

Debbie: may be authored by different people as well as at different times... VoiceXML authors are specialized

TY: multimodal is close to user interactions
... dependent on interaction context and user

<kaz> topics: user/app context

KDDI's position - Workshop on W3C's Multimodal Architecture and Interfaces, Satoru Takagi

[ Slides ]

ST: I developed Web Map... distributed platform for maps using SVG
... SVG Map consortium in Japan
... maps are one of KDDI's main businesses
... mobile devices are more powerful, so we are converging to One Web
... for both mobiles and PCs
... Web Map is killer Web app

<kaz> topics: one web=web on mobile

ST: but functionality is mostly proprietary services

<kaz> topics: static server + light viewer

ST: SVG allows for direct use rather than larger infrastructure
... allows for offline and online use... scalable

<kaz> topics: mobile as a robust infrastructure

ST: requirements for Web are lightweight US with rich functionality, simple static server
... if MMI should be a basic part of the WWW, it should also satisfy these requirements
... needs to work on limited devices, standalone
... RDF integration can work on tiny computers if done well

DS: how are you using RDF on mobiles, which is verbose?

ST: microformats

DS: how are you using RDF?

ST: for geospatial information
... developed in W3C

Debbie: you are using RDF instead of SCXML for Interaction Management, is that right?

ST: it would be integrated with SCXML as the data model
... expressed in query



Saturday, 17 November 2007

Session 5. Panel discussion on gesture and handwriting input modality

Kazuyuki Ashimura
Doug Schepers

Use cases of gesture and handwriting recognition, Sriganesh Madhvanath

[ Slides ]

SM: [describes pen input]

SM: different modalities (pen, voice) are good for different purposes/people
... light pen predates mouse by 5 years
... more interest recently because of tablet PCs
... touch is also exciting lately
... lots of variety in pen devices
... passive and active styluses (stylii?)
... lots of variation on writing surfaces (size, orientation, etc.)

[slide of various pen/writing devices]

SM: touch has many commonalities with pen input
... different operations and capabilities and modes
... all indicated as different "channels"
... different UIs... pen-only, supplemented by pen, pen as mouse
... pro: fine motor control
... con: limited by hand movements sequentially

<kaz> Pen functions: point, select, write, draw, gesture, sign, text, figure, add info to picture

SM: common functions: pointer, selector, writing, drawing, gesture, sign...
... loosely broken down into modalities like tap, write, gesture, draw
... including handwriting recognition
... can have buttons on barrel
... pen input as data... writing or drawing... uninterpreted in raw form, can be interpreted if needed
... use cases: white boards, signatures
... pen input as annotation
... almost anything visual can be annotated
... "inline" overlaid, or "attached" as reference
... difficulty in precisely associating with source content
... pen input as gesture
... can be used in combination with speech
... very popular form of input
... gestures can launch applications or trigger functionality
... gestures are lightweight form of personalization (compared to passwords, etc.)

<kaz> topics: gestures as commands for browsers

SM: gesturing can be context-dependent or generalized... might use gesture over shape to annotate or do other things to it
... pen input for text recognition is not the most popular... pen input for IMEs is more popular
... different interfaces can be tapping or pattern-based... can learn via muscle memory
... can use autocomplete for partial character input
... "combinational gesture and selection"

<kaz> topics: error correction (=disambiguation) using GUI in addition to Ink

SM: sketch-based interfaces...
... draw a rough square and it turns into a perfect square
... flow charting
... searching image repositories

[demos "Fly Pen"]

<kaz> topics: Pen devices and InkML

SM: Ink + Other Modalities
... ink as note taking is old idea

<kaz> topics: multimodal applications and W3C's MMI Architecture (including InkML)

SM: for lectures, whiteboard, brainstorming, photo sharing
... writing while speaking
... drawing + speech
... speech changes mode
... Gesture + speech (interpreted)
... maps: "put that there"
... Integration: tight coupling

<kaz> ink information interpreted by application

SM: challenging to determine mode automatically... circle could be mouse movement, selection gesture, O or 0... leave as ink?
... contextual inferences (text field is data, button is GUI)
... loose coupling
... app doesn't interpret pen input...
... abstractions into mouse movement
... scalable, but no access to rich ink data
... what is a pen? finger? wii (3d)? keitai with accelerometers?


Kinesthetic input modalities for the W3C Multimodal Architecture, Florian Metze & Ingmar Kliche

[ Paper / Slides ]

JK: gives intro... Wii, twiddler

scribe: accelerometers (spatial translation), optical input (cameras...rotation and translation)
... single-handed operation
... at sensor level, 2 approaches

JK: uses: gaming, list navigation, gestures, context recognition

scribe: degree of tilting can give natural gestural input

IK summarizes chart regarding use cases

scribe: continuous vs. discrete, relative vs. absolute

JK: Deutsche Telekom uses it in "MediaScout" to make suggestions for video/music content, allows you to rate it

scribe: voice and haptic feedback
... uses camera to interpret movement, no additional hardware needed

[video demo]

scribe: "device nodding" as gesture
... for navigating and selecting in GUI

JK: implemented in Windows Mobile

scribe: GUI is HTML+Javascript

<karl> it reminds me of http://www.youtube.com/watch?v=A3MfQIswl3k

<karl> iBeer

<karl> ;)

JK: ActiveX detects device capabilities (inertial sensors vs. camera)

scribe: abstracts input out into event generation for predefined motion patterns

<karl> other interesting things with kinetics. Games - http://www.youtube.com/watch?v=Wtcys_XFnRA

JK: difficult to add new gestures

scribe: might be better to allow gesture libraries or gesture markup

JK: client-side interface now, but working on server-side option

scribe: using server-side Interaction Manager using W3C MM Framework
... all browser-based
... calculation of system response important
... introduces latency

JK: client-side vs. server-side

scribe: client is faster, but simpler, depends on device capabilities
... server more powerful, but introduces latency and needs connection
... combination also possible

Panel discussion: what kind of data/events/APIs should be used for gesture and handwriting input?

[discusses architecture]

JK: uses MMI-Lifecycle via EMMA structure... made small extensions

JK: not only kinesthetic, but haptic responses (vibration)

scribe: proprietary extension
... provides feedback for gesture recognition

JK: open issues

scribe: latency
... pattern-matching rules

JK: conclusion

scribe: did study, users dig it
... how to bring this functionality into today's browsers
... some trust issues
... how to bring into Web applications?

raj: how do you process multiple gestures in event sequence, with regards to multiple starts before an end
... especially regarding combinations with voice input?

JK: right now, only the abstracted event (up, down, left, right) is sent up to server for processing... based on pattern recognition

raj: you might introduce disparity between gesture events and synchronized voice input

JK: to some degree, can be done in parallel, where end point is when the user stops moving

KJ: how can you make distinction between gestures and noisy movement?

JK: yeah, that's a problem

KJ: might have a threshold for movement

DS: is there a mode selector

JK: yes, a "push-to-move" button

Kaz: also can be done with speech
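
The points raised in this exchange (abstracted up/down/left/right events, a movement threshold against noisy motion, and a "push-to-move" mode selector) can be combined in a small Python sketch. It is a hedged illustration only: the function classify_motion, its parameters, the sign conventions, and the threshold value are assumptions and do not describe the presenters' implementation.

```python
from typing import Optional


def classify_motion(
    dx: float,
    dy: float,
    threshold: float = 1.0,
    push_to_move: bool = True,
) -> Optional[str]:
    """Map a displacement to an abstract gesture event, or None.

    dx, dy: horizontal and vertical displacement reported by the sensor
            (assumed convention: positive dx is right, positive dy is up).
    threshold: minimum displacement along the dominant axis; smaller
               movements are treated as noise.
    push_to_move: state of the mode-selector button; motion is ignored
                  unless it is held.
    """
    if not push_to_move:
        return None
    if max(abs(dx), abs(dy)) < threshold:
        return None
    if abs(dx) >= abs(dy):
        return "right" if dx > 0 else "left"
    return "up" if dy > 0 else "down"


# Hypothetical sensor readings.
jitter = classify_motion(0.2, -0.1)   # below threshold, ignored
swipe = classify_motion(3.0, 0.5)     # dominant horizontal movement
nod = classify_motion(0.5, -2.0)      # dominant vertical movement
```

Only the resulting event name would need to be sent to a server-side Interaction Manager, which keeps the network traffic small at the cost of hiding the raw motion data.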

DS: does the camera use a lot of battery power? can you use the camera at a lower-level software than a move?

FM: not at this time, but we're looking at it

SM: why cameras vs. inertial? and what's haptic vs kinesthetic?

JK: camera is already in phone

KM: haptic is touch (feedback), kinesthetic is movement (interface and input)
... are people interested in this problem? games and such use proprietary means to do this

DS: yes, but we need to make these events available in specs that desktop browsers are implementing, with limited scope

Karl: let's reuse existing components

JK: we had to do this in proprietary way... no standard

<karl> http://lifehacker.com/software/gaming/gaming-with-the-macbooks-accelerometer-181203.php

MG: now we expose hardware interfaces through ActiveX control

scribe: if you don't want to use activeX, OS vendor has to expose these some other way...
... do you plan to do your interface in Symbian, for example?

JK: not at this time

FM: Android OS is supposed to expose this as well

MG: I'll give feedback from this workshop to our PM in China

raj: any toolkits you know for QBI (querying by image)?

SM: the most advanced ones are not open... still being explored

IK: are there any of these interfaces using inkML?

SM: most matching is done by nearest-neighbor algorithm... may need to have mediating format for InkML

KJ: fascinating, but from a user point of view, how can each of these modalities be optimized per task?
... would be nice to have a study

SM: there's been some research into speech+gesture, with some categorization of people who use each or both... depends on cognitive mode
... people use modality they are most comfortable with
... saw a paper on wheelchair-bound pen-control of a smart house, but most users used modes they were more familiar with, like voice
... task complexity has a lot to do with it... switching on a light is easy with a pen, but using a map to query for direction is a more complex task
... it would be best to create models per user
... one objective of this group is to integrate different modalities into standard interfaces

raj: the MMI architecture is not just for speech or pen or other specific modalities, but is open to different modalities, based on lifecycle event

Debbie: we need to be higher level, operating on notification basis
... we don't want to define universal APIs

Skip: if we can't do this standardizing, then someone else will
... it's useless unless devices standardize

Debbie: existing speech APIs are diverse
... hard to standardize

<kaz> MRCP is a protocol for speech interface

Skip: we have to work at a device-independent level

Debbie: might be outside the scope of MMI

Skip: I think someone needs to do it, not sure who


Session 6. Advanced requirements for the future MMI Architecture

Tsuneo Nitta
Kazuyuki Ashimura
Relevant topics:
  • Device capability
  • Reuse of existing mechanism
  • How to disambiguate possible meanings using multiple modalities
  • Language specific difficulty
  • User identification

IBM's position - Workshop on W3C's Multimodal Architecture and Interfaces, Mitsuru Shioya, Seiji Abe & Daisuke Tomoda

[ Slides ]

dai: IBM is interested in MMI

dai: I/O device capability
... mobile/desktop/phone
... application use case
... navigation on mobile phone
... framework: client/IM/voice server/web server
... requirements: how to describe device capability
... how to transfer that information to apps
... proposal: device capability definition
... some XML specification
... proposal2: dynamic modality change
... Re-initialization of device
... user setting/preference
... modality change trigger
... interface between device and MC
... proposal3: lifecycle event: Data event should contain device state change
... proposal4: device dependent state control
... summary:
... dynamic switching of state control
... esp. timing of behavior change
... session control
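
[Illustrative sketch, not from the presentation: how a device capability description and a data event carrying a device state change (proposals 1 and 3 above) might look. The minutes do not record the proposed XML format, so every class, field, and key name here is hypothetical and is not the MMI Architecture's actual event syntax.]

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DeviceCapability:
    """Hypothetical capability record for one device."""
    device_id: str
    modalities: List[str]


@dataclass
class DataEvent:
    """Hypothetical stand-in for a lifecycle data event."""
    source: str
    data: Dict[str, object] = field(default_factory=dict)


def modality_change_event(device: DeviceCapability, available: List[str]) -> DataEvent:
    """Build a data event describing which modalities were added or removed."""
    before, after = set(device.modalities), set(available)
    change = {
        "added": sorted(after - before),
        "removed": sorted(before - after),
    }
    # The "deviceStateChange" key is an invented name for this sketch.
    return DataEvent(source=device.device_id, data={"deviceStateChange": change})
```

[For example, a phone that offered voice and GUI and now offers GUI and ink would produce an event reporting ink as added and voice as removed; an interaction manager could use that as the modality change trigger.]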

raj: not sure how this proposal is related to DCCI

ingmar: client driven change of modality etc.

raj: device status needs to be watched then
... asynchronously
... in addition, need to have device context/capability through IM

ingmar: device capability ontology activity by UWA WG might be useful

raj: for device controller

debbie: how to make IM responsible for device management?
... do all kinds of transitions/information on device capability need to be transferred to the IM?

raj: you don't need to communicate all the details to IM (to the presenter)

ingmar: the point is how to handle device capability with MMI Architecture and its event

debbie: please revisit the figure mentioning the events

(presenter moves to the page)

kaz: what is your concrete proposal/idea on device capability management?
... who do you think should manage that information
... and how?

raj: current DCCI has no support on how to transfer device information to IM etc.

debbie: this picture seems a nice use of events for device information

doug: related to DOM

raj: that's more of an implementation question

<ddahl> actually, i think the StatusRequest event might be a good way to query devices about their capabilities

doug: curious how browser vendors are interested in this

topics: how to integrate device capability handling into web browsers

Polytechnic University - Workshop on W3C's Multimodal Architecture and Interfaces, Kunio Ohno

[ Slides (TBD) ]

kunio: various modalities
... home network
... existing protocol: DLNA, HDMI
... 10 foot UI
... something better than text input needed: voice, ink?
... digital TV
... evaluation: key stroke level model, speech act theory
... digital TV menu
... summary:
... mobile phone novel: Japanese layout?
... digital TV interfaces

tsuneo: digital TV is important use case

kris: interested in speech act theory

kunio: it's referred to for a Web-based manual for that purpose

srig: (scribe missed srig's comment)

<ddahl> i think that's the sense he's using

kaz: interested in evaluation

kunio: basically keystroke based evaluation
... maybe need 2-3 years, though

kaz: do you think there is any Japanese specific specification/mechanism for MMI?

kunio: input JP character (Kanji etc.) is difficult
... Ruby or vertical writing might be also difficult

felix: a scenario for one-hand control might be useful, considering we need to hold the device with one hand and can use only the other hand for input


Session 7. Review issues raised during the workshop

Kazuyuki Ashimura
Karl Dubost

Review discussion list

kaz: I will go through the issues raised during the two days of discussion.
... do you want to add things to this list?

debbie: do you think that dcci issues are part of it?

kaz: good point.

debbie: we should do metacategories for this list.
... and we also talked about APIs like modalities APIs
... something like life cycle events, EMMA.

raj: we want to add also two things
... 1. Interaction Manager
... 2. somethingA

debbie: can you add Encapsulation and tight/loose coupling

kaz: We can categorize the topics and discuss some of them.
... We could create an authoring category

kaz lists categories on the whiteboard given by participants

* Authoring

* Integration and APIs

* Distribution and Architecture

* User experience and accessibility

ingmar: We wanted, for example, to know how to describe motion components, which create a modality by themselves.

debbie: Speech grammar is out of scope

kaz: then maybe it should be modality component specific grammar.
... where should we put it? User experience?

debbie: isn't it authoring?

[kaz continues to go through the list of issues]


skip: Interaction manager is a messaging infrastructure
... How do you discover particular components?
... How does a component advertise itself?
... part of the issue is a publishing/advertisement messaging infrastructure framework.
... GMS has one for example
... and we do not have to define it at all because it is defined in other standards.


debbie: are these the kind of things that device description already gives?

raj: hmm not only

skip: It is not only about technical things like screen, but also audio, smell, haptic, etc.
... some modality components have different samplings
... and depending on the devices you want to change the sampling modalities

debbie: I'm not sure if it's a good match
... but the ubiquitous group has defined ontologies of devices
... we should really look at the work of this group

skip: There is a lot of diversity in the description of components and modalities

kaz: Do you think that device capabilities management should be part of the Architecture?

skip: for example, an audio device can record a sound file. And that's just it

ingmar: I would put it in the architecture domain

kaz: topic on synchronization


An Introduction to Device Independence

debbie: I thought I heard something about persistence.

raj: I need to know the preferences of each user
... the interaction manager requests how the data should be presented.

sriganesh: if you want to change the full user experience, it has to be dealt with at the interaction manager level.

raj: user preferences could affect more than one device
... sequences are business logic, it is not at application level

skip: The interaction manager has to know the user preferences anyway

raj: I have a different idea of an interaction manager
... for me it is only a message broker between different modality components.

skip: Then we need an application tower in the system

debbie: We have everything intertwingled now, we want to find the different location of parties. The information from applications has to be somewhere.

ingmar: what is the interaction manager?

raj: we have to define what you mean by interaction manager

speakerA is explaining what he thinks about interaction management

debbie: This discussion shows that we have to pin down what interaction manager means

[participants explaining what are their different ideas of what is an interaction manager]

shepazu: could you define what life cycle events are?

debbie: stop, pause, forward, etc.
... we are busy working on how the modality components work

shepazu: a tutorial should help to understand the technology

kaz: what do you mean by "tutorial"?
... primer kind of document for implementors/authors?

shepazu: something which explains the vocabulary.

debbie: did you read the spec?

shepazu: I skimmed through

debbie: Ingmar is working on something.

karl: a 101 document would be very good.

debbie: yes, we have had that request a few times, and we might not be understood by other people

raj: it has happened that people didn't understand what we said.
... ambiguities for example.

Session 8. Conclusions and next steps

Deborah Dahl
Felix Sasaki

Finish writing issue lists

[ Issue list gathered through this workshop ]

kaz goes through the list

Debbie: need to discuss: what is in the interaction manager?
... Kristiina has pulled out some functions of the IM
... Prof. Nitta had some functions too.

Florian: Depending on your approach (e.g. hierarchical, agent based)
... you will have different approaches. We might not find consensus here

Debbie: just want to list functions, not evaluating them

Kristiina: we know from current appls that there is a need for discussion of IM
... e.g. application specific IMs
... application and interaction data is closely connected
... however they need to be separate components / modules

Debbie: message routing also an issue

Ellis: for me there is message router, sensor, thing which generates messages at application level

Kristiina: need space in architecture and implementations, since each application needs some specifics
... we should not decide things too closely

Debbie: "modality components communicating directly" issue?

Ellis: depends on (physical) distance between components whether they can (should) communicate directly

Raj: application author decides this

Ellis: MM component could subscribe to a specific event of a different component
... depends on the intelligence of the component(s)
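
[Illustrative sketch, not from the discussion: a minimal publish/subscribe mechanism of the kind Ellis describes, where one modality component subscribes to a specific event of another. The class, method, component, and event names are hypothetical; the MMI Architecture does not define this API.]

```python
from typing import Callable, Dict, List, Tuple

Handler = Callable[[dict], None]


class EventBus:
    """Tiny in-process router keyed by (source component, event name)."""

    def __init__(self) -> None:
        self._subscribers: Dict[Tuple[str, str], List[Handler]] = {}

    def subscribe(self, source: str, event_name: str, handler: Handler) -> None:
        """Register `handler` for one specific event of one specific component."""
        self._subscribers.setdefault((source, event_name), []).append(handler)

    def publish(self, source: str, event_name: str, payload: dict) -> int:
        """Deliver `payload` to matching subscribers; return how many were notified."""
        handlers = self._subscribers.get((source, event_name), [])
        for handler in list(handlers):
            handler(payload)
        return len(handlers)
```

[Whether such routing happens directly between components or through the interaction manager is exactly the open issue raised above; this sketch takes no position on where the bus lives.]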

Debbie: accessibility topic?

Raj: it's an authoring issue
... may not always conform to accessibility guidelines

Ellis: let the user pick his modality

Doug: e.g in a noisy environment
... the more modalities identical, the easier they are to author

Ellis: developer specifies high-level requirements, e.g. "give account balance", and modality components implement that on lower level

Debbie: we talked about time outs and accessibility, don't remember other aspects of accessibility

Karl: haptic and kinetics accessibility issues?

Debbie: yes, like mouse click problems

Karl: or voice problems

Ellis: modality component should deal with that

srig: standard names for input and output modalities

doug: don't know how to integrate the work of MMI with my world (e.g. DOM, SVG)
... e.g. events in DOM

Ellis: voice xml could generate necessary events

Debbie: voice xml 2.1 browser can be controlled by CCXML

Doug: how will a script author approach that?
... person talked into a phone, I will print it on the screen
... we have all pieces, need to put them together

Ellis: present the scenario to voice xml WG
... they have all pieces to put this together

Debbie: voice browsers have to respect inline markup
... currently all needs to be in a voice xml form

Doug: how to integrate InkML? How to integrate voiceML? How to describe device capabilities?
... these questions are important for web application authors (our main market)
... serialization of InkML in SVG might be interesting

Ellis: vxml needs granularity, but would be good too to send SVG to a recognizer and get the text

Doug: having an abstract layer would be good for the application author

Ellis: yes, he could enable all modality components
... spoke, pen etc. the words "get my account balance"
... user preferences show how the user wants the information back
... that's the ideal architecture
... components create a semantic event up, business logic processing happens, and the result is sent back

Kristiina: you are talking a lot about text, but we are dealing with a lot of non-textual information
... people may take pictures and send out photographs
... e.g. museum application

Ellis: picture of three people, saying "who is who?". I send it to my appl. and get the information back
... for me that's the same way of processing

Debbie: EMMA provides the semantics

Doug: would like to have REX 2 to send events

Debbie: not sure about the status of REX
... original REX has some problems

Ellis: two different entities:
... modality component takes handwriting and voice
... but recognition algorithms could be independent of these components

Debbie: looks like a BP for specific systems
... but not something general for MMI architecture

Ellis: Doug's use case is just to send information back and forth without thinking of details
... at the business logic level

srig: application manager can already send scripts to modality components, so we provide this functionality already, no?

Debbie: MMI architecture allows you to have more fine grained voice components, it's just not part of voice XML

Ellis: would like to be independent of specific modality components

Debbie: so a light-weight API

Ellis: yes

Doug: not necessary to send events to DOM, but other things are possible too

Ellis: I like idea of light-weight interface
... e.g. one simple process of recognition, conversion, TTS etc

Doug: with such an API you should be able to say "I will give you this, I will get back that"

Felix: question on time line

looking at http://www.w3.org/2006/12/mmi-charter.html

Karl: did you look at what is existing in the industry?

Debbie: looked at galaxy architecture, developed in 90's (DARPA)
... not many people publish commercial MM architectures
... there was SALT and something else at the beginning of MMI, but that did not fit
... because of need to work distributed and have extensibility for new modalities

Raj: browser-centered view of the universe is not flying for us

Kristiina: agree with last point.
... e.g. many universities have built their own systems, it is interesting to compare these

W3C participation recap and the next step

Kaz: concludes the workshop by explaining the possible next step as follows

[Workshop ends]

The Call for Participation, the Logistics, the Presentation Guideline and the Agenda are also available on the W3C Web server.

Deborah Dahl and Kazuyuki Ashimura, Workshop Co-chairs

$Id: minutes.html,v 1.8 2007/12/10 12:58:56 ashimura Exp $