Deborah introduces the workshop
Self introduction of participants
Logistics information by Kaz
Kaz mentions the workshop schedule and logistics.
[ Slides ]
Kaz presents http://www.w3.org/2007/Talks/1116-w3c-mmi-ka/
[ Slides ]
Debbie presents http://www.w3.org/2007/08/mmi-arch/slides/Overview.pdf
fsasaki:
combining various modalities, you could create a timeline of
various time offsets, but I didn't see these in the list of
events
... so I'm wondering if that's embedded in EMMA, or... ?
Debbie: such information might be
available in EMMA. It is also an idea to have that information
within the interaction manager
... but we don't have that yet
Doug: have you thought about a distributed DOM?
Raj: architecture supports that kind of information in principle
Debbie: but we do not support directly a distributed DOM
Doug: who will implement EMMA or this architecture? Browsers, special clients?
Raj: depends on the place of the modality components
debbie: some components combine input and output. There is no constraint that a component is only input or output
araki-san: Is there a fusion component in the current version of the architecture?
araki-san: If a component is nested, fusion gets difficult
debbie: insides of a nested component are outside of the MMI architecture
Ingmar: there is not necessarily
a user interface
... components could be kind of abstract
[ Slides ]
Kaz presents http://www.w3.org/2007/Talks/1116-ink-ka/
Ali: in some applications you will need more information than the points, e.g. also time information
SriG: time is one possible channel
doug: what is the use of the tilt of a pen?
kaz: if we use a kind of brush, the tilt is very important
Shioya-san: can InkML help retrieve how a person feels from their handwriting?
SriG: InkML is independent of
such applications
... emotion or form filling can be done with a basic annotation
element in InkML
... or you can create a different language and embed InkML
in that language
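[Illustrative sketch, not from the presentation: a minimal, hypothetical InkML fragment showing extra channels (e.g. time) declared in a traceFormat and an application-level meaning attached via a generic annotation. Element placement and names loosely follow the InkML drafts; the channel set and the "purpose" annotation type are invented examples.]
<ink xmlns="http://www.w3.org/2003/InkML">
  <definitions>
    <!-- declare which channels each trace carries; the time channel T is optional -->
    <traceFormat>
      <channel name="X" type="decimal"/>
      <channel name="Y" type="decimal"/>
      <channel name="T" type="integer" units="ms"/>
    </traceFormat>
  </definitions>
  <traceGroup>
    <!-- application-level interpretation, ignored by consumers that only want raw ink -->
    <annotation type="purpose">form-filling</annotation>
    <trace>10 20 0, 12 21 16, 15 23 31, 19 24 47</trace>
  </traceGroup>
</ink>
[A consumer that only cares about the raw points can ignore the annotation, which is the sense in which InkML itself stays independent of the application.]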
Ingmar: there is an incubator at W3C working on emotions, you might look into that
Shioya-san: it's a complicated topic
Debbie: InkML is like a WAV file
for audio. It is a low-level representation
... e.g. hand-writing recognition is a different level than
InkML
[ Paper / Slides by Jokinen / Slides by Wilcock ]
[Kristiina Jokinen presenting]
http://www.w3.org/2007/08/mmi-arch/papers/w3positionpaper.pdf
[video demo of speech-recognition in noisy environment]
[Kristiina walking us through the slides]
MUMS - MUltiModal navigation System
Input Fusion (T. Hurtig)
Interact system / Jaspis (Turunen et al.)
agent-based system
... agent selection based on particular heuristics
Adaptive Agent Selection
"estimate evaluator" selects best available agent
Conclusions [slide]
includes "Tactile systems seem to benefit from speech recognition as a value-added feature"
Raj: can you elaborate on details of users?
KJ: Interesting thing:
Expectations biased user evaluation of the system ...
... if they thought they were talking to a speech system that
merely had this pen added ...
... their reaction was bad ...
... but if they were working with a pen with speech as an added
feature ...
... their reactions were more positive ...
... but in both cases it was exactly the same system ...
Raj: for TTS over network, [latency is very annoying to users]
KJ: yes, that is a problem with
mobile applications [of this technology]
... one other thing we found is that when we divided users into
age groups ...
... interesting to note that young, under age 20 users
... and age 55+ users ...
... the young users really had positive expectations ...
... but the middle-aged users had low expectations ...
MSh: But did you also consider the skill level of the users?
KJ: Younger users are more prone
to different types of gadgets ...
... so they [seem to have] already collected experience
... need to consider adaptation of the system -- interaction
strategies -- when the users [have trouble with the system]
Ali: About gesture interpretation: What is the technique to select objects [on screens that are small and crowded with such objects]
KJ: If you have some good suggestions I would be happy to get them :)
[Ali provides some examples]
KJ: Have to take into account that the user can actually be wrong
Ali: I think this is actually a problem with all such systems
kaz: and we can talk about those kinds of issues related to disambiguation using multiple modalities tomorrow morning in Session 6 :)
Ingmar: Is your system actually event-based?
KJ: yes, it is basically event-based and asynchronous
Ingmar: Looks like you have split up [the managers into your architecture] into smaller parts that communicate with each other [across the managers]
KJ: there are abstract devices
that these managers define
... each manager checks if the shared information storage is in
one of the states [that one of its agents expects]
Ingmar: any declarative languages for this system?
KJ: Java based...
Ingmar: How much of those
components are application-specific?
... have you looked to see if the workflow through the system
can be generalized?
KJ: though some of the library components could be shared, most parts depend on each application
Ingmar: Is there a reason why you are not using EMMA?
KJ: when the first version of the system was built, we did not know about EMMA yet :)
Ingmar: If you could look into that and provide information back to the group, would be much appreciated
slide: Paradigm-breaking examples
Raj: The components per-se do not
have to be executing scripts ...
... so, there is already asynchronicity present
slide: Questions: how does an
Interaction Manager handle modifications to an ongoing process,
parallel processes?
... Possible New Lifecycle Interaction Modes
... [discussion of possible solution to deal with problems of
modify events and parallel events]
<kaz> so "modify" means some kind of "modifier" like [SHIFT] or [Ctrl]
<kaz> and "parallel" is kind of superimpose
slide: a benefit of this proposal is that it does not require developers to write asynchronous event handlers on modality components
Raj: You can use a privilege-based system to control access to, e.g., speakers
Skip: You don't always want to add that kind of overhead, especially for simple systems.
DD: The modify proposal reminds
me of DOM modification
... wondering if that might be too fine-grained
... might be able to get away with something a bit more
abstract
... so that interaction manager doesn't have to know how to
handle DOM modification
... If we wanted to implement that modify proposal, we would
need to think about what the API would look like
Florian: If the first "done" comes back too quickly, then you have a problem ...
Ingmar: maybe it makes sense to
place limits on how multiple starts [and such] are
handled
... [discusses other specific restrictions that would make this
more implementable]
<kaz> ... whether to allow parallel/simultaneous/multiple invocations or not
<kaz> ... sharing/mixing media control
Skip: mixing and replication are
relatively simple
... keep in mind the media component really has only one input
stream and one output stream
<kaz> maybe Skip is suggesting a sub-Interaction Manager named "media stream controller"
DD: I note multiple mentions of
fusion ... something we have not discussed in the group for a
while [but maybe should]
... I also heard some interest in timing, at the level of the
Interaction Manager
Ingmar: Do you have a specific use-case for this that illustrates what you actually mean by "timing"?
[we note that Felix brought up this issue this morning]
Ingmar: This is covered in EMMA, actually
Felix: Does EMMA define identity of two sequences or other temporal relations, etc.?
DD: Basics are in EMMA, but it's not really defined.
shepazu: but it probably should not really be defined in EMMA
DD: kaz pointed out that EMMA has concept of relative timestamps
kaz: lattice functionality might be available for that purpose
DD: We have requirements from accessibility people related to timeouts
Raj: [to Felix] the information you would need is available from EMMA
Raj: I am looking for relations like these: http://en.wikipedia.org/wiki/Allen's_Interval_Algebra - are they all available in EMMA?
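[Illustrative sketch, not from the discussion: EMMA 1.0 annotates each interpretation with absolute emma:start/emma:end timestamps in milliseconds, so a consumer can compute interval relations such as Allen's "overlaps" itself; EMMA does not name those relations. The values and payload elements below are invented.]
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <!-- a spoken input -->
  <emma:interpretation id="speech1"
      emma:medium="acoustic" emma:mode="voice"
      emma:start="1195200000000" emma:end="1195200001500">
    <destination>Yokosuka</destination>
  </emma:interpretation>
  <!-- a pen input that starts before the speech ends -->
  <emma:interpretation id="pen1"
      emma:medium="tactile" emma:mode="ink"
      emma:start="1195200000800" emma:end="1195200002000">
    <location>map-point-42</location>
  </emma:interpretation>
</emma:emma>
[Here the speech interval starts first and ends inside the ink interval, i.e. the two inputs overlap; other Allen relations can be derived from the timestamps in the same way.]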
shepazu: mentions case related to SMIL and slow-down
Skip: VoiceXML has features related to speed-up and slow-down
shepazu: the rate at which most real-world users of screen readers have their readers set is so fast as to be incomprehensible [to a mere mortal such as I]
DD: another use case for parallel
audio might be simultaneous translation
... and may have, e.g., 4, 5 different users listening to
different translations
KJ: about timing, I was wondering about error management, and about multiple users, wondering about collaborative situations where users can affect what other users receive
Ingmar: Looking at the current
spec, it handles those cases.
... About the Fusion module, is this something that should be
handled inherently in the architecture?
... as a generic [part] of the framework
Raj: error handling is a big problem from the authoring perspective
Ingmar: Isn't this the point of common-sense guidelines, and don't we have a doc for that already?
DD: Do we expect the Interaction Manager to do too much?
KJ: Thank you for saying that.
Raj: <sync> tag, for example, is very useful
<kaz> topics: fusion of stream media, timing control, error handling, multiple users, accessibility, IM too much?, higher level language than SCXML?
masa: joint work with Prof.
Nitta
... using Multimodal Toolkit Galatea
... will mention some issues on W3C's MMI Architecture
... tackling another MMI Architecture for Japanese based
standardization
... viewpoint is rather research based while W3C's one is
implementation based
... Galatea Toolkit includes: ASR/TTS/Face generation
... (note TTS and Face generation are managed by a sub
manager)
... comparison to W3C's MMI Architecture
... Modality Component is too big for life-like agent...
masa:
consider several examples:
... e.g. 1 lip sync functionality using W3C's MMI Architecture
... e.g. 2 back channeling
... in both cases too many connections between MC and IM would be
required
... another question is: fragile modality fusion and
fission
... how to combine inputs???
... e.g. speech and tactile info
... e.g. 2 speech and SVG
... contents planning is needed
... third question is: how to deal with user model?
... changing dialog strategy based on user experience
... history of user experience should be stored
... but where to store???
... how about going back to previous "multimodal
framework"?
... which has smaller MC
... in addition, should have separate transition: task,
interaction, fusion/fission
... started with analysis of use cases
... then clarified some requirements
masa:
and now would like to propose 6 layered construction
... now planning to release: trial standard and reference
implementation
... explains the event/data sequence of the 6-layered model
... L1. I/O module
... L2. Modality component: I/F data is EMMA
... L3. Modality fusion: output is EMMA/Modality fission
... L4. Inner task control: error handling, default
(simple) subdialogue, FIA, updating slot
... L5. Task control: overall control
... L6. Application
masa: but do we need some
specific language for each layer?
... don't think so
... summary: propose this "Convention over configuration"
approach
kris: what is the difference between L4 and L5?
masa: client side vs. server side
raj: there is a focused discussion session and we can talk about the details there
ingmar: what if you have small and many MCs?
masa: actually this architecture doesn't care about uni- vs. multi-modality
skip: what do you mean by "W3C's MC is too large"?
masa: please see this lip sync
app
... there is no way for MCs to interact with each other directly
... they need to connect through the IM
skip: so you mean TTS should be able to directly talk with other modalities?
masa: exactly
skip: same issue I pointed out :)
... so you suggest lower level information connection?
masa: lower connection
implemented using layer 2
... can be handled by a lower level, e.g. devices
kaz: where should device-specific information be handled? layer 1?
masa: could be handled by an upper layer? (scribe missed the rest...)
ingmar: looking at the 6-layer
picture
... do L4-L6 correspond to the RF?
... why not L3-L6?
raj: thinks your model is better, e.g. it mentions user model/device model
masa: user model is actually vertical layer, though
ingmar: what is the relationship between layers and languages?
raj: not necessarily explicit languages
debbie: any other questions?
srig: very useful for research :)
doug: SVG, WebAPI, CDF
contact
... SVG: not only desktop but also mobile
... Web API: DOM, script and dynamic content, auxiliary tech
e.g. Ajax, Selectors
... CDF: mixture of (X)HTML, SVG, MathML, etc.
... and WICD
... who are the consumers of MMI???
... 1. implementors
... 2. integration
... 3. client-side functionality
... some questions
... 1. is Ink an API or a markup language?
... set of channels expressed in pointer event?
... 2. SVG as drawing app?
... pressure info as well
... 3. VoiceXML handles DOM?
... integration of Voice and SVG?
... 4. DCCI: geolocation, battery life, etc.
... 5. Multiple touch: specifying line using 2 simultaneous
touches etc.
... and multiple users
... challenging from a DOM perspective
<karl> http://www.youtube.com/watch?v=0awjPUkBXOU -> hacker using the Wii in a funny way.
doug: how to liaise?
... direct liaison or inspiration/use cases evaluation?
... SVG as data/event source
<karl> http://developer.apple.com/documentation/AppleApplications/Reference/SafariWebContent/HandlingEvents/chapter_8_section_2.html -> one finger event on the iphone
doug: a user's specification, like
drawing a circle, could be stored as an event
... distributed DOM
... intersection???
... what's the common infrastructure?
... DOM integration
... implementor focus
... how to encourage users/implementors?
karl: what Doug talked about is very
important.
... We have to capitalize on the hacking trends, on the
Web community already developing cool stuff, and extend it if
necessary.
... It will be a lot easier to work with than on a
separate thread.
doug: it's important to
understand what is happening within W3C
... e.g. InkML
... compatibility of specifications etc.
... liaison is important
... for implementors
debbie: on one hand we have PC
world, and on the other hand we have Voice world
... e.g. VoiceXML 2.X doesn't handle events
... the interaction between these different worlds is
important/interesting/problematic
... from the Voice viewpoint VoiceXML is a natural candidate
modality
... while the GUI modality is a bit more difficult to handle
Raj: W3C originally considered
desktop modality...
... we have not been considering multiple
devices/modalities
... we had to handle some of the specs as a "black
box"...
... e.g. DOM
doug: CDF has two approaches:
reference and embedded
... in the case of reference, the "black box" approach should work
... security is another difficult topic
skip: Voice Browser WG is
considering event handling
... e.g. record, play, recognize
doug: DOM3 events have several essential commands
mike: we should
remember the importance of involving various web browser
vendors.
... maybe starting with a plug-in might be a good idea
[ Paper / Slides (TBD) ]
raj: we've been waiting for a
state structure... SCXML looks useful
... we want to standardize these things so they work in the
real world
... we need to identify the challenges we face
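[Illustrative sketch, not from the presentation: a minimal SCXML skeleton of the kind typically proposed for an Interaction Manager; the state and event names are invented.]
<scxml xmlns="http://www.w3.org/2005/07/scxml" version="1.0" initial="idle">
  <state id="idle">
    <!-- hypothetical event raised when a modality component reports user input -->
    <transition event="newUtterance" target="collecting"/>
  </state>
  <state id="collecting">
    <!-- return to idle once the inputs have been integrated -->
    <transition event="integrationDone" target="idle"/>
  </state>
</scxml>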
raj: devices are getting more
advanced, but we need browser functionality to catch up
... will make slides available later
... our presentation includes a field study done by our
engineers
... regarding the functionality of the UIs
... major challenge is disparate functionality between
devices
... need embedded TTS and ASR client on device
... devices and applications must be generalized to different
tasks, challenge in aggregating functionality
... having functionality about device state (connectivity,
battery life) would be valuable for back end processing
... we have unsophisticated users ("blue collars") who need to
be able to communicate naturally ("I want that")
... our study really captures how necessary this is
... mixed mode is necessary in field (even if it's not
accessible)
... mixed-initiative input method lets you fill out multiple
fields at once (like an address, all in one breath)
... also combining voice and pointer events in a single
interaction
... this saves a lot of time... 8% efficiency increase
... we're XML-centric
... mobile devices don't have XML events and other key
technologies
... can't send events, so need other ways to solve the
problems
... Access and Opera don't let me pass events to local TTS/ASR
engines...
... one of our guiding principles is the ability to nest
components in a recursive way
[references interaction manager as example of component aggregation]
raj: one of the biggest
challenges is assuming that all the components boot up in a
consistent order, with no drop-outs
... this is important, for example, for modules for different
languages (English and Spanish)... so we have to query what the
current device capabilities are
... we use a kind of registry to show what components are
available, and we pass along our capabilities to the
application
... for example, if the user has pressed the mute button, we
don't want to send the sound down the line
... the problem with the encapsulation model is that it's
mediated through an Interaction Manager, and we need to have 2
modes (for offline vs. online behavior)
... at short notice
... while maintaining compliance to MMI architecture
[shows demos]
scribe: need simple-to-use
interfaces
... might need to send down specific file formats depending on
the device capabilities...
... different behaviors based on battery levels and roaming
status
... we want to benefit from the collective wisdom to come up
with the next generation of the spec
KJ: how important do you see the field study for this work?
raj: absolutely essential to have
an iterative approach
... lots of time is wasted training new users
... so as you're filling out the fields, it tells you how to
fill it out
TY: Access is a small browser
company based in Japan
... also do a VoiceXML browser
[talks about history of Mobile MM]
TY: OMA will terminate work item without Tech specs... only architectural work done
[chart: top mobile subscribers (China, US, Russia, India, Brazil, Japan), and PC vs. Mobile users in Japan (tight race)]
TY: Landscape: bandwidth, CPU,
memory all increased, mobile Linux emerged
... improving every day
... needs "spiral evolution" among services, content, and
end-users
... need more capabilities and ease of authoring
... obstacles in mobile use:
... encapsulation, multimodal contexts, and content
authoring
... generalization and superficial design cause problems,
environment changes faster than standards can catch up
... need to customize to user and context of the use case
... need to fill gap between general framework and real-world
uses
... need to solve authoring problems... intuitive interfaces,
better testing, and the problem of async processing between multiple
entities
... no conclusion, no silver bullet... industry would like to
see mobile multimodal apps
Raj: customers want more, but industry not delivering...
TY: yes, but it's also a matter of education... and mobiles are getting more powerful
raj: how is content authoring relevant to MMI... content-authoring vs. application-authoring
TY: same thing to me
Kaz: content as service
TY: browser is enabler
raj: are you still working with IBM on VoiceXML?
TY: well, the product is here
raj: browser vendors are not forthcoming about future capabilities
TY: we focus on narrow markets
Felix: you say encapsulation is a hindrance... does the audience think encapsulation is vital to architecture?
Skip: when you have a distributed environment, it's necessary... not so if it's all on same device
Raj: and encapsulation is also important for distributed authoring
<kaz> topics: encapsulation, authoring
Debbie: may be authored by different people as well as at different times... VoiceXML authors are specialized
TY: multimodal is close to user
interactions
... dependent on the interaction context and user
<kaz> topics: user/app context
[ Slides ]
ST: I developed Web Map...
distributed platform for maps using SVG
... SVG Map consortium in Japan
... maps are one of KDDI's main businesses
... mobile devices are more powerful, so we are converging to
One Web
... for both mobile phones and PCs
... Web Map is killer Web app
<kaz> topics: one web=web on mobile
ST: but functionality is mostly proprietary services
<kaz> topics: static server + light viewer
ST: SVG allows for direct use
rather than larger infrastructure
... allows for offline and online use... scalable
<kaz> topics: mobile as a robust infrastructure
ST: requirements for the Web are a
lightweight UA with rich functionality and a simple static
server
... if MMI should be a basic part of the WWW, it should also
satisfy these requirements
... needs to work on limited devices, standalone
... RDF integration can work on tiny computers if done well
DS: how are you using RDF on mobiles, which is verbose?
ST: microformats
DS: how are you using RDF?
ST: for geospatial
information
... developed in W3C
Debbie: are you using RDF instead of SCXML for Interaction Management, is that right?
ST: it would be integrated with
SCXML as the data model
... expressed in query
[ Slides ]
SM: [describes pen input]
SM: different modalities (pen,
voice) are good for different purposes/people
... light pen predates mouse by 5 years
... more interest recently because of tablet PCs
... touch is also exciting lately
... lots of variety in pen devices
... passive and active styluses (styli)
... lots of variation on writing surfaces (size, orientation,
etc.)
[slide of various pen/writing devices]
SM: touch has many commonalities
with pen input
... different operations and capabilities and
modes
... all indicated as different "channels"
... different UIs... pen-only, supplemented by pen, pen as
mouse
... pro: fine motor control
... con: limited by hand movements sequentially
<kaz> Pen functions: point, select, write, draw, gesture, sign, text, figure, add info to picture
SM: common functions: pointer,
selector, writing, drawing, gesture, sign...
... loosely broken down into modalities like tap, write,
gesture, draw
... including handwriting recognition
... can have buttons on barrel
... pen input as data... writing or drawing... uninterpreted in
raw form, can be interpreted if needed
... use cases: white boards, signatures
... pen input as annotation
... almost anything visual can be annotated
... "inline" overlaid, or "attached" as reference
... difficulty in precisely associating with source
content
... pen input as gesture
... can be used in combination with speech
... very popular form of input
... gestures can launch applications or trigger
functionality
... gestures are lightweight form of personalization (compared
to passwords, etc.)
<kaz> topics: gestures as commands for browsers
SM: gesturing can be
context-dependent or generalized... might use a gesture over a
shape to annotate it or do other things to it
... pen input for text recognition is not the most popular...
pen input for IMEs is more popular
... different interfaces can be tapping or pattern-based... can
learn via muscle memory
... can use autocomplete for partial character input
... "combinational gesture and selection"
<kaz> topics: error correction (=disambiguation) using GUI in addition to Ink
SM: sketch-based
interfaces...
... draw a rough square and it turns into a perfect square
... flow charting
... searching image repositories
[demos "Fly Pen"]
<kaz> topics: Pen devices and InkML
SM: Ink + Other Modalities
... ink as note taking is old idea
<kaz> topics: multimodal applications and W3C's MMI Architecture (including InkML)
SM: for lectures, whiteboard,
brainstorming, photo sharing
... writing while speaking
... drawing + speech
... speech changes mode
... Gesture + speech (interpreted)
... maps: "put that there"
... Integration: tight coupling
<kaz> ink information interpreted by the application
SM: challenging to determine mode
automatically... circle could be mouse movement, selection
gesture, O or 0... leave as ink?
... contextual inferences (text field is data, button is
GUI)
... loose coupling
... app doesn't interpret pen input...
... abstractions into mouse movement
... scalable, but no access to rich ink data
... what is a pen? finger? Wii (3D)? keitai (mobile phone) with
accelerometers?
[summarizes]
JK: gives intro... Wii, twiddler
scribe: accelerometers (spatial
translation), optical input (cameras...rotation and
translation)
... single-handed operation
... at sensor level, 2 approaches
JK: uses: gaming, list navigation, gestures, context recognition
scribe: degree of tilting can give natural gestural input
IK summarizes chart regarding use cases
scribe: continuous vs. discrete, relative vs. absolute
JK: Deutsche Telekom uses this in "MediaScout" to make suggestions for video/music content, allows you to rate it
scribe: voice and haptic
feedback
... uses camera to interpret movement, no additional hardware
needed
[video demo]
scribe: "device nodding" as
gesture
... for navigating and selecting in GUI
JK: implemented in Windows Mobile
scribe: GUI is HTML+Javascript
<karl> it reminds me of http://www.youtube.com/watch?v=A3MfQIswl3k
<karl> iBeer
<karl> ;)
JK: ActiveX detects device capabilities (inertial sensors vs. camera)
scribe: abstracts input out into event generation for predefined motion patterns
<karl> other interesting things with kinetics. Games - http://www.youtube.com/watch?v=Wtcys_XFnRA
JK: difficult to add new gestures
scribe: might be better to allow gesture libraries or gesture markup
JK: client-side interface now, but working on server-side option
scribe: using serverside
Interaction Manager using w3c MM Framework
... all browser-based
... calculation of system response important
... introduces latency
JK: client-side vs. server-side
scribe: client is faster, but
simpler, depends on device capabilities
... server more powerful, but introduces latency and needs
connection
... combination also possible
[discusses architecture]
JK: uses MMI-Lifecycle via EMMA structure... made small extensions
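[Illustrative sketch, not the presenter's actual extension: a gesture notification carrying an EMMA payload, loosely following the lifecycle-event style of the MMI Architecture drafts. The namespace is the one used in later drafts; element, attribute, and event names are illustrative only.]
<mmi:ExtensionNotification xmlns:mmi="http://www.w3.org/2008/04/mmi-arch"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    mmi:Context="session-1" mmi:Source="gestureMC-1" mmi:Target="IM-1"
    mmi:RequestID="req-42" mmi:Name="gesture">
  <mmi:Data>
    <emma:emma version="1.0">
      <!-- the abstracted motion event (up/down/left/right) mentioned below -->
      <emma:interpretation id="g1" emma:medium="tactile" emma:mode="gesture">
        <direction>left</direction>
      </emma:interpretation>
    </emma:emma>
  </mmi:Data>
</mmi:ExtensionNotification>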
JK: not only kinesthetic, but haptic responses (vibration)
scribe: proprietary
extension
... provides feedback for gesture recognition
JK: open issues
scribe: latency
... pattern-matching rules
JK: conclusion
scribe: did study, users dig
it
... how to bring this functionality into today's browsers
... some trust issues
... how to bring into Web applications?
raj: how do you process multiple
gestures in event sequence, with regards to multiple starts
before an end
... especially regarding combinations with voice input?
JK: right now, only the abstracted event (up, down, left, right) is sent up to server for processing... based on pattern recognition
raj: you might introduce disparity between gesture events and synchronized voice input
JK: to some degree, can be done in parallel, where end point is when the user stops moving
KJ: how can you make distinction between gestures and noisy movement?
JK: yeah, that's a problem
KJ: might have a threshold for movement
DS: is there a mode selector
JK: yes, a "push-to-move" button
Kaz: also can be done with speech
DS: does the camera use a lot of battery power? can you use the camera at a lower-level software than a move?
FM: not at this time, but we're looking at it
SM: why cameras vs. inertial? and what's haptic vs kinesthetic?
JK: camera is already in phone
KM: haptic is touch (feedback),
kinesthetic is movement (interface and input)
... are people interested in this problem? games and such use
proprietary means to do this
DS: yes, but we need to make these events available in specs that desktop browsers are implementing, with limited scope
Karl: let's reuse existing components
JK: we had to do this in proprietary way... no standard
<karl> http://lifehacker.com/software/gaming/gaming-with-the-macbooks-accelerometer-181203.php
MG: now we expose hardware interfaces through ActiveX control
scribe: if you don't want to use
ActiveX, the OS vendor has to expose these some other way...
... do you plan to do your interface in Symbian, for
example?
JK: not at this time
FM: Android OS is supposed to expose this as well
MG: I'll give feedback from this workshop to our PM in China
raj: any toolkits you know for QBI (querying by image)?
SM: the most advanced ones are not open... still being explored
IK: are there any of these interfaces using InkML?
SM: most matching is done by nearest-neighbor algorithm... may need to have mediating format for InkML
KJ: fascinating, but from a user
point of view, how can each of these modalities be optimized
per task?
... would be nice to have a study
SM: there's been some research
into speech+gesture, with some categorization of people who use
each or both... depends on cognitive mode
... people use modality they are most comfortable with
... saw a paper on wheelchair-bound pen-control of a smart
house, but most users used modes they were more familiar with,
like voice
... task complexity has a lot to do with it... switching on a
light is easy with a pen, but using a map to query for
direction is a more complex task
... it would be best to create models per user
... one objective of this group is to integrate different
modalities into standard interfaces
raj: the MMI architecture is not just for speech or pen or other specific modalities, but is open to different modalities, based on lifecycle event
Debbie: we need to be higher
level, operating on notification basis
... we don't want to define universal APIs
Skip: if we can't do this
standardizing, then someone else will
... it's useless unless devices standardize
Debbie: existing speech APIs are
diverse
... hard to standardize
<kaz> MRCP is a protocol for speech interfaces
Skip: we have to work at a device-independent level
Debbie: might be outside the scope of MMI
Skip: I think someone needs to do it, not sure who
[ Slides ]
dai: IBM is interested in MMI
dai: I/O device capability
... mobile/desktop/phone
... application use case
... navigation on mobile phone
... framework: client/IM/voice server/web server
... requirements: how to describe device capability
... how to transfer that information to apps
... proposal: device capability definition
... some XML specification
... proposal2: dynamic modality change
... Re-initialization of device
... user setting/preference
... modality change trigger
... interface between device and MC
... proposal3: lifecycle event: Data event should contain
device state change
... proposal4: device dependent state control
... summary:
... dynamic switching of state control
... esp. timing of behavior change
... session control
raj: not sure how this proposal is related to DCCI
ingmar: client driven change of modality etc.
raj: device status needs to be
watched then
... asynchronously
... in addition, need to have device context/capability through the
IM
ingmar: device capability ontology activity by UWA WG might be useful
raj: for device controller
debbie: how to make the IM
responsible for device management?
... do all kinds of transitions/information on device capability
need to be transferred to the IM?
raj: you don't need to communicate all the details to IM (to the presenter)
ingmar: the point is how to handle device capability with MMI Architecture and its event
debbie: please revisit the figure mentioning the events
(presenter moves to the page)
kaz: what is your concrete
proposal/idea on device capability management?
... who do you think should manage that information
... and how?
raj: current DCCI has no support on how to transfer device information to IM etc.
debbie: this picture seems a nice use of events for device information
doug: related to DOM
raj: that's more implementation question
<ddahl> actually, i think the StatusRequest event might be a good way to query devices about their capabilities
doug: curious how browser vendors are interested in this
topics: how to integrate device capability handling into web browsers
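[Illustrative sketch following up on ddahl's StatusRequest remark, not a proposal from the talk: an IM polling a modality component for device state, loosely modeled on the lifecycle events in the MMI Architecture drafts; all names and the device payload are invented.]
<!-- IM asks the handset modality component for its status -->
<mmi:StatusRequest xmlns:mmi="http://www.w3.org/2008/04/mmi-arch"
    mmi:Context="session-1" mmi:Source="IM-1" mmi:Target="handsetMC-1"
    mmi:RequestID="status-7"/>
<!-- hypothetical response carrying device state in the payload -->
<mmi:StatusResponse xmlns:mmi="http://www.w3.org/2008/04/mmi-arch"
    mmi:Context="session-1" mmi:Source="handsetMC-1" mmi:Target="IM-1"
    mmi:RequestID="status-7" mmi:Status="alive">
  <mmi:Data>
    <device battery="23%" network="roaming" muted="true"/>
  </mmi:Data>
</mmi:StatusResponse>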
[ Slides (TBD) ]
kunio: various modalities
... home network
... existing protocol: DLNA, HDMI
... 10 foot UI
... something better than text input needed: voice, ink?
... digital TV
... evaluation: key stroke level model, speech act theory
... digital TV menu
... summary:
... mobile phone novel: Japanese layout?
... digital TV interfaces
tsuneo: digital TV is important use case
kris: interested in speech act theory
kunio: it's referred to for Web-based manuals for that purpose
srig: (scribe missed srig's comment)
<ddahl> i think that's the sense he's using
kaz: interested in evaluation
kunio: basically keystroke based
evaluation
... maybe need 2-3 years, though
kaz: do you think there is any Japanese specific specification/mechanism for MMI?
kunio: inputting JP characters (Kanji
etc.) is difficult
... Ruby or vertical writing might also be difficult
felix: a scenario for one-handed control might be useful, considering we need to hold the device with one hand and can use only the other hand for input
kaz: I will go through the issues
raised during the two days discussions.
... do you want to add things to this list?
debbie: do you think that dcci issues are part of it?
kaz: good point.
debbie: we should do
metacategories for this list.
... and we also talked about APIs like modality APIs
... something like life cycle events, EMMA.
raj: we want to add also two
things
... 1. Interaction Manager
... 2. somethingA
debbie: can you add Encapsulation
and tight/loose coupling
kaz: We can categorize the topics
and discuss some of them.
... We could create an authoring category
kaz lists the categories given by participants on the whiteboard
* Authoring
* Integration and APIs
* Distribution and Architecture
* User experience and accessibility
ingmar: We wanted, for example, to know how to describe motion components, which create a modality by themselves.
debbie: Speech grammar is out of scope
kaz: then maybe it should be a
modality-component-specific grammar.
... where should we put it? User experience?
debbie: isn't it authoring?
[kaz continues to go through the list of issues]
http://esw.w3.org/topic/MMIWorkshop
skip: the Interaction Manager is
a messaging infrastructure
... How do you discover particular components?
... How does a component advertise itself?
... part of the issue is a publishing/advertisement messaging
infrastructure framework.
... GMS has one for example
... and we do not have to define it at all because it is
defined in other standards.
http://www.w3.org/TR/dd-landscape/
debbie: are these the kinds of things that device descriptions already give?
raj: hmm not only
skip: It is not only about
technical things like screen, but also audio, smell, haptic,
etc.
... some modality components have different samplings
... and depending on the device you want to change the
sampling of the modalities
debbie: I'm not sure if it's a
good match
... but the ubiquitous group has defined ontologies of
devices
... we should really look at the work of this group
skip: There is a lot of diversity in the description of components and modalities
kaz: Do you think that device capabilities management should be part of the Architecture?
skip: for example, an audio device can record a sound file. And that's just it
ingmar: I would put it in the architecture domain
kaz: topic on synchronization
http://www.w3.org/2001/di/IntroToDI.html
An Introduction to Device Independence
debbie: I thought I heard something about persistence.
raj: I need to know the
preferences of each user
... the interaction manager requests how the data should be
presented.
sriganesh: if you want to change the full user experience, it has to be dealt with at the interaction manager level.
raj: user preferences could
affect more than one device
... sequences are business logic, it is not at the application
level
skip: The interaction manager has to know the user preferences anyway
raj: I have a different idea of
an interaction manager
... for me it is only a message broker between different
modalities components.
skip: Then we need an application layer in the system
debbie: Everything is intertwined now; we want to figure out where the different parties are located. The information from the applications has to live somewhere.
ingmar: what is the interaction manager?
raj: we have to define what you mean by interaction manager
speakerA explains what he thinks about interaction management
debbie: This discussion shows that we have to pin down what interaction manager means
[participants explain their different ideas of what an interaction manager is]
shepazu: could you define what lifecycle events are
debbie: stop, pause, forward,
etc.
... we are busy working on how the modality components work
shepazu: a tutorial should help to understand the technology
kaz: what do you mean by "tutorial"?
... primer kind of document for implementors/authors?
shepazu: something which explains the vocabulary.
debbie: did you read the spec?
shepazu: I skimmed through
debbie: Ingmar is working on something.
karl: a 101 document would be very good.
debbie: yes, we have had that request a few times, and we might not be understood by other people
raj: it has happened that people didn't
understand what we said.
... ambiguities for example.
[ Issue list gathered through this workshop ]
kaz goes through the list
Debbie: need to discuss: what is
in the interaction manager?
... Kristiina has pulled out some functions of the IM
... Prof. Nitta had some functions too.
Florian: Depending on your
approach (e.g. hierarchical, agent-based)
... you will have different approaches. We might not find
consensus here
Debbie: just want to list functions, not evaluate them
Kristiina: we know from current
applications that there is a need for discussion of the IM
... e.g. application specific IMs
... application and interaction data is closely connected
... however they need to be separate components / modules
Debbie: message routing also an issue
Ellis: for me there is message router, sensor, thing which generates messages at application level
Kristiina: need space in
architecture and implementations, since each application needs
some specifics
... we should not decide things too closely
Debbie: "modality components communicating directly" issue?
Ellis: depends on (physical) distance between components whether they can (should) communicate directly
Raj: application author decides this
Ellis: MM component could
subscribe to a specific event of a different component
... depends on the intelligence of the component(s)
Debbie: accessibility topic?
Raj: it's an authoring
issue
... may not always conform to accessibility guidelines
Ellis: let the user pick his modality
Doug: e.g. in a noisy
environment
... the more identical the modalities are, the easier they are to
author
Ellis: the developer specifies high-level requirements, e.g. "give account balance", and modality components implement that at a lower level
Debbie: we talked about time outs and accessibility, don't remember other aspects of accessibility
Karl: haptic and kinetics accessibility issues?
Debbie: yes, like mouse click problems
Karl: or voice problems
Ellis: modality component should deal with that
srig: standard names for input and output modalities
doug: don't know how to integrate
the work of MMI with my world (e.g. DOM, SVG)
... e.g. events in DOM
Ellis: VoiceXML could generate the necessary events
Debbie: a VoiceXML 2.1 browser can be controlled by CCXML
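[Illustrative sketch, not from the discussion: a CCXML fragment that launches a VoiceXML 2.1 dialog when a call connects; the file name and log message are invented, only the CCXML elements themselves come from that spec.]
<ccxml version="1.0" xmlns="http://www.w3.org/2002/09/ccxml">
  <eventprocessor>
    <transition event="connection.connected" name="evt">
      <!-- start a VoiceXML dialog on the connected call -->
      <dialogstart src="'getBalance.vxml'" connectionid="evt.connectionid"/>
    </transition>
    <transition event="dialog.exit">
      <log expr="'VoiceXML dialog finished'"/>
    </transition>
  </eventprocessor>
</ccxml>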
Doug: how will a script author
approach that?
... a person talks into a phone, and I print it on the
screen
... we have all the pieces, need to put them together
Ellis: present the scenario to the
VoiceXML WG
... they have all the pieces to put this together
Debbie: voice browsers have to
respect inline markup
... currently everything needs to be in a VoiceXML form
Doug: how to integrate InkML? How
to integrate VoiceXML? How to describe device
capabilities?
... these questions are important for web application authors
(our main market)
... serialization of InkML in SVG might be interesting
Ellis: VoiceXML needs granularity, but it would also be good to send SVG to a recognizer and get the text back
Doug: having an abstract layer would be good for the application author
Ellis: yes, he could enable all
modality components
... the user speaks, writes with a pen, etc., the words "get my
account balance"
... user preferences show how the user wants the information
back
... that's the ideal architecture
... components send a semantic event up, business logic
processing happens, and the result is sent back
Kristiina: you are talking a lot
about text, but we are dealing with a lot of non-textual
information
... people may take pictures and send out photographs
... e.g. museum application
Ellis: picture of three people,
saying "who is who?". I send it to my appl. and get the
information back
... for me that's the same way of processing
Debbie: EMMA provides the semantics
Doug: would like to have REX 2 to send events
Debbie: not sure about the status
of REX
... original REX has some problems
Ellis: two different
entities:
... modality component takes handwriting and voice
... but recognition algorithms could be independent of these
components
Debbie: looks like a best practice for
specific systems
... but not something general enough for the MMI architecture
Ellis: Doug's use case is just
to send information back and forth without thinking of the
details
... at the business logic level
srig: the application manager can already send scripts to modality components, so we provide this functionality already, no?
Debbie: the MMI architecture allows you to have more fine-grained voice components, it's just not part of VoiceXML
Ellis: would like to be independent of specific modality components
Debbie: so a light-weight API
Ellis: yes
Doug: not necessary to send events to DOM, but other things are possible too
Ellis: I like idea of
light-weight interface
... e.g. one simple process of recognition, conversion, TTS
etc
Doug: with such an API you should be able to say "I will give you this, I will get back that"
Felix: question on time line
looking at http://www.w3.org/2006/12/mmi-charter.html
Karl: did you look at what is existing in the industry?
Debbie: looked at the Galaxy
architecture, developed in the 90s (DARPA)
... not many people publish commercial MM architectures
... there was SALT and something else at the beginning of MMI,
but that did not fit
... because of need to work distributed and have extensibility
for new modalities
Raj: browser-centered view of the universe is not flying for us
Kristiina: agree with the last
point.
... e.g. many universities have built their own systems, it is
interesting to compare these
Kaz: concludes the workshop by explaining the possible next steps as follows
[Workshop ends]
The Call for Participation, the Logistics, the Presentation Guideline and the Agenda are also available on the W3C Web server.
Deborah Dahl and Kazuyuki Ashimura, Workshop Co-chairs
$Id: minutes.html,v 1.8 2007/12/10 12:58:56 ashimura Exp $