Multimodal interaction and XForms
Giving users a chance to choose between using their ears or
eyes, their voice or their fingers
Dave Raggett <dsr@w3.org>
W3C Fellow on assignment from Openwave Systems
background photo by Gerald Saunders, see www.wetsand.com
Multimodal Interaction
Enabling users to access the Web via multiple
modes of interaction
Why?
- Allow people to choose when to use their eyes, hands, ears and
mouths
- Speaking is so much easier than thumbing in text, and
useful when you can't see the screen clearly
- Sometimes speech is not the answer - it's too noisy, or it's
inappropriate to speak
- Complementing the transient nature of speech with longer-lived
visual information
- Ink: The use of a stylus for text input, gestures, specialized
notations such as math, music and chemistry, and for diagrams and
artwork
Example use cases
- Name dialling - see who you want to call
- Seeing and hearing as means to browse messages
- Driving directions
- Search, e.g. for flight information
Natural language as a user interface
Natural language has the potential to offer a compelling
alternative to conventional user interfaces, for example:
Computer: Welcome to Joe's Pizza
Computer: What would you like?
User: I would like two large pizzas with mozzarella
and one small pizza with tomatoes and anchovies
Computer: Would you like any drinks with that?
User: Sure, 3 large diet cokes, oh and add
pepperoni to the large pizzas
Computer: Is that all?
User: yes
Computer: Okay, that will be ready for you
in 5 minutes
User: thanks
This is impractical today because of the difficulty of
anticipating all the ways people could respond; more flexible
natural language understanding is needed.
A more realistic example
Current applications use carefully chosen prompts
designed to elicit simple responses, for example:
Computer: Welcome to Joe's Pizza ordering service
Computer: Select pizza size from large, medium
or small?
User: large
Computer: How many of these pizzas do you want?
User: two
Computer: Select first topping from mozzarella,
pepperoni and anchovies?
User: mozzarella
Computer: Do you want another topping, yes or no?
User: yes
Computer: Select second topping from mozzarella,
pepperoni and anchovies?
User: pepperoni
Computer: Do you want any other pizzas, yes or no?
...
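In VoiceXML, such a directed dialog might be written roughly as follows (a sketch only; the field names and grammar URI are illustrative, not taken from any real application):

  <form id="pizza">
    <field name="size">
      <prompt>Select pizza size from large, medium or small</prompt>
      <grammar src="size.grxml" type="application/srgs+xml"/>
    </field>
    <field name="quantity" type="number">
      <prompt>How many of these pizzas do you want?</prompt>
    </field>
    <!-- further fields for toppings, drinks, etc. would follow;
         the form interpretation algorithm visits each unfilled field in turn -->
  </form>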
Kinds of multimodal interaction
Interaction limited to a specific modality
User chooses which modality to employ:
- "Make your selection?" (tap on menu or say choice)
- "Please type or say your account number?"
Input involves the joint use of multiple modalities:
- "How far is it to here?" (tapping on a map location)
User preferences and situational context
- In some circumstances you want your PDA to stay quiet
- It's sunny and the color screen is impossible to read
- When the car is in motion, it is unsafe to operate the keypad
- Your current location is made available to an application
- Selected personal details are provided to an application
Effect of varying capabilities on architecture
Desktop computer with integrated large vocabulary speech
recognition capabilities
- Everything occurs locally, with the network used to retrieve
resources
PDA with limited storage
- PDA provides browser for XHTML plus speech extensions
- Network based speech recognition and synthesis
- Interpretation of recognition results occurs locally
Low-end phone without any speech recognition capabilities
- Phone provides XHTML browser
- Network based speech recognition and synthesis
- Interpretation of recognition results occurs remotely
Other possibilities include using local recognition for standard
navigation and control commands (ETSI STF 182) together with remote
speech services for other purposes
- Can we hide this split from authors?
Events as the glue between modalities
XML Events can be harnessed to link devices and modalities
- Tapping on a button initiates an event
- Event listener receives event and handles it appropriately
- Events can be used to trigger audio prompts and to enable
speech recognition for a given grammar (see the sketch below)
- Recognition events can be used to stop a prompt (barge-in)
- Events can be used to change XForms instance data, to update the
user interface, to load a new page, and so on
- Extensible event model allows for application defined events
- SMIL timing model allows for events to be triggered by timed offsets
from other events
- Time stamps are needed for resolving synchronization issues
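As a rough sketch of how these pieces might fit together in markup (assuming XForms 1.0 actions with XML Events attributes; the speech:prompt and speech:listen elements are hypothetical extensions, not part of any W3C specification):

  <xforms:trigger>
    <xforms:label>Order</xforms:label>
    <xforms:action ev:event="DOMActivate">
      <!-- update the instance data when the button is activated -->
      <xforms:setvalue ref="/order/status">pending</xforms:setvalue>
      <!-- hypothetical: play a prompt and enable recognition against a grammar -->
      <speech:prompt src="prompts/confirm.wav"/>
      <speech:listen grammar="grammars/confirm.grxml"/>
    </xforms:action>
  </xforms:trigger>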
Distributed architecture for events
- Today, XML Events are limited to a single device
- We need to extend events to operate across the Web
- A given event may have listeners on multiple devices
- W3C is focussing on the author's perspective, leaving transport
details to other organizations (IETF, 3GPP, etc.)
- Low-level transport as XML messages over SIP, with a session-based
register/notify model
- W3C needs to define the XML message format for a range of events,
based upon a study of common use cases (see the sketch after the
examples below)
Example events:
- button click: the event needs to convey which button was clicked
- speech recognition: the interpretation event conveys what actions are
needed, e.g. setting the values of one or more form fields
- change to instance data: the user interface listens so that the page
is refreshed to reflect the change
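Purely as an illustration of the kind of information such a message might need to carry (no such format has been defined; every element and attribute name below is hypothetical):

  <event name="interpretation" session="s42" page="pizza-order"
         timestamp="2002-10-21T14:32:05Z">
    <!-- the recognizer's interpretation, expressed as updates to form fields -->
    <setvalue field="size">large</setvalue>
    <setvalue field="quantity">2</setvalue>
  </event>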
Giving authors control over synchronization
- Imagine having the ability to make a choice via speech or
via a set of radio buttons.
- If you speak and click, which choice wins?
- If state is stored in one place only, then a simple approach
suffices (see the sketch below)
- If state is duplicated in more than one device, then time stamps
can be used to ensure state is maintained consistently
- Authors may want control over synchronization, so it should be
possible to override default behavior
- Session and page identifiers may be needed in addition to simple
time stamps, to prevent delivery of an event to the wrong handler
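For instance, a sketch in which both modalities write to the same instance node, so whichever input arrives last simply overwrites the value (assuming an XForms select1 rendered as radio buttons; the speech:grammar element is a hypothetical extension):

  <xforms:select1 ref="/order/size" appearance="full">
    <xforms:label>Pizza size</xforms:label>
    <xforms:item>
      <xforms:label>Large</xforms:label>
      <xforms:value>large</xforms:value>
    </xforms:item>
    <xforms:item>
      <xforms:label>Medium</xforms:label>
      <xforms:value>medium</xforms:value>
    </xforms:item>
    <xforms:item>
      <xforms:label>Small</xforms:label>
      <xforms:value>small</xforms:value>
    </xforms:item>
    <!-- hypothetical: a recognition result also sets /order/size -->
    <speech:grammar src="grammars/size.grxml"/>
  </xforms:select1>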
Dialog Models
Synchronous versus asynchronous input:
- Prompted input: the user is expected to respond within
a given period of time; the recognizer is active for a limited
time and raises a "no input" event if the user hasn't
responded by then
- Unprompted input: the user is free to say a command at
any time, and the recognizer remains active indefinitely
Error recovery mechanisms:
- User doesn't respond in time
- User's response cannot be understood
- User's response is incorrect
Goal-driven dialogs:
- Direct the user to fill out a form, field by field (the VoiceXML
form interpretation algorithm, FIA)
- The XForms relevancy mechanism determines which fields need to be
completed, based upon earlier input (as sketched below)
- What kinds of markup are needed for multimodal authoring?
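Two small sketches of the mechanisms mentioned above, one in VoiceXML and one in XForms (field and element names are illustrative only):

  <!-- VoiceXML error recovery within a field -->
  <field name="size">
    <prompt>Select pizza size from large, medium or small</prompt>
    <grammar src="size.grxml" type="application/srgs+xml"/>
    <noinput>Sorry, I didn't hear you. <reprompt/></noinput>
    <nomatch>Sorry, I didn't understand that. <reprompt/></nomatch>
  </field>

  <!-- XForms relevancy: the drink field only needs to be completed
       once earlier input says a drink is wanted -->
  <xforms:bind nodeset="/order/drink" relevant="/order/wantsDrink = 'yes'"/>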
Actions as conditional changes to instance data
XForms provides the means to query instance data via XPath
and to alter it via mutation operators
User: I would like two large pizzas with mozzarella
and one small pizza with tomatoes and anchovies
Computer: Would you like any drinks with that?
User: Sure, 3 large diet cokes, oh and add
pepperoni to the large pizzas
- The first response creates instance data representing
an order for three pizzas (sketched below)
- The second response references earlier choices and
alters the order
Speech interpretation yields if/then actions on instance
data
Both the "if" condition and the "then" action need
access to a wider context than the XForms instance data, for example
user preferences, system settings, the user interface and more.
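A sketch of the instance data the first response might create (the element names are illustrative, not taken from any specification); the second response would then add a pepperoni topping to the large pizzas and append the drinks:

  <order>
    <pizza size="large" quantity="2">
      <topping>mozzarella</topping>
    </pizza>
    <pizza size="small" quantity="1">
      <topping>tomatoes</topping>
      <topping>anchovies</topping>
    </pizza>
  </order>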
XForms and multimodal interaction
- Multimodal interaction brings the added complexity of a distributed architecture
- XML Events can be used to initiate actions and to convey notifications
when actions have occurred
- Speech can update many fields at once and is qualitatively different
from familiar graphical user interfaces
- Opportunities for declarative dialog modelling, via multimodal
extensions to VoiceXML, or new approaches
- Error recovery is an essential aspect for speech based systems
- Electronic ink offers exciting new opportunities for the Web
- W3C's Multimodal Interaction working group has only recently started,
and it will take quite a while to study the use cases and develop
the corresponding specifications. Your help is welcomed; see
http://www.w3.org/2002/mmi
for more details
Testimonial to the power of this technology