Multimodal interaction and XForms
Giving users a chance to choose between using their ears or
eyes, their voice or their fingers
Dave Raggett <dsr@w3.org>
W3C Fellow on assignment from Openwave Systems
background photo by Gerald Saunders, see www.wetsand.com
Multimodal Interaction
Enabling users to access the Web via multiple
modes of interaction
Why?
- Allow people to choose when to use their eyes, hands, ears and
mouths
- Speaking is so much easier than thumbing in text, and
useful when you can't see the screen clearly
- Sometimes speech is not the answer - it's too noisy, or it's
inappropriate to speak
- Complementing the transient nature of speech with longer-lived
visual information
- Ink: The use of a stylus for text input, gestures, specialized
notations such as math, music and chemistry, and for diagrams and
artwork
Example use cases
- Name dialling - see who you want to call
- Seeing and hearing as means to browse messages
- Driving directions
- Search, e.g. for flight information
Natural language as a user interface
Natural language has the potential to offer a compelling
alternative to conventional user interfaces, for example:
Computer: Welcome to Joe's Pizza
Computer: What would you like?
User: I would like two large pizzas with mozzarella
and one small pizza with tomatoes and anchovies
Computer: Would you like any drinks with that?
User: Sure, 3 large diet cokes, oh and add
pepperoni to the large pizzas
Computer: Is that all?
User: yes
Computer: Okay, that will be ready for you
in 5 minutes
User: thanks
This is impractical today because of the difficulty of
anticipating all the ways people could respond; more flexible
natural language understanding is needed.
A more realistic example
Current applications use carefully chosen prompts
designed to elicit simple responses, for example:
Computer: Welcome to Joe's Pizza ordering service
Computer: Select pizza size from large, medium
or small?
User: large
Computer: How many of these pizzas do you want?
User: two
Computer: Select first topping from mozzarella,
pepperoni and anchovies?
User: mozzarella
Computer: Do you want another topping, yes or no?
User: yes
Computer: Select second topping from mozzarella,
pepperoni and anchovies?
User: pepperoni
Computer: Do you want any other pizzas, yes or no?
...
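In VoiceXML, such a directed dialog might be written roughly as follows (a sketch only; the field names and grammar URI are illustrative, not taken from any real application):

  <form id="pizza">
    <field name="size">
      <prompt>Select pizza size from large, medium or small</prompt>
      <grammar src="size.grxml" type="application/srgs+xml"/>
    </field>
    <field name="quantity" type="number">
      <prompt>How many of these pizzas do you want?</prompt>
    </field>
    <!-- further fields for toppings, drinks, etc. would follow;
         the form interpretation algorithm visits each unfilled field in turn -->
  </form>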
Kinds of multimodal interaction
Interaction limited to a specific modality
User chooses which modality to employ:
- "Make your selection?" (tap on menu or say choice)
- "Please type or say your account number?"
Input involves the joint use of multiple modalities:
- "How far is it to here?" (tapping on a map location)
User preferences and situational context
- In some circumstances you want your PDA to stay quiet
- It's sunny and the color screen is impossible to read
- When the car is in motion, it is unsafe to operate the keypad
- Your current location is made available to an application
- Selected personal details are provided to an application
Effect of varying capabilities on architecture
Desktop computer with integrated large vocabulary speech
recognition capabilities
- Everything occurs locally, with the network used to retrieve
resources
PDA with limited storage
- PDA provides browser for XHTML plus speech extensions
- Network based speech recognition and synthesis
- Interpretation of recognition results occurs locally
Low-end phone without any speech recognition capabilities
- Phone provides XHTML browser
- Network based speech recognition and synthesis
- Interpretation of recognition results occurs remotely
Other possibilities include using local recognition for standard
navigation and control commands (ETSI STF 182) together with remote
speech services for other purposes
- Can we hide this split from authors?
Events as the glue between modalities
XML Events can be harnessed to link devices and modalities
- Tapping on a button initiates an event
- Event listener receives event and handles it appropriately
- Events can be used to trigger audio prompts and to enable
speech recognition for a given grammar (see the sketch below)
- Recognition events can be used to stop a prompt (barge-in)
- Events can be used to change XForms instance data, to update the
user interface, to load a new page, and so on
- Extensible event model allows for application defined events
- SMIL timing model allows for events to be triggered by timed offsets
from other events
- Time stamps are needed for resolving synchronization issues
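As a rough sketch of how these pieces might fit together in markup (assuming XForms 1.0 actions with XML Events attributes; the speech:prompt and speech:listen elements are hypothetical extensions, not part of any W3C specification):

  <xforms:trigger>
    <xforms:label>Order</xforms:label>
    <xforms:action ev:event="DOMActivate">
      <!-- update the instance data when the button is activated -->
      <xforms:setvalue ref="/order/status">pending</xforms:setvalue>
      <!-- hypothetical: play a prompt and enable recognition against a grammar -->
      <speech:prompt src="prompts/confirm.wav"/>
      <speech:listen grammar="grammars/confirm.grxml"/>
    </xforms:action>
  </xforms:trigger>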
Distributed architecture for events
- Today, XML Events are limited to a single device
- We need to extend events to operate across the Web
- A given event may have listeners on multiple devices
- W3C is focussing on the author's perspective, leaving transport
details to other organizations (IETF, 3GPP, etc.)
- Low-level transport as XML messages over SIP, with a session-based
register/notify model
- W3C needs to define the XML message format for a range of events,
based upon a study of common use cases (see the sketch after the
examples below)
Example events:
- button click: the event needs to convey which button was clicked
- speech recognition: the interpretation event conveys what actions are
needed, e.g. setting the values of one or more form fields
- change to instance data: the user interface listens so that the page
is refreshed to reflect the change
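Purely as an illustration of the kind of information such a message might need to carry (no such format has been defined; every element and attribute name below is hypothetical):

  <event name="interpretation" session="s42" page="pizza-order"
         timestamp="2002-10-21T14:32:05Z">
    <!-- the recognizer's interpretation, expressed as updates to form fields -->
    <setvalue field="size">large</setvalue>
    <setvalue field="quantity">2</setvalue>
  </event>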
Giving authors control over synchronization
- Imagine having the ability to make a choice via speech or
via a set of radio buttons.
- If you speak and click, which choice wins?
- If state is stored in one place only, then a simple approach
suffices (see the sketch below)
- If state is duplicated in more than one device, then time stamps
can be used to ensure state is maintained consistently
- Authors may want control over synchronization, so it should be
possible to override default behavior
- Session and page identifiers may be needed in addition to simple
time stamps, to prevent delivery of an event to the wrong handler
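For instance, a sketch in which both modalities write to the same instance node, so whichever input arrives last simply overwrites the value (assuming an XForms select1 rendered as radio buttons; the speech:grammar element is a hypothetical extension):

  <xforms:select1 ref="/order/size" appearance="full">
    <xforms:label>Pizza size</xforms:label>
    <xforms:item>
      <xforms:label>Large</xforms:label>
      <xforms:value>large</xforms:value>
    </xforms:item>
    <xforms:item>
      <xforms:label>Medium</xforms:label>
      <xforms:value>medium</xforms:value>
    </xforms:item>
    <xforms:item>
      <xforms:label>Small</xforms:label>
      <xforms:value>small</xforms:value>
    </xforms:item>
    <!-- hypothetical: a recognition result also sets /order/size -->
    <speech:grammar src="grammars/size.grxml"/>
  </xforms:select1>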
Dialog Models
Synchronous versus asynchronous input:
- Prompted input: the user is expected to respond within
a given period of time; the recognizer is active for a limited
time and raises a "no input" event if the user hasn't
responded by then
- Unprompted input: the user is free to say a command at
any time, and the recognizer remains active indefinitely
Error recovery mechanisms:
- User doesn't respond in time
- User's response cannot be understood
- User's response is incorrect
Goal-driven dialogs:
- Direct the user to fill out a form, field by field (the VoiceXML
form interpretation algorithm, FIA)
- The XForms relevancy mechanism determines which fields need to be
completed, based upon earlier input (as sketched below)
- What kinds of markup are needed for multimodal authoring?
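Two small sketches of the mechanisms mentioned above, one in VoiceXML and one in XForms (field and element names are illustrative only):

  <!-- VoiceXML error recovery within a field -->
  <field name="size">
    <prompt>Select pizza size from large, medium or small</prompt>
    <grammar src="size.grxml" type="application/srgs+xml"/>
    <noinput>Sorry, I didn't hear you. <reprompt/></noinput>
    <nomatch>Sorry, I didn't understand that. <reprompt/></nomatch>
  </field>

  <!-- XForms relevancy: the drink field only needs to be completed
       once earlier input says a drink is wanted -->
  <xforms:bind nodeset="/order/drink" relevant="/order/wantsDrink = 'yes'"/>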
Actions as conditional changes to instance data
XForms provides the means to query instance data via XPath
and to alter it via mutation operators
User: I would like two large pizzas with mozzarella
and one small pizza with tomatoes and anchovies
Computer: Would you like any drinks with that?
User: Sure, 3 large diet cokes, oh and add
pepperoni to the large pizzas
- The first response creates instance data representing
an order for three pizzas (sketched below)
- The second response references earlier choices and
alters the order
Speech interpretation yields if/then actions on instance
data
Both the "if" condition and the "then" action need
access to a wider context than the XForms instance data, for example
user preferences, system settings, the user interface and more.
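A sketch of the instance data the first response might create (the element names are illustrative, not taken from any specification); the second response would then add a pepperoni topping to the large pizzas and append the drinks:

  <order>
    <pizza size="large" quantity="2">
      <topping>mozzarella</topping>
    </pizza>
    <pizza size="small" quantity="1">
      <topping>tomatoes</topping>
      <topping>anchovies</topping>
    </pizza>
  </order>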
XForms and multimodal interaction
- Multimodal interaction brings the added complexity of a distributed architecture
- XML Events can be used to initiate actions and to convey notifications
when actions have occurred
- Speech can update many fields at once and is qualitatively different
from familiar graphical user interfaces
- Opportunities for declarative dialog modelling, via multimodal
extensions to VoiceXML, or new approaches
- Error recovery is an essential aspect for speech based systems
- Electronic ink offers exciting new opportunities for the Web
- W3C's Multimodal Interaction working group has only recently started,
and it will take quite a while to study the use cases and develop
the corresponding specifications. Your help is welcomed; see
http://www.w3.org/2002/mmi
for more details
Testimonial to the power of this technology