Multimodal interaction and XForms

Giving users a chance to choose between using their ears or eyes, their voice or their fingers

Dave Raggett <dsr@w3.org>

W3C Fellow on assignment
from Openwave Systems





background photo by Gerald Saunders, see www.wetsand.com

Multimodal Interaction

Enabling users to access the Web via multiple modes of interaction

Why?

Example use cases

Natural language as a user interface

Natural language has the potential to offer a compelling alternative to conventional user interfaces, for example:

Computer: Welcome to Joe's Pizza
Computer: What would you like?
User: I would like two large pizzas with mozzarella
and one small pizza with tomatoes and anchovies
Computer: Would you like any drinks with that?
User: Sure, 3 large diet cokes, oh and add
 pepperoni to the large pizzas
Computer: Is that all?
User: yes
Computer: Okay, that will be ready for you
 in 5 minutes
User: thanks

This is impractical today due to the difficulty of anticipating all the ways people could respond. More flexible natural language understanding is needed to deal with this.

A more realistic example

Current applications use carefully chosen prompts designed to elicit simple responses, for example:

Computer: Welcome to Joe's Pizza ordering service
Computer: Select pizza size from large, medium
 or small?
User:  large
Computer: How many of these pizzas do you want?
User: two
Computer: Select first topping from mozzarella,
 pepperoni and anchovies?
User: mozzarella
Computer: Do you want another topping, yes or no?
User: yes
Computer: Select second topping from mozzarella,
 pepperoni and anchovies?
User: pepperoni
Computer: Do you want any other pizzas, yes or no?
...   
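A directed dialog like this incrementally fills in a form. As a sketch (the element names are illustrative, not from any published schema), the answers collected so far might populate XML instance data such as:

```xml
<!-- hypothetical instance data built up by the prompts above -->
<order>
  <pizza>
    <size>large</size>
    <quantity>2</quantity>
    <topping>mozzarella</topping>
    <topping>pepperoni</topping>
  </pizza>
</order>
```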

Kinds of multimodal interaction

Interaction limited to specific modality:

User chooses which modality to employ:

Input involves joint use of multiple modalities

User preferences and situational context

Effect of varying capabilities on architecture

Desktop computer with integrated large vocabulary speech recognition capabilities

PDA with limited storage

Low-end phone without any speech recognition capabilities

Other possibilities include using local recognition for standard navigation and control commands (ETSI STF 182), with remote speech services for other purposes.

Events as the glue between modalities

XML Events can be harnessed to link devices and modalities

Distributed architecture for events

Example events:

Giving authors control over synchronization
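As a sketch of the glue, an XML Events listener can route an event raised in one modality to a handler that updates another. Only the namespace below comes from the XML Events specification; the event name and ids are illustrative assumptions:

```xml
<!-- route a recognition result from the voice modality to a
     handler that updates the visual form; "recognition-done",
     "voice-input" and "update-form" are hypothetical names -->
<ev:listener event="recognition-done"
             observer="voice-input"
             handler="#update-form"
             xmlns:ev="http://www.w3.org/2001/xml-events"/>
```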

Dialog Models

Synchronous versus asynchronous input:

Error recovery mechanisms:

Goal driven dialogs:

Actions as conditional changes to instance data

XForms provides the means to query instance data via XPath and to alter it via mutation operators
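A hedged sketch of both operations, assuming a simple pizza-order instance with pizza, size and drinks elements (illustrative names, not from any published schema):

```xml
<!-- query: make the drinks question relevant only once
     at least one pizza has been ordered -->
<xforms:bind nodeset="drinks"
             relevant="count(../pizza) &gt; 0"
             xmlns:xforms="http://www.w3.org/2002/xforms"/>

<!-- mutation: record the spoken answer in the instance -->
<xforms:setvalue ref="pizza/size" value="'large'"
                 xmlns:xforms="http://www.w3.org/2002/xforms"/>
```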

User: I would like two large pizzas with mozzarella
and one small pizza with tomatoes and anchovies
Computer: Would you like any drinks with that?
User: Sure, 3 large diet cokes, oh and add
 pepperoni to the large pizzas

Speech interpretation yields if/then actions on instance data

Both the if and the then need access to a wider context than the XForms instance data, for example, user preferences, system settings, the user interface and more.
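For example, "add pepperoni to the large pizzas" might be interpreted as a conditional change to the instance data. A sketch using XForms actions, assuming the same illustrative order instance; note how quickly the wider context matters:

```xml
<!-- "oh and add pepperoni to the large pizzas":
     clone the first large pizza's last topping and set the
     copy to pepperoni; the insert does nothing if no large
     pizza exists. Covering *every* large pizza, or asking a
     clarifying question instead, needs context beyond the
     instance data itself. -->
<xforms:action xmlns:xforms="http://www.w3.org/2002/xforms">
  <xforms:insert nodeset="pizza[size='large'][1]/topping"
                 at="last()" position="after"/>
  <xforms:setvalue ref="pizza[size='large'][1]/topping[last()]"
                   value="'pepperoni'"/>
</xforms:action>
```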

XForms and multimodal interaction

Testimonial to the power of this technology