Multimodal Interaction

A vision of the opportunities and
the standards needed to enable them

Dave Raggett <dsr@w3.org>

W3C Fellow on assignment
from Canon

W3C Mission

Extending the Web to allow multiple modes of interaction
- GUI, Speech, Vision, Pen, Gestures, Haptic interfaces, ...
- Different equipment
Augmenting human-to-computer and human-to-human interaction
- Communication services involving multiple devices and multiple people
Anywhere, Any device, Any time
- Services that dynamically adapt to the device, user and environmental conditions
Accessible to anyone

Mobile — Phones and PDAs

Convenient choice of modalities for every situation
- Input with speech, audio, keystrokes, and pen/stylus
- Output with display, speech and audio
Additional capabilities
- Camera for photos and video
- Location and orientation information
- Biometric authentication: voice, image, fingerprint
Embedded or distributed architectures
- Interaction management and speech processing
- Multiple devices and protocols

Automotive and Telematics

Telematics — networking the car
Emergence of high resolution color displays integrated into the dashboard
Hands-free access for the driver
- Buttons and softkeys for passenger or when stationary
Must work in extremes of temperature/humidity
New frontier for mobile applications
- Embedded and distributed applications
- Entertainment, information, communications, navigation

Multimodal in the Office

Multimodal interfaces have benefits for
- Desktop workstations
- Wall mounted interactive displays
- Multi-function copiers
- And other office equipment
Use of mobile devices in office environment
- Syncing appointments/address books
- As preferred personal interface for control of other resources

Multimodal in the Home

Home PCs
Living room home entertainment systems
Games systems
Mobile devices
Other embedded devices

History

Initial work on requirements in Voice Browser WG
- Also draft NLSML spec, picked up by IETF for SpeechSC
Joint W3C/WAP Forum MMI workshop in 2000
W3C Multimodal Interaction WG launched in 2002
Member contributions on SALT, X+V
Work on use cases and requirements
W3C Multimodal Interaction Framework
Draft specs for EMMA and InkML
MMI WG now in the process of rechartering

Who is currently involved?

Access, Alcatel, Apple, Aspect, AT&T, Avaya, BeVocal, Canon, Cisco, Comverse, EDS, Ericsson, France Telecom, Fraunhofer Institute, HP, IBM, INRIA, Intel, IWA/HWG, KAIT, Kirusa, Loquendo, Microsoft, Mitsubishi Electric, NEC, Nokia, Nortel Networks, Nuance Communications, OnMobile Systems, Openstream, Opera Software, Oracle, Panasonic, ScanSoft, Siemens, SnowShore Networks, Sun Microsystems, Telera, Tellme Networks, T-Online International, Toyohashi University of Technology, V-Enable, Vocalocity, VoiceGenie Technologies, Voxeo

Multimodal Interaction Framework

A level of abstraction above an architecture

MMI Framework

MMI Framework - details

EMMA

Extensible Multi-Modal Annotation
- Input processors express interpreted input as XML
- Annotations for confidence scores, N-best lists, time stamps, mode of input etc.
Supersedes earlier work on NLSML
Semantic Interpretation language
- Used together with SRGS (speech recognition grammars)
Designed for network and embedded use
- No need to serialize to XML for embedded case
  - Instead transfer as in-memory data structures
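
For the embedded case, where interpretations are handed over as in-memory data structures rather than serialized XML, a minimal Python sketch might look like the following. The class and field names (Interpretation, NBest, confidence, mode, start_ms, end_ms) are hypothetical, chosen only to mirror the annotations listed above; they are not taken from the EMMA draft.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Interpretation:
    """One candidate interpretation of a user input (hypothetical structure)."""
    value: Dict[str, Any]   # application-specific semantic result
    confidence: float       # recognizer score between 0.0 and 1.0
    mode: str               # mode of input, e.g. "speech", "ink", "dtmf"
    start_ms: int           # time stamp when the input began
    end_ms: int             # time stamp when the input ended


@dataclass
class NBest:
    """An N-best list: alternative interpretations of the same input."""
    candidates: List[Interpretation] = field(default_factory=list)

    def best(self) -> Optional[Interpretation]:
        # Highest-confidence candidate, or None when the list is empty.
        return max(self.candidates, key=lambda c: c.confidence, default=None)


# Two competing recognition results for the same utterance.
nbest = NBest([
    Interpretation({"city": "Austin"}, 0.60, "speech", 1000, 1800),
    Interpretation({"city": "Boston"}, 0.75, "speech", 1000, 1800),
])
top = nbest.best()
```

An interaction manager could consume `top` directly, or walk `nbest.candidates` when it needs to ask the user to disambiguate.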

Modality Interfaces

Abstract software interface between I/O processor and host environment
- OMG IDL interfaces for W3C DOM
Voice Browser WG for Voice/DTMF
MMI WG for ink and keystrokes
Use of grammars to constrain input
- Robust speech and handwriting recognition
Interoperability challenge for pen gestures
- Standardize at level of interpretations?
- Meaning of “tap” is context sensitive

Interaction Management

Studying range of approaches to identify opportunities for standardization
- Possible approaches may include
  - Simple event handlers
  - State machines
  - Domain specific task and data models
High level language or low level kernel?
- Still at an early stage of discussion
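
To make the state machine option concrete, here is a small Python sketch of a table-driven dialog controller. It is an illustration only: the state names, event names and the DialogStateMachine class are assumptions made for this example, not something the working group has proposed.

```python
from typing import Dict, List, Tuple

# Transition table: (current state, event) -> next state.
# All state and event names here are hypothetical.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("idle", "speech_start"): "listening",
    ("listening", "speech_result"): "confirming",
    ("listening", "no_input"): "idle",
    ("confirming", "yes"): "done",
    ("confirming", "no"): "listening",
}


class DialogStateMachine:
    def __init__(self, initial: str = "idle") -> None:
        self.state = initial

    def handle(self, event: str) -> str:
        # Events with no matching transition leave the state unchanged.
        self.state = TRANSITIONS.get((self.state, event), self.state)
        return self.state


machine = DialogStateMachine()
events: List[str] = ["speech_start", "speech_result", "no", "speech_result", "yes"]
trace = [machine.handle(e) for e in events]
```

A simple event handler approach would attach callbacks directly to each event instead; the table form makes the allowed dialog paths explicit and easy to inspect.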

Composite Multimodal Input

Combination of speech and pen gestures
- What restaurants are in this area?
  - Pen is used to circle area on map
- Print these
  - Pen used to select thumbnails of photos
Studying a range of approaches
- For example
  - Simple hooks
  - Semantic integration
  - XSLT
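
One way to picture semantic integration for the "What restaurants are in this area?" example is to fill the unresolved slot in the spoken command from a pen gesture that occurred at about the same time. The Python sketch below assumes a hypothetical InputEvent record, a 500 ms slack window and invented slot names ("action", "target", "region"); it shows one possible approach, not a proposed standard.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class InputEvent:
    mode: str               # "speech" or "ink"
    start_ms: int
    end_ms: int
    data: Dict[str, Any]    # interpreted input (hypothetical slot names)


def overlaps(a: InputEvent, b: InputEvent, slack_ms: int = 500) -> bool:
    # True when the two time intervals overlap, allowing some slack.
    return a.start_ms <= b.end_ms + slack_ms and b.start_ms <= a.end_ms + slack_ms


def integrate(speech: InputEvent, pen_events: List[InputEvent]) -> Optional[Dict[str, Any]]:
    """Fill the speech command's empty 'target' slot from a nearby pen gesture."""
    if speech.data.get("target") is not None:
        return dict(speech.data)          # nothing to resolve
    for pen in pen_events:
        if overlaps(speech, pen):
            merged = dict(speech.data)
            merged["target"] = pen.data["region"]
            return merged
    return None                           # no gesture to resolve "this area"


speech = InputEvent("speech", 1000, 2500, {"action": "find_restaurants", "target": None})
circle = InputEvent("ink", 1800, 2300, {"region": [(10, 10), (40, 35)]})
late = InputEvent("ink", 9000, 9500, {"region": [(0, 0), (5, 5)]})

result = integrate(speech, [late, circle])
```

Here `result` combines the spoken action with the circled map region; the gesture made several seconds later is ignored.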

Dynamic Adaptation

Adapting to device, user and environment
- Low battery alerts
- Loss of network connectivity
- User mutes microphone
- Snapping camera onto cell phone
- Bluetooth to couple camera and printer
System and Environment framework
- IDL interfaces for a hierarchy of properties
- Complements Device Independence WG work on
  - Core presentation attributes
  - CC/PP for server-side content selection/adaptation

System and Environment

Property hierarchy

Namespaces isolate definitions from different organizations
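
A rough Python sketch of a namespaced property hierarchy with change notification follows. The PropertyTree class, its method names, the namespace URIs and the property paths are all hypothetical; the actual interfaces are to be defined in IDL and may look quite different.

```python
from typing import Any, Callable, Dict, List, Tuple

Listener = Callable[[str, str, Any], None]

# Hypothetical namespace URIs for two independent organizations.
DEVICE_NS = "http://example.org/device"
VENDOR_NS = "http://example.com/vendor"


class PropertyTree:
    def __init__(self) -> None:
        # Values are keyed by (namespace, path), so equal paths defined
        # by different organizations never collide.
        self._values: Dict[Tuple[str, str], Any] = {}
        self._listeners: List[Tuple[str, str, Listener]] = []

    def subscribe(self, namespace: str, prefix: str, listener: Listener) -> None:
        # Listen for changes at or below `prefix` within one namespace.
        self._listeners.append((namespace, prefix, listener))

    def set(self, namespace: str, path: str, value: Any) -> None:
        self._values[(namespace, path)] = value
        for ns, prefix, listener in self._listeners:
            if ns == namespace and (path == prefix or path.startswith(prefix + "/")):
                listener(namespace, path, value)

    def get(self, namespace: str, path: str, default: Any = None) -> Any:
        return self._values.get((namespace, path), default)


tree = PropertyTree()
alerts: List[Tuple[str, Any]] = []
tree.subscribe(DEVICE_NS, "power", lambda ns, path, value: alerts.append((path, value)))

tree.set(DEVICE_NS, "power/battery/level", 0.08)   # notifies the subscriber
tree.set(VENDOR_NS, "power/battery/level", 0.50)   # different namespace, no notification
```

An application could use such a subscription to react to a low battery alert, for example by switching from speech output to text.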

Sessions

Dynamic configurations and distributed applications present new challenges to Web developers
Sessions can help with:
- Basis for subscribing to events
- Synchronizing data
- Hiding details of protocols and addressing mechanisms
Layering of concerns
- Keeping application markup simple
- Building on top of existing mechanisms
  - SIP, Web Services, de facto industry specific solutions

InkML

XML transfer format for ink traces as part of multimodal applications
Allows server-side processing of
- Drawings
- Handwriting
- Gestures
- Signature verification
- Specialized notations like math, music and chemistry
Developed with the help of Apple, Corel, Fraunhofer Gesellschaft, HP, IBM, Intel and Motorola
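
As an illustration of server-side processing, the Python sketch below reads ink traces from a small XML fragment. It assumes the simplest possible encoding: an ink root element holding trace elements, each a comma-separated list of "x y" samples, with no namespace. Treat the sample markup as illustrative rather than as the exact draft syntax; the parse_traces helper is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

Point = Tuple[float, float]

# Illustrative sample: two pen strokes, each a list of "x y" points.
SAMPLE = """<ink>
  <trace>10 0, 9 14, 8 28, 7 42</trace>
  <trace>130 155, 144 159, 158 160</trace>
</ink>"""


def parse_traces(xml_text: str) -> List[List[Point]]:
    """Return one list of (x, y) points per trace element."""
    root = ET.fromstring(xml_text)
    traces: List[List[Point]] = []
    for trace in root.iter("trace"):
        points: List[Point] = []
        for sample in (trace.text or "").split(","):
            fields = sample.split()
            if len(fields) >= 2:
                # Only the first two channels (x and y) are used here.
                points.append((float(fields[0]), float(fields[1])))
        traces.append(points)
    return traces


traces = parse_traces(SAMPLE)
```

The resulting point lists could then be passed to a handwriting recognizer, a gesture classifier or a signature verifier.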

Questions?