W3C

Multimodal Interaction Use Cases

W3C NOTE 4 December 2002

This version:
http://www.w3.org/TR/2002/NOTE-mmi-use-cases-20021204/
Latest version:
http://www.w3.org/TR/mmi-use-cases/
Previous version:
This is the first publication.
Editors:
Emily Candell, Dave Raggett

Abstract

The W3C Multimodal Interaction Activity is developing specifications as a basis for a new breed of Web applications in which you can interact using multiple modes of interaction, for instance, using speech, handwriting, and key presses for input, and spoken prompts, audio and visual displays for output. This document describes several use cases for multimodal interaction and presents them in terms of varying device capabilities and the events needed by each use case to couple different components of a multimodal application.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. The latest status of this document series is maintained at the W3C.

W3C's Multimodal Interaction Activity is developing specifications for extending the Web to support multiple modes of interaction. This document describes several use cases as the basis for gaining a better understanding of the requirements for multimodal interaction, and the kinds of information flows needed for multimodal applications.

This document has been produced as part of the W3C Multimodal Interaction Activity, following the procedures set out for the W3C Process. The authors of this document are members of the Multimodal Interaction Working Group (W3C Members only). This is a Royalty Free Working Group, as described in W3C's Current Patent Practice NOTE. Working Group participants are required to provide patent disclosures.

Please send comments about this document to the public mailing list: www-multimodal@w3.org (public archives). To subscribe, send an email to <www-multimodal-request@w3.org> with the word subscribe in the subject line (include the word unsubscribe if you want to unsubscribe).

A list of current W3C Recommendations and other technical documents including Working Drafts and Notes can be found at http://www.w3.org/TR/.

1. Introduction

Analysis of use cases provides insight into the requirements for applications likely to require a multimodal infrastructure.

The use cases described below were selected for analysis in order to highlight different requirements resulting from application variations in areas such as device requirements, event handling, network dependencies and methods of user interaction.

It should be noted that although the results of this analysis will be used as input to the Multimodal Specification being developed by the W3C Multimodal Interaction Working Group, there is no guarantee that all of these applications will be implementable using the language defined in the specification.

1.1 Use Case Device Classification

Thin Client

A device with little processing power that can capture user input (microphone, touch display, stylus, etc.) as well as non-user input such as GPS. The device may have a very limited capability to interpret the input, for example small-vocabulary speech recognition or a character recognizer. The bulk of the processing, including natural language processing and dialog management, occurs on the server.

An example of such a device may be a mobile phone with Distributed Speech Recognition (DSR) capabilities and a visual browser (there could actually be thinner clients than this).

Thick Client

A device with powerful processing capabilities, such that most of the processing can occur locally. Such a device is capable of input capture and interpretation. For example, the device can have a medium vocabulary speech recognizer, a handwriting recognizer, natural language processing and dialog management capabilities. The data itself may still be stored on the server.

An example of such a device may be a recent production PDA or an in-car system.

Medium Client

A device capable of input capture and some degree of interpretation. The processing is distributed in a client/server or a multidevice architecture. For example, a medium client will have the voice recognition capabilities to handle small-vocabulary command and control tasks but connects to a voice server for more advanced dialog tasks.
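
The thin/medium/thick distinction can be pictured as a delivery context that the device declares when it registers with the server (compare the register_device (delivery_context) event in Table 4). Below is a minimal, non-normative sketch of such a profile in TypeScript; every type and field name is invented for illustration and is not part of any W3C specification.

  // Hypothetical model of a device's delivery context; names are illustrative,
  // not part of any specification.
  type DeviceClass = "thin" | "medium" | "thick";

  interface DeliveryContext {
    deviceClass: DeviceClass;
    inputModes: Array<"voice" | "pen" | "keypad" | "touch">;
    outputModes: Array<"audio" | "visual">;
    localASR: "none" | "small-vocabulary" | "medium-vocabulary";
    localHandwritingRecognition: boolean;
    bandwidthKbps: number;
  }

  // A thin client captures input but leaves interpretation to the server.
  const dsrPhone: DeliveryContext = {
    deviceClass: "thin",
    inputModes: ["voice", "keypad"],
    outputModes: ["audio", "visual"],
    localASR: "none",
    localHandwritingRecognition: false,
    bandwidthKbps: 40,
  };

  // A medium client handles small-vocabulary command and control locally
  // and defers richer dialog tasks to a voice server.
  function needsVoiceServer(ctx: DeliveryContext, vocabularySize: number): boolean {
    if (ctx.deviceClass === "thick") return false;
    return ctx.localASR === "none" || vocabularySize > 50; // threshold is illustrative
  }

  console.log(needsVoiceServer(dsrPhone, 200)); // true: route to the voice server

The needsVoiceServer check mirrors the medium-client behaviour described above: small command-and-control vocabularies are handled locally, while larger dialog tasks are routed to a voice server.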

1.2 Use Case Summaries

Table 1: Form Filling for air travel reservation

Description: The means for a user to reserve a flight using a wireless personal mobile device and a combination of input and output modalities. The dialogue between the user and the application is directed through the use of a form-filling paradigm.
Device Classification: Thin and medium clients
Device Details: touch-enabled display (i.e., supports pen input), voice input, local ASR and Distributed Speech Recognition Framework, local handwriting recognition, voice output, TTS, GPS, wireless connectivity, roaming between various networks
Execution Model: Client Side Execution
Scenario Details

User wants to make a flight reservation with his mobile device while he is on the way to work. The user initiates the service by making a phone call to a multimodal service (telephone metaphor) or by selecting an application (portal environment metaphor). The details are not described here.

As the user moves between networks with very different characteristics, the user is offered the flexibility to interact using the preferred and most appropriate modes for the situation. For example, while sitting in a train, the use of stylus and handwriting can achieve higher accuracy than speech (due to surrounding noise) and protect privacy. When the user is walking, the more appropriate input and output modalities would be voice with some visual output. Finally, at the office the user can use pen and voice in a synergistic way.

The dialogue between the user and the application is driven by a form-filling paradigm where the user provides input to fields such as "Travel Origin:", "Travel Destination:", "Leaving on date", "Returning on date". As the user selects each field in the application to enter information, the corresponding input constraints are activated to drive the recognition and interpretation of the user input. The capability of providing composite multimodal input is also examined, where input from multiple modalities is combined for the interpretation of the user's intent.
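
To make the field-by-field exchange concrete, the sketch below mocks up the on_focus / listen_ack / submit_partial round trip that appears in Table 4. The event names follow the tables in this document; the handler bodies, grammar file names and field names are assumptions made purely for illustration.

  // Illustrative sketch of the form-filling exchange: focusing a field activates
  // field-specific constraints, and an accepted result is submitted to the server.
  interface OnFocusEvent { type: "on_focus"; fieldName: string; }
  interface ListenAck { type: "listen_ack"; fieldGrammar: string; }
  interface SubmitPartial { type: "submit_partial"; fieldName: string; value: string; }

  // Server side: return the constraints (grammar) to activate for a field.
  function onFocus(event: OnFocusEvent): ListenAck {
    const grammars: Record<string, string> = {
      destination: "city-names.grxml",     // hypothetical grammar resources
      "leaving-on-date": "dates.grxml",
    };
    return { type: "listen_ack", fieldGrammar: grammars[event.fieldName] ?? "default.grxml" };
  }

  // Device side: once local recognition (speech or handwriting) produces a result
  // the user accepts, submit the partially filled form back to the server.
  function acceptResult(fieldName: string, recognizedValue: string): SubmitPartial {
    return { type: "submit_partial", fieldName, value: recognizedValue };
  }

  const ack = onFocus({ type: "on_focus", fieldName: "destination" });
  console.log(ack.fieldGrammar);                       // "city-names.grxml"
  console.log(acceptResult("destination", "Boston"));  // submitted to the server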

Table 2: Driving Directions

Description: This application provides a mechanism for a user to request and receive driving directions via speech and graphical input and output.
Device Classification: Medium client
Device Details: on-board system (in a car) with a graphical display, map database, touch screen, voice and touch input, speech output, local ASR and TTS processing, and GPS
Execution Model: Client Side Execution
Scenario Details

User wants to go to a specific address from his current location and while driving wants to take a detour to a local restaurant (the user knows neither the restaurant's address nor its name). The user initiates service via a button on his steering wheel and interacts with the system via the touch screen and speech.

Table 3: Name Dialing

Description: The means for users to call someone by saying their name.

Device Classification: Thin and thick clients

Device Details: Telephone

Execution Model: The study covers several possibilities:

  • whether the application runs in the device or the server
  • whether the device supports limited local speech recognition

These choices determine the kinds of events that are needed to coordinate the device and the network-based services.

Scenario Details

Janet presses a button on her multimodal phone and says a command such as "Call Wendy Smith" (the full set of example commands is given in Section 2.3).

The application initially looks for a match in Janet's personal contact list and if no match is found then proceeds to look in other directories. Directed dialog and tapered help are used to narrow down the search, using aural and visual prompts. Janet is able to respond by pressing buttons, or tapping with a stylus, or by using her voice.

Once a selection has been made, rules defined by Wendy are used to determine how the call should be handled. Janet may see a picture of Wendy along with a personalized message (aural and visual) that Wendy has left for her. Call handling may depend on the time of day, the location and status of both parties, and the relationship between them. An "ex" might be told to never call again, while Janet might be told that Wendy will be free in half an hour after Wendy's meeting has finished. The call may be automatically directed to Wendy's home, office or mobile phone, or Janet may be invited to leave a message.

2. Use Case Details

2.1 Use-case: Form filling for air travel reservation

Description: The air travel reservation use case describes a scenario in which the user books a flight using a wireless personal mobile device and a combination of input and output modalities.

The device has a touch-enabled display (i.e., supports pen input) and it is voice enabled. The use case describes a rich multimodal interaction model that allows the user to start a session while commuting on the train, continue the interaction while walking to his office and complete the transaction while seated at his office desk. As the user moves between environments with very different characteristics, the user is given the opportunity to interact using the preferred and most appropriate modes for the situation. For example, while sitting in a train, the use of stylus and handwriting can offer higher accuracy than speech (due to noise) and protect privacy. When the user is walking, the more appropriate input and output modalities would be voice with some visual output. Finally, at the office the user can use pen and voice in a synergistic way.

This example assumes the seamless transition through a variety of connectivity options such as high bandwidth LAN at the office (i.e., 802.11), lower bandwidth while walking (i.e., cellular network such as GPRS) and low bandwidth but in addition intermittent connectivity while on the train (e.g., can get disconnected when going through a tunnel). The scenario also takes advantage of network services such as location and time.

Actors

Additional Assumptions

Table 4: Event Table

User Action

Action on device

Events sent from device

Action on server

Events sent from server

Device turned on

Registers with network and uploads delivery context [available I/O modalities, bandwidth, user-specific info (e.g., home city)]

register_device (delivery_context)

Complete session initiation by registering device and delivery context (init_session)

register_ack

User picks travel app (taps with stylus or says travel)

Client side of application is started

app_connect (app_name)

Loads a page that is appropriate to current profile

app_connect_ack (start_page)

Application is running and ready to take input. Origin city was guessed from user profile or location service. User is on the train. Active I/O modalities are pen, display and audio output.

User picks a field in the form to interact with, using the stylus

Destination field gets highlighted

on_focus (field_name)

Server loads the appropriate constraints for input on this field. Constraints are sent to the device for handwriting recognition.

listen_ack (field_grammar)

User starts writing. When he is finished

Handwriting recognition performed locally with visual and audio presentation of result (i.e., earcon)

 

 

 

If recognition confidence is low, a different earcon is played and pop-up menu of top-n hypotheses is displayed.

User approves result by moving to next field with stylus (e.g., departure time)

Result is submitted to server.

 

Time field is highlighted.

submit_partial (destination)

on_focus (field_name)

Dialog state is updated. Appropriate constraints for input on this field are loaded. Grammar constraints are sent to the device

listen_ack (field_grammar)

User gets off the train and starts walking - I/O modality is voice only

User explicitly switches profile via a button press, or the profile is changed through non-user sensory input

Profile update - only voice enabled input with voice and visual output

update (delivery_context)

Speech recognition and output module initialization. Synchronization of dialog state between modalities. Audio prompt "what time do you want to leave" is generated.

send (audio_prompt)

In response to audio prompt, user says "I want a flight in the morning".

Audio is collected and sent to the server through a data or voice channel

send (audio)

Recognizes voice and generates list of hypotheses. Corresponding audio prompt is created (e.g., "would you like to fly at 10 or 11 in the morning").

send (audio_prompt)

While walking, field selection is either driven by the dialog engine on the server, or by the user uttering simple phrases (e.g., voice graffiti)

User reaches his office.

User explicitly switches profile via a button press, or the profile is changed through non-user sensory input.

Events and handlers as previously for changing the delivery context to accommodate interaction via voice, pen and GUI selection

   

At this point in the dialogue, it has been determined that there are no direct flights between origin and destination. The application displays available routes with in-between stops on a map and the user is prompted to select one.

User says "I would like to take this one" while making a pen gesture (i.e., circling over the preferred route)

Ink and audio are collected and sent to the server with time stamp information.

send (audio)

send (ink)

Server receives the two inputs and integrates them into a semantic representation

Server updates app with selection, acknowledging that input integration was possible.

completeAck

At this point in the dialog, payment authorization needs to be made. User enters credit card information via voice, pen or keypad.

User provides signature for authorization purposes

Ink is collected with information about pressure and tilt.

send (ink)

Server verifies signature.

DONE
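
Read together, the events in Table 4 form a small message vocabulary between device and server. The non-normative sketch below writes that vocabulary out as tagged message types, with payload shapes guessed from the parenthesised arguments in the table; the timestamp-window check at the end is only meant to suggest how the server might pair the simultaneous pen gesture and utterance described above.

  // Non-normative sketch of the device/server messages appearing in Table 4.
  // Field shapes are inferred from the parenthesised arguments in the table.
  type DeviceToServer =
    | { type: "register_device"; deliveryContext: object }
    | { type: "app_connect"; appName: string }
    | { type: "on_focus"; fieldName: string }
    | { type: "submit_partial"; fieldName: string; value: string }
    | { type: "update"; deliveryContext: object }          // profile change, e.g. voice-only
    | { type: "send"; audio?: ArrayBuffer; ink?: ArrayBuffer; timestamp: number };

  type ServerToDevice =
    | { type: "register_ack" }
    | { type: "app_connect_ack"; startPage: string }
    | { type: "listen_ack"; fieldGrammar: string }
    | { type: "send"; audioPrompt: string }
    | { type: "completeAck" };

  // Composite input ("I would like to take this one" plus a circling pen gesture)
  // is integrated on the server by pairing inputs whose timestamps are close.
  function canIntegrate(a: { timestamp: number }, b: { timestamp: number },
                        windowMs = 2000): boolean {        // window size is illustrative
    return Math.abs(a.timestamp - b.timestamp) <= windowMs;
  }

  console.log(canIntegrate({ timestamp: 1000 }, { timestamp: 1800 })); // true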

2.2 Use-case: Driving Directions

Assumptions

Actors

Primary Device:

Data sources:

Scenario Walkthrough (User point of view)

User preferences (These may be changed on a per session basis):

User wants to go to a specific address from his current location and while driving wants to take a detour to a local restaurant (the user knows neither the restaurant's address nor its name).

Table 5: Event Table

User Action/External Input | Action on Device | Event Description | Event Handler | Resulting Action
User presses button on steering wheel | Service is initiated and GPS satellite detection begins | HTTP Request to app server | App server returns initial page to device | Welcome prompts are played. Authentication dialog is initiated (may be initiated via speaker identification or key identification).
User interacts in an authentication dialog | Device executes authentication dialog using local ASR processing | HTTP Request to app server which includes user credentials | App server returns initial page to device including user preferences | User is prompted for a destination (if additional services are available after authentication, assume that user selects driving direction application)
Initial GPS Input | N/A | GPS_Data_In Event | Device handles location information | Device updates map on graphical display (assumes all maps are stored locally on device)
User selects option to change volume of on-board unit using touch display | N/A | Touch_screen_event (includes x, y coordinates) | Touch screen detects and processes input | Volume indicator changes on screen. Volume of speech output is changed
User presses button on steering wheel | Device initiates connection to ASR server | Start_Listening Event | ASR Server receives request and establishes connection | "listening" icon appears on display (utterances prior to establishing the connection are buffered)
User says destination address (may improve recognition accuracy by sending grammar constraints to server based on a local dialog with the user instead of allowing any address from the start) | N/A | N/A | ASR Server processes speech and returns results to device | Device processes results and plays confirmation dialog to user while highlighting destination and route on graphical display
User confirms destination | Device performs ASR Processing locally. Upon confirmation, destination info is sent to app server | HTTP Request is sent to app server (includes current location and destination information) | App Server processes input and returns data to device | Device processes results and updates graphical display with route and directions highlighting next step
GPS Input at regular intervals | N/A | GPS_Data_In Event | Device processes location data and checks if location milestone is hit | Device updates map on graphical display (assumes all maps are stored locally on device) and highlights current step. When milestone is hit, next instruction is played to user
GPS Input at regular intervals (indicating driver is off course) | N/A | GPS_Data_In Event | Device processes location data and determines that user is off course | Map on graphical display is updated and textual message is displayed indicating that route is not correct. Prompt is played from the device indicating that route is being recalculated
N/A | Route request is sent to app server including new location data | HTTP Request is sent to app server (includes current location and destination information) | App Server processes input and returns data to device | Device processes results and updates graphical display with route and directions highlighting next step
Alert received on device based on traffic conditions | N/A | Route_Change Alert | Device processes event and initiates dialog to determine if route should be recalculated | User is informed of traffic conditions and asked whether route should be recalculated.
User requests recalculation of route based on current traffic conditions | Device performs ASR Processing locally. Upon confirmation, destination info is sent to app server | HTTP Request is sent to app server (includes current location and destination information) | App Server processes input and returns data to device | Device processes results and updates graphical display with route and directions highlighting next step
GPS Input at regular intervals | N/A | GPS_Data_In Event | Device processes location data and checks if location milestone is hit | Device updates map on graphical display (assumes all maps are stored locally on device) and highlights current step. When milestone is hit, next instruction is played to user
User presses button on steering wheel | Connection to ASR server is established | Start_Listening Event | ASR Server receives request and establishes connection | User hears acknowledgement prompt for continuation, and "listening" icon appears on display
User requests new destination by destination type while still depressing button on steering wheel (may improve recognition accuracy by sending grammar constraints to server based on a local dialog with the user) | N/A | N/A | ASR Server processes speech and returns results to device | Device processes results and plays confirmation dialog to user while highlighting destination and route on graphical display
User confirms destination via a multiple interaction dialog to determine exact destination | Device executes dialog based on user responses (using local ASR Processing) and accesses app server as needed | HTTP requests to app server for dialog and data specific to user response | App server responds with appropriate dialog | User interacts in a dialog and selects destination. User is asked whether this is a new destination
User indicates that this is a stop on the way to original destination | Device sends updated destination information to app server | HTTP Request for updated directions (based on current location, detour destination, and ultimate destination) | App Server processes input and returns data to device | Device processes results and updates graphical display with new route and directions highlighting next step
GPS Input at regular intervals | N/A | GPS_Data_In Event | Device processes location data and checks if location milestone is hit | Device updates map on graphical display (assumes all maps are stored locally on device) and highlights current step. When milestone is hit, next instruction is played to user
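
A rough sketch of the device-side handling implied by the GPS_Data_In rows of Table 5 follows; the distance thresholds, helper functions and data shapes are all invented for this example and are not part of the use case.

  // Illustrative device-side handling of GPS_Data_In events from Table 5.
  // Distances, thresholds, and helper names are invented for this sketch.
  interface GpsFix { lat: number; lon: number; }
  interface RouteStep { location: GpsFix; instruction: string; }

  const MILESTONE_RADIUS_KM = 0.05;  // announce next instruction within ~50 m
  const OFF_COURSE_KM = 0.5;         // recalculate if more than ~500 m from every step

  function distanceKm(a: GpsFix, b: GpsFix): number {
    // Equirectangular approximation; adequate for short in-city distances.
    const dLat = (b.lat - a.lat) * 111;
    const dLon = (b.lon - a.lon) * 111 * Math.cos((a.lat * Math.PI) / 180);
    return Math.sqrt(dLat * dLat + dLon * dLon);
  }

  function onGpsDataIn(fix: GpsFix, route: RouteStep[], nextStep: number):
      { action: "announce" | "recalculate" | "update-map"; nextStep: number } {
    const step = route[nextStep];
    if (step && distanceKm(fix, step.location) <= MILESTONE_RADIUS_KM) {
      // Milestone hit: play the next instruction and advance.
      return { action: "announce", nextStep: nextStep + 1 };
    }
    // Crude off-course test for the sketch: far away from every remaining step.
    const offCourse = route.every(s => distanceKm(fix, s.location) > OFF_COURSE_KM);
    if (offCourse) {
      // Driver is off course: request a new route from the app server.
      return { action: "recalculate", nextStep };
    }
    return { action: "update-map", nextStep };
  }

  const route: RouteStep[] = [{ location: { lat: 42.36, lon: -71.06 }, instruction: "Turn left" }];
  console.log(onGpsDataIn({ lat: 42.3601, lon: -71.0601 }, route, 0)); // announce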

Protocols:

Events:

Synchronization Issues:

Latency Concerns

Scenario Considerations

Input Information:

Possible Devices:

Available Technologies:

Data sources:

Output Mechanisms:

2.3 Use Case: Multimodal Name Dialing

Overview

The Name Dialing use case describes a scenario in which users can say a name into their mobile terminals and be connected to the named person based on the called party's availability for that caller.

If the called user is not available, the calling user may be given the choice of either leaving a message on the called user's voicemail system or sending an email to the called user. The called user may provide a personalized message for the caller, including, for example, "Don't ever call me again!"

The called user is given the opportunity of selecting which device the call should be routed to, e.g. work, mobile, home, or voice mail. This may be dependent on the time of day, the called user's location, and the identity of the calling user.
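
One way to picture this routing decision is as a small rule table evaluated by the directory provider. The sketch below is a guess at what such rules might look like; the rule format, field names and default behaviour are assumptions, not anything defined by this use case.

  // Hypothetical call-routing rules for the called party; not a specified format.
  type Destination = "work" | "mobile" | "home" | "voicemail" | "reject";

  interface RoutingRule {
    callers?: string[];              // applies only to these callers, if present
    hours?: [number, number];        // applies only within these local hours
    locations?: string[];            // applies only when callee is at one of these
    routeTo: Destination;
  }

  function routeCall(rules: RoutingRule[], caller: string, hour: number,
                     calleeLocation: string): Destination {
    for (const rule of rules) {
      if (rule.callers && !rule.callers.includes(caller)) continue;
      if (rule.hours && (hour < rule.hours[0] || hour >= rule.hours[1])) continue;
      if (rule.locations && !rule.locations.includes(calleeLocation)) continue;
      return rule.routeTo;           // first matching rule wins
    }
    return "voicemail";              // default when no rule matches
  }

  // Example rules the called party might define: reject a blocked caller outright,
  // take daytime calls at work, and send everything else to voicemail.
  const calleeRules: RoutingRule[] = [
    { callers: ["blocked-ex"], routeTo: "reject" },
    { hours: [9, 17], locations: ["office"], routeTo: "work" },
    { hours: [9, 22], routeTo: "mobile" },
  ];

  console.log(routeCall(calleeRules, "janet", 10, "office")); // "work"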

The use case assumes a rich model of name dialling as an example of a premium service exploiting a range of information such as personal and network directories, location, presence, buddy lists and personalization features.

The benefits of making this a multimodal interaction include the ability to view and listen to information about the called user, and to be able to use a keypad or stylus, as an alternative to using voice as part of the name selection process.

Actors

Assumptions

The user has a device with a button that is pushed to place a call. The device has recording capabilities. [voice activation is power hungry and unreliable in noisy environments]

Both voice and data capabilities are available on the communications provider's network (not necessarily as simultaneously active modes).

If the phone supports speech recognition and there is a local copy of the personal phone contact list, then the user's spoken input is first recognized against the local directory for a possible match and if unsuccessful, the request is extended back to the directory provider.
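
A minimal sketch of this local-first matching, assuming an invented Recognizer interface on both the device and the directory provider and an illustrative confidence threshold:

  // Sketch of local-first name recognition with fallback to the directory
  // provider; the Recognizer interface and confidence threshold are assumptions.
  interface Match { name: string; number: string; confidence: number; }
  interface Recognizer { recognize(utterance: ArrayBuffer): Promise<Match[]>; }

  const CONFIDENCE_THRESHOLD = 0.6;  // illustrative value

  async function resolveName(utterance: ArrayBuffer,
                             localDirectory: Recognizer | null,
                             directoryProvider: Recognizer): Promise<Match[]> {
    if (localDirectory) {
      // Try the on-device copy of the personal contact list first.
      const local = await localDirectory.recognize(utterance);
      const confident = local.filter(m => m.confidence >= CONFIDENCE_THRESHOLD);
      if (confident.length > 0) return confident;
    }
    // No confident local match: extend the request to the directory provider.
    return directoryProvider.recognize(utterance);
  }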

The directory provider has access to a messaging service and to user profiles and presence information. The directory provider thus knows the whereabouts of each registered user - on the phone, at work, unavailable etc.

The directory provider enforces access control rules to ensure individual and corporate privacy. This isn't explored in this use case.

People can be identified by personal names like "Wendy" or by nicknames or aliases. The personal contact list provides a means for subscribers to define their own aliases, and to limit the scope of search (there are a lot of Wendys worldwide).

There is a user agent on the client device with an XHTML browser and optional speaker-dependent speech recognition capabilities.

There is a client server relationship between the user agent on the device and the directory provider.

The dialog could be driven either from the client device or from the network. This doesn't affect the user view, but does alter the events used to coordinate the two systems. This will be explored in a later section.

The Name Dialing use case will be described through the following views:

User view

User pushes a button and says

  "Call Wendy Smith"

It is also possible to say such things as:

  "Call Wendy"

  "Call Wendy Smith at work".

  "Call Wendy at home".

  "Call Wendy Smith on her mobile phone".

Multiple scenarios are possible here:

If local recognition is supported, the utterance will be first processed by a local name dialling application. If there is no match, the recorded utterance is forwarded to a network based name dialling application.

The user's personal contact list will take priority over the corporate and public directories. This is independent of whether the personal list is held locally in the device or in the network.

The following situations can arise when the user says a name (a sketch of this dispatch appears after the list):

  1. Single match — the caller is presented with information about the callee. This may include a picture taken from the callee's profile. The caller is asked for a confirmation before the call is put through.

  2. Multiple matches — if the number of matches is small (perhaps five or less), the caller is asked to choose from the list. This is presented to the caller via speech and accompanied with a display of a list of names and pictures. The caller can then:

    A further alternative is to say "that one" as the system speaks each item in the list in sequence. This method is offered in case the user needs hands- and eyes-free operation, or the device is incapable of displaying the list.

  3. Lots of matches, for example, when the caller says a common name. The caller is led through a directed dialog to narrow down the search.

  4. No recognition — the recognizer wasn't able to find a match. The user could have failed to say anything, or there could have been too much noise. A tapered help mechanism is invoked. Callers could be asked to repeat themselves, or asked to key in the number or speak it digit by digit.
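
The four outcomes above can be read as a single dispatch on match count and recognition confidence. The sketch below is only an illustration of that dispatch; the result shape, thresholds and list-size limit are assumptions rather than values taken from this use case.

  // Illustrative dispatch over the recognition outcomes listed above.
  // The result shape, thresholds, and list-size limit are assumptions.
  interface NBestResult { names: string[]; confidence: number; }

  type Outcome =
    | { kind: "confirm-single"; name: string }         // 1. single match
    | { kind: "choose-from-list"; names: string[] }    // 2. a few matches
    | { kind: "directed-dialog" }                      // 3. too many matches
    | { kind: "tapered-help" };                        // 4. no recognition

  const MAX_LIST = 5;
  const MIN_CONFIDENCE = 0.4;

  function dispatch(result: NBestResult | null): Outcome {
    if (!result || result.confidence < MIN_CONFIDENCE || result.names.length === 0) {
      return { kind: "tapered-help" };                 // repeat, key in, or spell digits
    }
    if (result.names.length === 1) {
      return { kind: "confirm-single", name: result.names[0] };
    }
    if (result.names.length <= MAX_LIST) {
      return { kind: "choose-from-list", names: result.names };
    }
    return { kind: "directed-dialog" };                // narrow the search step by step
  }

  console.log(dispatch({ names: ["Wendy Smith"], confidence: 0.9 })); // confirm-single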

Assuming that the user successfully makes a selection:

The availability of the called user may depend on the time of day, whether the called user is away from her work or home location, and who the calling user is. For example, when travelling you may want to take calls on your mobile during the day. Don't you hate it when people call you in the middle of the night because they don't realize what timezone you are in! You may want to make an exception for close friends and family members. There may also be some people whom you never want to accept calls from, not even voice messages!

When a user is notified of an incoming call, the device may present information on the caller including a photograph, name, sound bite, location and local time information, depending on the relationship between the caller and callee. The user then has an opportunity to accept the call or to divert it to voice mail.

Directory provider view

What is driving the dialog?

The details of the events depend on whether the dialog is being driven from the network or from the user device.

When the device sends a spoken utterance to the server, the user may have spoken a name such as "Tom Smith" or spoken a command such as "the last one". If the directory search is being driven by the user device, the server's response is likely to be a short list of matches, or a command or error code. To support the application, the server would provide a suite of functions, including the means for the device to set the recognition context, the ability to play specific prompts, and to download information on named users.

If the network is driving the dialog, the device sends the spoken utterance in the same way, but the responses are actions to update the display and local state. If the caller presses a button or uses a stylus to make a selection, this event will be sent to the server. The device and server could exchange low level events, such as a stylus tap at a given coordinate, or higher level events such as which name the user has selected from the list.
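
The difference in event granularity can be made concrete with two hypothetical event shapes: a low-level stylus tap that leaves interpretation to the server, and a high-level selection that the device has already interpreted. Both shapes are invented for this sketch.

  // Two ways the same user action can cross the device/server boundary,
  // depending on which side drives the dialog. Shapes are illustrative only.
  interface StylusTapEvent {            // low level: server interprets coordinates
    type: "stylus_tap";
    x: number;
    y: number;
  }

  interface NameSelectedEvent {         // high level: device has resolved the tap
    type: "name_selected";
    index: number;                      // position in the displayed name list
    name: string;
  }

  type SelectionEvent = StylusTapEvent | NameSelectedEvent;

  // A network-driven dialog might receive raw taps and map them to list entries;
  // a device-driven dialog sends only the interpreted selection.
  function describe(event: SelectionEvent): string {
    return event.type === "stylus_tap"
      ? `tap at (${event.x}, ${event.y}); server must hit-test against the list`
      : `user selected "${event.name}" (item ${event.index})`;
  }

  console.log(describe({ type: "name_selected", index: 2, name: "Wendy Smith" }));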

Table 6: Event Table

User action

Action on device

Events sent from device

Action on server

Events sent from server

Turns on the device

Registers with the Directory Provider through the operator in the network and downloads the personal directory

register user (userId)

Directory Provider gets register information, updates user's presence and location info, loads user's personal info (buddy list, personal directory,...)

acknowledgement + personal directory

In practice, SyncML would be used to reduce net traffic

Pushes a button to place a call

Local reco initialized, activates the personal directory

     
 

Displays a prompt

"Please say a name"

     

Speaks a name

Local recognition against personal directory

     

a) If grammar matches:

 

Display the name or namelist (see following table)

     

Confirms by pressing the call button again if 1 name is displayed, or selects a name on the list (see following table)

Fetches the number from the personal directory

call(userID, number)

Checks the location and presence status of the called party

call ok(picture)
OR
called party not available

 

if call ok, displays the picture and places a call,

if called party not available, displays/plays a corresponding prompt about leaving a message or sending an e-mail

     

i) if user chooses to leave a message:

User agrees to leave a message by pressing a suitable button

Initializes the recording, displays a prompt to start the recording

     

User speaks and ends by pressing a suitable button

Closes the recording, sends the recording to the Directory Provider app

leave message(userID, number, recording)

Stores the message for the called party

message ok

ii) if user chooses to send an e-mail:

User selects 'send e-mail' option by pressing a suitable button

Starts an e-mail writing application

     

Writes e-mail

Fetches the e-mail address from the personal directory, sends e-mail, closes the e-mail app

send mail(userID, mail address, text)

Sends the e-mail to the called party

mail ok

b) if personal grammar does not match:

 

sends the utterance to be recognized in the network

send(userID, utterance)

Recognition against public directory

reco ok(namelist)

OR

reco nok

 

if reco ok, displays the name or namelist (more details in following table), activates local reco with the index list if more than one name,

if reco nok, display/play a message to the user

     

Confirms by pressing the call button again if 1 name is displayed, or selects a name on the list (see following table)

Selection received (perhaps spoken index recognized first)

call(userID, number)

Checks the location ... [continues as described above]

 

Table 7: Interaction details of displaying and confirming the recognition results

User action

Action on device

Events sent from device

Action on server

Events sent from server

... speaker utterance has been processed by the recogniser

A. Very high confidence, unique match, auto confirmation (NB! I would recommend letting the user confirm this explicitly; this would also make the application behaviour seem more consistent to the user since some kind of confirmation would be needed every time)

 

Displays the name and shows/plays clear prompt "Calling ..."

     
 

Fetches the number

call(userID, number)

Checks the location and presence status of the called party

call ok(picture)

OR

called party not available

B. High confidence, unique match, explicit confirmation

 

Displays the name and picture, prompt asking "Place a call?"

     

Confirms by pressing the call button again

Fetches the number

call(userID, number)

Checks the location and presence status of the called party

call ok(picture)

OR

called party not available

C. High confidence with several matching entries, or medium confidence with either unique match or several matching entries

 

Displays the namelist with indexes, activates index grammar on local reco; if there are multiple entries with the same spelling, additional info should be added to the list

     

Selects a name by speaking the index or navigating to the correct name with keypad and pressing the call button

Fetches the number

call(userID, number)

Checks the location and presence status of the called party

call ok(picture)

OR

called party not available

D. Low confidence, no match from the directory/ies

 

Prompts "Not found, please try again"

     

User speaks the name again

New recognition; on the 2nd or 3rd 'nomatch', change the prompt to something like "Sorry, no number found"

     

Table 8: No local recognition, all recognition in the Network

User action

Action on device

Events sent from device

Action on server

Events sent from server

Turns on the device

Registers with the Directory Provider through the operator in the network

register user(userID)

Directory Provider gets register information, updates user's presence and location info, loads user's personal info (buddy list, personal directory,...)

register ack

Pushes a button to place a call

 

init reco(userID)

Activates the personal directory and public directory

reco init ok

 

Displays a prompt

"Please say a name"

     

Speaks a name

Sends the utterance to be recognized in the network

send(userID, utterance)

Recognition against personal directory first; if no match there with confidence greater than some threshold, then against public directory

reco ok(namelist)

OR

reco nok

3. Acknowledgements

The following people contributed to this document: