VoiceXML – WAP integration issues

Jim Larson
Intel Architecture Lab

Mail Station JF3-377
2111 N.E. 25th Avenue
Hillsboro, OR 97124-5961

Jim.A.Larson@intel.com
Telephone +1 (503) 264-8463

Introduction: Why integrate WAP and VoiceXML?

WAP presents a visual user interface, yet speech is the natural mode of phone use. VoiceXML enables users to speak and hear on a phone, but does not support a visual interface. Users would benefit from interacting with the web much as they interact with each other, through a combination of visual and verbal interfaces.

This workshop should identify issues and problems with the integration of VoiceXML and WAP. The workshop should recommend a plan for resolving these issues. The following enumerates three candidate issues:

  1. Architecture – How to integrate WAP and VoiceXML architectures?

  2. User Interface – How to integrate verbal and visual user interfaces?

  3. Markup languages – How to integrate the WAP and VoiceXML Markup languages?

1. Architecture – How to integrate WAP and VoiceXML architectures?

In the WAP architecture, WAP documents are stored on a document server. The WAP browser, which resides on the client, fetches WAP documents from the document server and interprets them, displaying text on the WAP display on the client.

In the VoiceXML architecture, VoiceXML documents are stored on a document server. The VoiceXML browser, which resides on a speech server, fetches VoiceXML documents from the document server and interprets them. Voice is transmitted to the client.

The major difference: WAP documents execute on the client itself, while VoiceXML documents execute on a separate speech server. Where should the VoiceXML and WAP commands be executed? I see three major approaches:

Approach 1.1. Execute both VoiceXML and WAP commands on the client.

In this approach, the client processor is powerful enough to process both VoiceXML and WAP commands, including the speech recognition and text-to-speech processing required by VoiceXML. This requires a more expensive processor and more memory than a traditional WAP processor. However, the client can perform useful processing when disconnected from the server.

Approach 1.2. Execute both VoiceXML and WAP commands on the server.

The client is nothing more than a phone with a display. The client cannot perform any processing when disconnected from the server.

Approach 1.3. Execute all WAP commands and "voice extraction" on the client. Execute "classification" and the VoiceXML commands on the server.

The client contains a speech feature extractor and a limited feature classifier, enabling it to execute voice-enabled versions of private applications of the type often available on a PDA, such as application switching, shopping lists, telephone lists, currency conversion, and calorie counting. The server contains a powerful feature classifier. For dictation, the client extracts features from the dictated speech and stores them until it is connected to the server, which then completes the dictation process.
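The store-and-forward behaviour described in Approach 1.3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `extract_features`, `Client`, and `server_classify` are hypothetical, the "features" are a placeholder (mean absolute amplitude per frame) rather than real acoustic features, and the server classifier is a stub.

```python
from collections import deque
from typing import Callable, Deque, List

Features = List[float]


def extract_features(samples: List[float], frame_size: int = 4) -> Features:
    """Placeholder extractor: mean absolute amplitude per frame.

    A real client would compute acoustic features; this stands in for them.
    """
    frames = [samples[i:i + frame_size] for i in range(0, len(samples), frame_size)]
    return [sum(abs(s) for s in frame) / len(frame) for frame in frames]


class Client:
    """Extracts features locally and queues them until a server is reachable."""

    def __init__(self) -> None:
        self.pending: Deque[Features] = deque()

    def dictate(self, samples: List[float]) -> None:
        # Feature extraction always runs on the client, connected or not.
        self.pending.append(extract_features(samples))

    def sync(self, classify: Callable[[Features], str]) -> List[str]:
        # Once connected, hand each stored feature vector to the server-side
        # classifier in the order it was recorded, emptying the queue.
        results = []
        while self.pending:
            results.append(classify(self.pending.popleft()))
        return results


def server_classify(features: Features) -> str:
    """Hypothetical server classifier stub: reports the frame count only."""
    return f"utterance with {len(features)} frames"


client = Client()
client.dictate([0.1, -0.2, 0.3, -0.4, 0.5, -0.6])  # recorded while disconnected
client.dictate([0.0, 0.0, 0.0, 0.0])
transcripts = client.sync(server_classify)          # connection restored
print(transcripts)
```

The point of the sketch is the division of labour: the client only needs enough memory to hold feature vectors, which are typically much smaller than raw audio, while the expensive classification waits for the server.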

2. User Interface

There are at least two ways to integrate the use of verbal and visual user interfaces:

Approach 2.1. The client supports parallel verbal and visual user interfaces.

Users should be able to use the client either as a telephone (and ignore the visual user interface) or as a PDA (and ignore the verbal user interface).

Approach 2.2. The client supports an integrated verbal and visual user interface.

Users may speak and press buttons to enter a single request. For example, the user hears a verbal prompt but selects from among visual options by pressing a button. As another example, the user selects from among visual options by pressing a button and then speaks the value of a parameter.
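One way to picture the integrated interface of Approach 2.2 is as a merge of input events from both modes into a single request. The Python sketch below is an assumption-laden illustration, not a proposed design: `InputEvent`, `merge_request`, and the slot names `action` and `amount` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class InputEvent:
    mode: str   # "button" or "speech"
    slot: str   # which part of the request this event fills
    value: str


def merge_request(events: List[InputEvent], required: List[str]) -> Optional[Dict[str, str]]:
    """Combine events from both modes into one request.

    Returns None while any required slot is still unfilled. If both modes
    fill the same slot, the later event wins.
    """
    request: Dict[str, str] = {}
    for event in events:
        request[event.slot] = event.value
    if all(slot in request for slot in required):
        return request
    return None


# The user presses a button to pick an option, then speaks a parameter value.
events = [
    InputEvent(mode="button", slot="action", value="convert currency"),
    InputEvent(mode="speech", slot="amount", value="forty dollars"),
]
request = merge_request(events, required=["action", "amount"])
print(request)
```

Under Approach 2.1, by contrast, each mode would have to fill every slot on its own, since the user may be ignoring the other interface entirely.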

3. Markup Language

The following illustrates one possible approach to integrating VoiceXML and WAP. Other approaches are certainly possible.

<menu>
  <prompt> 
    <verbal>Welcome to Ajax Travel. <question>Do
      <emphasize> you </emphasize> want to fly to
      New York, Boston, or Washington DC </question>
    </verbal>  

    <visual> Welcome to Ajax Travel, where do you
    want to fly? </visual>  
  </prompt>

  <choice next="http://www.NY…">
    <verbal>New York | The Big Apple </verbal>   
    <visual> 1. New York </visual>   
  </choice>

  <choice next="http://www.Boston...">
    <verbal> Boston | Beantown </verbal>   
    <visual> 2. Boston </visual>   
  </choice>

  <choice next="http://www.Wash.">
    <verbal>
      Washington D.C. | Washington | The U.S. Capital
    </verbal>   
    <visual> 3. Washington D.C. </visual>   
  </choice>
</menu>
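To make the intent of the sample concrete, the following Python sketch shows how a browser might pull either the verbal or the visual rendering out of one such document. It is a minimal illustration under assumptions: the `render` and `text_of` helpers are hypothetical, the "|" character is assumed to separate alternative spoken phrases, and the truncated URLs are copied from the sample exactly as written.

```python
import xml.etree.ElementTree as ET

MENU = """
<menu>
  <prompt>
    <verbal>Welcome to Ajax Travel. <question>Do
      <emphasize> you </emphasize> want to fly to
      New York, Boston, or Washington DC </question>
    </verbal>
    <visual> Welcome to Ajax Travel, where do you
    want to fly? </visual>
  </prompt>
  <choice next="http://www.NY…">
    <verbal>New York | The Big Apple </verbal>
    <visual> 1. New York </visual>
  </choice>
  <choice next="http://www.Boston...">
    <verbal> Boston | Beantown </verbal>
    <visual> 2. Boston </visual>
  </choice>
  <choice next="http://www.Wash.">
    <verbal>
      Washington D.C. | Washington | The U.S. Capital
    </verbal>
    <visual> 3. Washington D.C. </visual>
  </choice>
</menu>
"""


def text_of(element):
    """Flatten nested markup to plain text with normalized whitespace."""
    return " ".join("".join(element.itertext()).split())


def render(menu_xml, mode):
    """Return (prompt, choices) for mode "verbal" or "visual".

    Each choice is a (label, next_url) pair. For the verbal mode the label
    is a list of alternative phrases, assuming "|" separates them.
    """
    root = ET.fromstring(menu_xml.strip())
    prompt = text_of(root.find(f"prompt/{mode}"))
    choices = []
    for choice in root.findall("choice"):
        label = text_of(choice.find(mode))
        if mode == "verbal":
            label = [alt.strip() for alt in label.split("|")]
        choices.append((label, choice.get("next")))
    return prompt, choices


visual_prompt, visual_choices = render(MENU, "visual")
verbal_prompt, verbal_choices = render(MENU, "verbal")
print(visual_prompt)
print(verbal_choices[1])
```

A speech server would use the verbal alternatives as the recognition vocabulary, while the client display would show only the visual labels; both lead to the same `next` target.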

Markup language syntax issues include the following:

  1. Should the syntax be similar to VoiceXML or WAP?

  2. Is it possible to automatically generate a default visual component from the verbal component?

  3. Is it possible to automatically generate a default verbal component from the visual component?
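Questions 2 and 3 can be explored with a small experiment. The Python sketch below shows one naive way to derive a default for each component from the other, using labels in the style of the sample menu. The functions `default_visual` and `default_verbal` are hypothetical, and the rules they apply (take the first spoken alternative, strip a leading item number) are assumptions rather than a recommendation.

```python
import re


def default_visual(verbal: str, position: int) -> str:
    """Derive a numbered visual label from the first verbal alternative."""
    first = verbal.split("|")[0].strip()
    return f"{position}. {first}"


def default_verbal(visual: str) -> str:
    """Derive a speakable phrase by dropping the leading item number."""
    return re.sub(r"^\s*\d+\.\s*", "", visual).strip()


print(default_visual("New York | The Big Apple", 1))
print(default_verbal(" 3. Washington D.C. "))
```

Even this toy version suggests that the two directions are not symmetric: a visual default loses nothing the display needs, but a verbal default generated from "2. Boston" cannot recover a synonym such as "Beantown", so an author would still have to supply alternative phrasings by hand.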

User interface guidelines issues include the following: 

  1. How different or similar should the wording of the visual and verbal user interfaces be?

  2. Should error messages, secondary prompts, and feedback be presented verbally, visually, or both?

A systematic, universal approach to the integration of VoiceXML and WAP is preferable to the multiple ad hoc approaches that will emerge if the industry fails to work together. This workshop is the first step toward this goal.