VoiceXML and WML Convergence

Scott McGlashan
PipeBeach AB
Stockholm, Sweden
Scott.McGlashan@pipebeach.com

Introduction

PipeBeach, founded in 1998, is a European leader in the intersection of three major trends: internet, mobile communications and speech user interfaces. PipeBeach has developed the speechWeb® product, an HTML/VoiceXML platform for speech access to mobile internet applications such as email reading, voice portals, m-commerce, customer care and directory services.

PipeBeach is committed to developing products based on open standards and protocols so as to increase and consolidate the market for voice access to web content and services. speechWeb® supports current W3C standards such as HTML 4.0 and Document Object Model (DOM) Level 1, and also supports VoiceXML 1.0. Furthermore, PipeBeach is an active member of the W3C Voice Browser Working Group, working on the development of the W3C speech grammar format, and Scott McGlashan leads the dialog team specifying a W3C dialog markup language (modeled on VoiceXML 1.0 as submitted to W3C for standardization).

From Vision to Requirements

Our vision is for mobile users to be able to access internet content using both sound and vision. They will be able to request information by voice and see that information displayed on their mobile device. Likewise, they will be able to use the device to select services and hear the content through synthesized speech, or through playback of audio files or streams. Through their mobile devices, users will have a multimodal interface to internet services. Adaptivity will be critical in such an interface: the structure, and possibly the content, of services needs to be dynamically 'fitted' to the context of use, where that context includes user preferences and customization, device capabilities, and so on.

To reach this vision, we need to address many needs across a wide range of technical and business domains. One of the key requirements which could be addressed by W3C and the WAP Forum is markup language convergence. At present, there are a variety of markup languages targeted at specific interface media: (X)HTML for desktop browsers, WML/CHTML for mobile device displays, and VoiceXML for speech interfaces. To progress, we need to address issues such as:

  1. Can the languages converge through the XHTML modularization approach?
  2. What are the contents of specific XHTML modules in this scenario? Which modules come from (which parts of) VoiceXML? Which from WML? How will these modules interact?
  3. What impact will GPRS/3G have on WAP? Are the WML features which differentiate it from HTML (e.g. the 'deck' of cards; see the sketch after this list) still relevant?
  4. How will an XHTML modularization approach affect service creation? Our worry is that the price of integrating multimodality into a single markup language may be greater complexity in service creation. Effective service development tools will be even more critical than they are for single-media access.
  5. The current markup languages do not make a clear distinction between content and presentation. How can this be addressed using modularization?
  6. Is it desirable to develop in one step a single markup language for all devices and media, even if modularized? Is it possible to add other media features to each of these languages? For example, adding speech dialog modules into WML? Or adding display tags into VoiceXML?
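
To make the 'deck of cards' question in item 3 concrete, the sketch below shows a minimal WML deck; the card identifiers and text are invented for illustration, though the elements and DTD reference follow WML 1.1. A deck bundles several cards into a single document so that a WAP 1.x browser can fetch a set of related screens in one round trip, a packaging motivated by slow circuit-switched bearers which may matter much less over always-on GPRS/3G connections.

    <?xml version="1.0"?>
    <!DOCTYPE wml PUBLIC "-//WAPFORUM//DTD WML 1.1//EN"
      "http://www.wapforum.org/DTD/wml_1.1.xml">
    <wml>
      <!-- First card in the deck: a menu of headlines -->
      <card id="headlines" title="News">
        <p>
          <a href="#story1">Stock prices rally</a><br/>
          <a href="sports.wml">Sports results</a>
        </p>
      </card>
      <!-- Second card, shipped in the same deck and reached via #story1 -->
      <card id="story1" title="Markets">
        <p>Stock prices rallied today ...</p>
      </card>
    </wml>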


Expected Outcome of Workshop

We hope that the workshop will, at a minimum, provide participants with a clear overview of activities in related domains, including

  1. Voice Browser Working Group progress in specifying markup languages for recognition grammars, speech synthesis and dialog, and how these languages will interoperate. Progress in the standardization of VoiceXML. Identification of key features of speech dialog within the VoiceXML specification which could contribute to one or more XHTML modules.
  2. WAP Forum strategy on merging WML with XHTML modularization. What is the added value of WML over a subset of HTML similar to CHTML?


More specifically, we would like to see the establishment of a Working Group with the mission to drive the convergence of WML and VoiceXML so as to develop a markup language which supports at least a multimodal speech, keypad and graphics interface for mobile internet access. The group could be composed of members from the Voice Browser Working Group and the WAP Forum, as well as invited experts.

Contributions to Discussions

In this section, we outline some of the contributions we can make to the workshop.

Voice-Enabling WAP Sites by Transcoding WML to VoiceXML

Many of our customers are mobile operators or ISPs with existing web and/or WAP portals which they want to make accessible by speech to mobile phone users. In some cases, they are able to generate markup language which can be rendered on our platform; for example, the site produces VoiceXML directly. In other cases, however, they want us to provide speech access to their WAP portals; the reasons can include time-to-market and a need to minimize infrastructure costs. This has given us considerable experience in developing techniques for transforming WML into VoiceXML.
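
As a simplified illustration of what such a transformation produces, a WML card offering a list of links (such as the 'headlines' card in the deck sketched earlier) might end up as a VoiceXML 1.0 menu along the following lines. The prompts, grammar phrases and URIs are invented for this sketch; a real transcoding would also add the confirmation handling and richer grammars described below.

    <vxml version="1.0">
      <menu>
        <!-- Spoken counterpart of the links on the WML card -->
        <prompt>Say stock prices or sports results.</prompt>
        <choice next="story1.vxml">stock prices</choice>
        <choice next="sports.vxml">sports results</choice>
        <noinput>Sorry, I did not hear you.
                 Say stock prices or sports results.</noinput>
      </menu>
    </vxml>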

Major differences we have encountered can be classified into design patterns and specific language features:



Closer examination of these differences is useful in determining how the markup languages can converge within an XHTML module approach. We have addressed these differences by separating data from presentation and using XSL stylesheets for the transformations. In the first stage, the data in a WML page is extracted by a site-specific XSL transformation into an XML data structure appropriate to the application. This stage can be skipped if the site is able to provide the data directly in this format. In the second stage, another XSL stylesheet transforms the data structure into a VoiceXML document; at this stage we can expand abbreviations and add further information such as speech grammars, confirmation handling, and so on.
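
A heavily simplified sketch of a second-stage stylesheet is given below. It assumes an application-specific data structure whose root element is <newsdata> and whose <story> children carry title and href attributes; these names are our own for the purposes of the sketch, not part of any standard, and the generated VoiceXML is kept minimal.

    <?xml version="1.0"?>
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

      <!-- Map the whole data structure onto a single VoiceXML menu -->
      <xsl:template match="/newsdata">
        <vxml version="1.0">
          <menu>
            <prompt>Which story would you like to hear?</prompt>
            <!-- One spoken choice per story in the data structure -->
            <xsl:for-each select="story">
              <choice>
                <xsl:attribute name="next">
                  <xsl:value-of select="@href"/>
                </xsl:attribute>
                <xsl:value-of select="@title"/>
              </choice>
            </xsl:for-each>
          </menu>
        </vxml>
      </xsl:template>
    </xsl:stylesheet>

A second stylesheet over the same data structure can generate the WML presentation, which is what keeps the two presentation layers consistent with each other.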

While the transcoding approach has the advantage that it can provide speech access to WAP services using today's standards and technology, it is clearly unsuitable as a general approach: transforming WML pages into data structures is inherently content-specific. The more 'correct' model, as used in the second stage, is one where we start with a data model and use XSL stylesheets to specify the presentation layer for each medium.

Synchronization of WAP and VoiceXML Browsers

Using today's GSM technology, there is no obvious way to provide mobile phone users with a multimodal interface combining simultaneous voice and display interaction. Two developments will change that.

While both developments properly require a multimodal markup language, we have some experience of simulating the second development using a 'co-browser' approach with a VoiceXML browser in the network and a WAP browser in the phone.

In our simulation, the service is defined in terms of a sequence of XML data structures, together with XSL stylesheets for mapping these into VoiceXML and WML respectively: each WAP screen corresponds to a dialog state, and vice versa. The browsers are synchronized with each other by means of a synchronization data structure which specifies URI correspondences between the markup documents; for example, if the WAP browser is at /services/sports.wml, then the voice browser should be at /services/sports.vxml. Such synchronization is 'loose' in the sense that the browsers have no knowledge of each other (so there are no synergetic multimodal effects such as 'select that link') and their states are typically only synchronized when they call back to the server.

In order to perform synchronization, both the VoiceXML platform and the WAP browser need to be pushed 'change of location' instructions when the user has acted through the other browser: for example, if the user selects a headline using the WAP browser and this causes the WAP browser to load a page describing the news story, then the VoiceXML browser needs to be instructed to load the corresponding VoiceXML page. In WAP 1.2, the WAP browser can be instructed to change location using its push mechanism. For VoiceXML, our platform can be instructed to load another page, thereby canceling the current dialog.
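
A sketch of the two pieces involved is given below: a synchronization data structure listing the URI correspondences, and the WAP 1.2 Service Indication used to push the 'change of location' to the phone. The sync-map vocabulary is our own invention for illustration, the host name is fictitious, and only the si/indication elements come from the WAP Push Service Indication content type.

    <!-- URI correspondences between the two presentations of the service -->
    <sync-map>
      <state id="sports">
        <wml  uri="/services/sports.wml"/>
        <vxml uri="/services/sports.vxml"/>
      </state>
      <state id="stocks">
        <wml  uri="/services/stocks.wml"/>
        <vxml uri="/services/stocks.vxml"/>
      </state>
    </sync-map>

    <!-- WAP Push Service Indication telling the WAP browser to move to
         the page corresponding to the voice browser's new state -->
    <si>
      <indication href="http://wap.example.com/services/stocks.wml">
        Stock prices
      </indication>
    </si>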

The overall effect is that the user is able to experience a multimodal interface to services. By saying a command like 'stock prices', the relevant stock prices are simultaneously presented on the mobile display and read out. While our co-browser is only a simulation, it does demonstrate that at least simple multimodal interfaces can be built for the next generation of networks using VoiceXML and WAP browsers with 'push' capability, and a service design with a clear separation between data and presentation. One nice feature is that the different media services can be created and used independently of each other --- the only connection is the synchronization data structure.

Dialog Markup Language: Status of Requirements and Specification

The W3C Voice Browser WG has been developing requirements and specifications for a Speech Interface Framework. One part of this activity has concerned speech dialog with DTMF and speech input and with audio and speech output (i.e. not specifically multimodal dialog).

Our activity began by formulating requirements on a speech dialog specification language called 'DialogML'. The requirements document was published in December 1999. In May 2000, W3C accepted the VoiceXML 1.0 submission as a basis for the DialogML specification. Our work on this specification falls into three main categories: