VoiceXML and WML Convergence

Scott McGlashan
PipeBeach AB
Stockholm, Sweden
Scott.McGlashan@pipebeach.com

Introduction

PipeBeach, founded in 1998, is a European leader in the intersection of three major trends: internet, mobile communications and speech user interfaces. PipeBeach has developed the speechWeb® product, an HTML/VoiceXML platform for speech access to mobile internet applications such as email reading, voice portals, m-commerce, customer care and directory services.

PipeBeach is committed to developing products based on open standards and protocols so as to increase and consolidate the market for voice access to web content and services. speechWeb® supports current W3C standards such as HTML 4.0 and Document Object Model (DOM) Level 1, and also supports VoiceXML 1.0. Furthermore, PipeBeach is an active member of the W3C Voice Browser Working Group, working on the development of the W3C speech grammar format, and Scott McGlashan leads the dialog team specifying a W3C dialog markup language (modeled on VoiceXML 1.0 as submitted to W3C for standardization).

From Vision to Requirements

Our vision is for mobile users to be able to access internet content using both sound and vision. They will be able to request information by voice and see that information displayed on their mobile device. Likewise, they will be able to use the device to select services and hear the content through synthesized speech, or through playback of audio files or streams. Through their mobile devices, users will have a multimodal interface to internet services. Adaptivity will be critical in such an interface: the structure, and possibly the content, of services needs to be dynamically 'fitted' to the context of use, where that context includes user preferences and customization, device capabilities, and so on.

To reach this vision, we need to address many needs across a wide range of technical and business domains. One of the key requirements which could be addressed by W3C and the WAP Forum is markup language convergence. At present, there are a variety of markup languages targeted at specific interface media: (X)HTML for desktop browsers, WML/CHTML for mobile device displays, and VoiceXML for speech interfaces. To progress, we need to address issues such as:

  1. Can the languages converge through the XHTML modularization approach?
  2. What are the contents of specific XHTML modules in this scenario? Which modules come from (which parts of) VoiceXML? Which from WML? How will these modules interact?
  3. What impact will GPRS/3G have on WAP? Are the WML features which differentiate it from HTML (e.g. the 'deck' of cards; see the sketch after this list) still relevant?
  4. How will an XHTML modularization approach affect service creation? Our worry is that the price of integrating multimodality into a single markup language may be greater complexity in service creation. Effective service development tools will be even more critical than they are for single-media access.
  5. The current markup languages do not make a clear distinction between content and presentation. How can this be addressed using modularization?
  6. Is it desirable to develop in one step a single markup language for all devices and media, even if modularized? Is it possible to add other media features to each of these languages? For example, adding speech dialog modules into WML? Or adding display tags into VoiceXML?
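
To make the 'deck of cards' question in item 3 concrete, the sketch below shows a minimal WML deck; the card identifiers and text are invented for illustration, though the elements and DTD reference follow WML 1.1. A deck bundles several cards into a single document so that a WAP 1.x browser can fetch a set of related screens in one round trip, a packaging motivated by slow circuit-switched bearers which may matter much less over always-on GPRS/3G connections.

    <?xml version="1.0"?>
    <!DOCTYPE wml PUBLIC "-//WAPFORUM//DTD WML 1.1//EN"
      "http://www.wapforum.org/DTD/wml_1.1.xml">
    <wml>
      <!-- First card in the deck: a menu of headlines -->
      <card id="headlines" title="News">
        <p>
          <a href="#story1">Stock prices rally</a><br/>
          <a href="sports.wml">Sports results</a>
        </p>
      </card>
      <!-- Second card, shipped in the same deck and reached via #story1 -->
      <card id="story1" title="Markets">
        <p>Stock prices rallied today ...</p>
      </card>
    </wml>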


Expected Outcome of Workshop

We hope that the workshop will, at a minimum, provide participants with a clear overview of activities in related domains, including

  1. Voice Browser Working Group progress in specifying markup languages for recognition grammars, speech synthesis and dialog, and how these languages will interoperate. Progress in the standardization of VoiceXML. Identification of key features of speech dialog within the VoiceXML specification which could contribute to one or more XHTML modules.
  2. WAP Forum strategy on merging WML with XHTML modularization. What is the added value of WML over a subset of HTML similar to CHTML?


More specifically, we would like to see the establishment of a Working Group with the mission to drive the convergence of WML and VoiceXML so as to develop a markup language which supports at least a multimodal speech, keypad and graphics interface for mobile internet access. The group could be composed of members from the Voice Browser Working Group and the WAP Forum, as well as invited experts.

Contributions to Discussions

In this section, we outline some of the contributions we can make to the workshop.

Voice-Enabling WAP Sites by Transcoding WML to VoiceXML

Many of our customers are mobile operators or ISPs with existing web and/or WAP portals which they want to make accessible by speech to mobile phone users. In some cases, they are able to generate markup language which can be rendered on our platform; for example, the site produces VoiceXML directly. In other cases, however, they want us to provide speech access to their WAP portals; the reasons can include time-to-market and a need to minimize infrastructure costs. This has given us considerable experience in developing techniques for transforming WML into VoiceXML.
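
As a simplified illustration of what such a transformation produces, a WML card offering a list of links (such as the 'headlines' card in the deck sketched earlier) might end up as a VoiceXML 1.0 menu along the following lines. The prompts, grammar phrases and URIs are invented for this sketch; a real transcoding would also add the confirmation handling and richer grammars described below.

    <vxml version="1.0">
      <menu>
        <!-- Spoken counterpart of the links on the WML card -->
        <prompt>Say stock prices or sports results.</prompt>
        <choice next="story1.vxml">stock prices</choice>
        <choice next="sports.vxml">sports results</choice>
        <noinput>Sorry, I did not hear you.
                 Say stock prices or sports results.</noinput>
      </menu>
    </vxml>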

Major differences we have encountered can be classified into design patterns and specific language features:



Closer examination of these differences is useful in determining how the markup languages can converge within an XHTML module approach. We have addressed these differences by separating data from presentation and using XSL stylesheets for the transformations. In the first stage, the data in a WML page is extracted by a site-specific XSL transformation into an XML data structure appropriate to the application. This stage can be skipped if the site is able to provide the data directly in this format. In the second stage, another XSL stylesheet transforms the data structure into a VoiceXML document; at this stage we can expand abbreviations and add further information such as speech grammars, confirmation handling, and so on.
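
A heavily simplified sketch of a second-stage stylesheet is given below. It assumes an application-specific data structure whose root element is <newsdata> and whose <story> children carry title and href attributes; these names are our own for the purposes of the sketch, not part of any standard, and the generated VoiceXML is kept minimal.

    <?xml version="1.0"?>
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

      <!-- Map the whole data structure onto a single VoiceXML menu -->
      <xsl:template match="/newsdata">
        <vxml version="1.0">
          <menu>
            <prompt>Which story would you like to hear?</prompt>
            <!-- One spoken choice per story in the data structure -->
            <xsl:for-each select="story">
              <choice>
                <xsl:attribute name="next">
                  <xsl:value-of select="@href"/>
                </xsl:attribute>
                <xsl:value-of select="@title"/>
              </choice>
            </xsl:for-each>
          </menu>
        </vxml>
      </xsl:template>
    </xsl:stylesheet>

A second stylesheet over the same data structure can generate the WML presentation, which is what keeps the two presentation layers consistent with each other.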

While the transcoding approach has the advantage that it can provide speech access to WAP services using today's standards and technology, it is clearly unsuitable as a general approach: transforming WML pages into data structures is inherently content-specific. The more 'correct' model, as used in the second stage, is one where we start with a data model and use XSL stylesheets to specify the presentation layer for each medium.

Synchronization of WAP and VoiceXML Browsers

Using today's GSM technology, there is no obvious way to provide mobile phone users with a multimodal interface combining simultaneous voice and display interaction. Two developments will change that.

While both developments properly require a multimodal markup language, we have some experience of simulating the second development using a 'co-browser' approach with a VoiceXML browser in the network and a WAP browser in the phone.

In our simulation, the service is defined in terms of a sequence of XML data structures, together with XSL stylesheets for mapping these into VoiceXML and WML respectively: each WAP screen corresponds to a dialog state, and vice versa. The browsers are synchronized with each other by means of a synchronization data structure which specifies URI correspondences between the markup documents; for example, if the WAP browser is at /services/sports.wml, then the voice browser should be at /services/sports.vxml. Such synchronization is 'loose' in the sense that the browsers have no knowledge of each other (so there are no synergetic multimodal effects such as 'select that link') and their states are typically only synchronized when they call back to the server.

In order to perform synchronization, both the VoiceXML platform and the WAP browser need to be pushed 'change of location' instructions when the user has acted through the other browser: for example, if the user selects a headline using the WAP browser and this causes the WAP browser to load a page describing the news story, then the VoiceXML browser needs to be instructed to load the corresponding VoiceXML page. In WAP 1.2, the WAP browser can be instructed to change location using its push mechanism. For VoiceXML, our platform can be instructed to load another page, thereby canceling the current dialog.
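
A sketch of the two pieces involved is given below: a synchronization data structure listing the URI correspondences, and the WAP 1.2 Service Indication used to push the 'change of location' to the phone. The sync-map vocabulary is our own invention for illustration, the host name is fictitious, and only the si/indication elements come from the WAP Push Service Indication content type.

    <!-- URI correspondences between the two presentations of the service -->
    <sync-map>
      <state id="sports">
        <wml  uri="/services/sports.wml"/>
        <vxml uri="/services/sports.vxml"/>
      </state>
      <state id="stocks">
        <wml  uri="/services/stocks.wml"/>
        <vxml uri="/services/stocks.vxml"/>
      </state>
    </sync-map>

    <!-- WAP Push Service Indication telling the WAP browser to move to
         the page corresponding to the voice browser's new state -->
    <si>
      <indication href="http://wap.example.com/services/stocks.wml">
        Stock prices
      </indication>
    </si>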

The overall effect is that the user is able to experience a multimodal interface to services. By saying a command like 'stock prices', the relevant stock prices are simultaneously presented on the mobile display and read out. While our co-browser is only a simulation, it does demonstrate that at least simple multimodal interfaces can be built for the next generation of networks using VoiceXML and WAP browsers with 'push' capability, and a service design with a clear separation between data and presentation. One nice feature is that the different media services can be created and used independently of each other --- the only connection is the synchronization data structure.

Dialog Markup Language: Status of Requirements and Specification

The W3C Voice Browser WG has been developing requirements and specifications for a Speech Interface Framework. One part of this activity has concerned speech dialog with DTMF and speech input and with audio and speech output (i.e. not specifically multimodal dialog).

Our activity began by formulating requirements on a speech dialog specification language called 'DialogML'. The requirements document was published in December 1999. In May 2000, W3C accepted the VoiceXML 1.0 submission as a basis for the DialogML specification. Our work on this specification falls into three main categories: