Towards Convergence of WML, XHTML, and other W3C Technologies

Ted Wugofski, Phone.com
Dave Raggett, W3C/HP

Last updated: 31st May 2000

Abstract

This letter describes some common features found or proposed in the Wireless Markup Language (WML), VoiceXML, and XForms that are not found in existing W3C technologies. It is proposed that a joint team be created under the auspices of the W3C-WAP Coordination Group to investigate these features and determine whether a common World Wide Web solution is viable.

Introduction

Where the web was once dominated by computer-type client devices driven by keyboards and large displays, there is a recent trend of providing web access on non-computer client devices using alternative input and output mechanisms. This movement to alternative devices has created new functional requirements that have driven the need for new markup languages -- WML for mobile phones [WML] and VoiceXML for voice-centric devices [VXML] are just two examples.

Some of these functional requirements are based on economic conditions that may change over time. For example, WML significantly reduced the number of elements in its markup language (compared to HTML) in order to reduce memory requirements (which drive the cost of RAM). In many cases, these economic conditions will change and we will see greater convergence to W3C standards.

Other functional requirements are based on usage paradigms -- people use a mobile phone differently than a personal computer, and voice presentation and interaction is different from graphical presentation and mouse input. The functional requirements that these usage paradigms drive are highly unlikely to change.

Since some of these requirements are likely to persist, the question becomes: are there enough common requirements to drive new features into XHTML (or other W3C technologies)? XHTML has clearly anticipated that this might be the case and has spent the last few years preparing for it through modularization.

Not only do we believe that this is the case, we have found that some new work planned by the W3C may also lead in the direction already established by the WML and VoiceXML technologies. We will demonstrate this with XHTML and XForms.

There are four primary areas in which we believe there needs to be convergence to new technologies:

  1. application contexts
  2. dialog-style interaction models
  3. markup based event processing
  4. variables and expressions

As it turns out, many of these requirements are driven by the fact that web-based content is increasingly becoming the "killer application" that drives the utility of these devices. No longer is the "user agent" the application; it is the "stuff" rendered in the user agent. In fact, in most cases the user agent is transparent and the content has edge-to-edge control of the user interaction (this is especially true in the voice space).

We will look briefly at these four new technology areas and then make a recommendation on how to proceed. The market needs for these technologies are pressing, and we need to work quickly to begin aligning these activities. It may not be necessary for everyone to stand still until a complete solution is at hand; it may be acceptable to get everyone onto a roadmap on which their solutions converge.

1. Application Contexts

The first area of similarity is how VoiceXML, WML, and even XHTML have mechanisms for sharing data (and state) between documents.

1.1 Applications in VoiceXML

In VoiceXML [VXML], a set of documents is called an application. All documents in an application share what is called an application root document.

When a document is loaded, the document's application root document is also loaded. The document's application root document remains loaded while the user interacts with the document and while the user transitions from one document in the application to another document in the same application.

In VoiceXML, the application root document is used to store information (variables and grammars) that can be shared or active between documents in the same application.
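As an illustrative sketch, assuming VoiceXML 1.0 syntax (the file names and variable are hypothetical), a leaf document names its application root via the application attribute on the <vxml> element, and the root document holds the shared state:

```xml
<!-- app-root.vxml: the application root document -->
<vxml version="1.0">
  <!-- this variable is shared by every document in the application -->
  <var name="username"/>
</vxml>

<!-- leaf.vxml: a document in the same application -->
<vxml version="1.0" application="app-root.vxml">
  <form id="welcome">
    <block>
      <!-- the shared variable remains available while this document is active -->
      <assign name="application.username" expr="'guest'"/>
    </block>
  </form>
</vxml>
```

When the user transitions to another document that names the same application root, the root (and its variables) stays loaded.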

1.2 Contexts in WML

WML does not have an explicit notion of an application, but it does have the notion of a context.

WML state is stored in a single scope known as the browser context. In WML, all documents share this single browser context.

When a deck (the WML name for a document) is first loaded, it can initialize the browser context to a well defined state. When the last deck in an interaction sequence is unloaded, the browser context can similarly be re-initialized. This initialization of the context essentially segments it into independent subcontexts.

In WML, the browser context is used to manage user agent state (including variables and other information).
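A sketch of this initialization, assuming WML 1.1 syntax: the newcontext attribute on a card clears the browser context on entry, beginning a fresh subcontext.

```xml
<?xml version="1.0"?>
<!DOCTYPE wml PUBLIC "-//WAPFORUM//DTD WML 1.1//EN"
  "http://www.wapforum.org/DTD/wml_1.1.xml">
<wml>
  <!-- entering this card re-initializes the browser context,
       discarding all previously set variables -->
  <card id="start" newcontext="true">
    <p>Welcome</p>
  </card>
</wml>
```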

1.3 Sharing Data in XHTML

XHTML (like its ancestor HTML) has no explicit notion of applications or contexts, but it does have a mechanism for sharing data between multiple documents. HTML introduced a mechanism called frames that allows authors to present documents in multiple views. For example, within a single window the author could display three different documents.

This mechanism defined two basic concepts: the frameset and the frame. A frame is a single document that is displayed in the window. The frameset is the XHTML document that describes the relationship of the frames and how they are displayed in the window.

A second feature of framesets is that they can be used to share data among several frames. This mechanism permits documents to share state information, with that state stored in the frameset document.

Therefore each of the frame documents may have an explicit data sharing relationship with the frameset document. If the user navigates to another frameset, then a new frameset and its frame documents are loaded.

There are many instances where framesets are used even when only a single document is visible. From the user's perspective it appears that each document has its own context, but in reality a frameset is being used to store information (variables and JavaScript procedures) shared by related documents.
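A minimal sketch of this idiom (the document names, the cart variable, and the addItem function are hypothetical): the frameset document holds the shared state, and each frame reaches it through its parent window.

```html
<!-- app.html: the frameset document holds the shared state -->
<html>
  <head>
    <title>App</title>
    <script type="text/javascript">
      var cart = [];                                 // state shared by all frames
      function addItem(item) { cart[cart.length] = item; }
    </script>
  </head>
  <frameset cols="30%,70%">
    <frame name="nav" src="nav.html">
    <frame name="main" src="main.html">
  </frameset>
</html>

<!-- inside nav.html or main.html, a frame document would call: -->
<!--   parent.addItem("book");                                  -->
```

The shared state survives as long as the frameset document itself stays loaded, which is why the frameset acts as a de facto application context.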

1.4 Contexts in XForms

W3C's work on XForms [XFWG] is focusing on separating the presentation from the application data. The requirements for XForms identify the need to support multiple pages per form, and multiple forms per page. This relates to the need for application context and dialog model.

1.5 Common Themes

As can be seen in these three markup languages, there is a similar need for storing state information between multiple documents. In the case of VoiceXML and XHTML, an explicit document was created for storing this information -- VoiceXML called it the application root document and XHTML called it the frameset document. In the case of WML, an implicit entity was created for storing this information -- the browser context.

For the sake of clarity, we will call this explicit or implicit entity the application context. As we have explained, each of these three markup languages has mechanisms for establishing new application contexts.

In the case of VoiceXML and XHTML, application contexts are implicitly initialized (or more appropriately, new application contexts are selected) when the user navigates to a document outside the current application context. This works because the content author has explicitly defined which documents are within an application context.

In the case of WML, application contexts are explicitly initialized by a document. This is required because the content author does not explicitly define which documents are within an application context.

In summary, there is a need for an application context: an explicit or implicit scope, shared by a set of related documents, in which state can be stored and re-initialized.

2. Dialogs

The second area of similarity is how VoiceXML, WML, and XForms have dialog-like interaction models. A dialog interaction model is one in which a presentation is made to the user and then the user is asked to provide feedback which drives navigation to the next presentation.

2.1 Forms in VoiceXML

In VoiceXML, a document (or set of documents) forms a conversational finite state machine. The user is always in one conversational state (or dialog) at a time.

There are two types of dialogs in VoiceXML: forms and menus. A form defines an interaction that collects values. A menu presents the user with options and then transitions to another dialog based on the user's choice.

The capability to transition between dialogs is accomplished without using a true scripting language; rather, a very lightweight set of procedural elements are used to go to other dialogs, prompt the user, or submit content to the server.
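A sketch of such a transition, assuming VoiceXML 1.0 syntax (the dialog names and prompts are hypothetical): a menu routes the user to one of two forms without any scripting.

```xml
<vxml version="1.0">
  <menu id="main">
    <prompt>Say news or weather.</prompt>
    <!-- each choice names the next dialog directly in markup -->
    <choice next="#news">news</choice>
    <choice next="#weather">weather</choice>
  </menu>
  <form id="news">
    <block><prompt>Here is the news.</prompt></block>
  </form>
  <form id="weather">
    <block><prompt>Here is the weather.</prompt></block>
  </form>
</vxml>
```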

2.2 Forms in XForms

W3C's work on XForms [XFWG] is focusing on separating the presentation from the application data. The group has functional requirements for presenting a form to the user and then, based on the user input, automatically navigating to another form. There is a strong belief that a very lightweight scripting language is required to do simple inter-form navigation and field validation.

2.3 Cards in WML

In WML, data is structured into cards, where each card represents a single interaction with the user. The user's interaction with the card depends on the type of content in the card and how the card is rendered by the user agent.

WML also has a lightweight set of procedural elements which provide a way for the author to automatically control navigation to other cards.
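These procedural elements can be sketched as follows, assuming WML 1.1 syntax (the card names and labels are hypothetical): a <do> element binds the accept key to a <go> task that navigates to the next card.

```xml
<wml>
  <card id="question">
    <!-- the accept key is bound to a task that navigates to the next card -->
    <do type="accept" label="Next">
      <go href="#answer"/>
    </do>
    <p>Ready for the answer?</p>
  </card>
  <card id="answer">
    <p>Here it is.</p>
  </card>
</wml>
```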

2.4 Common Themes

As seen above, each of these solutions demonstrates a relatively simple finite state machine in which each step determines the next state in the sequence. This is in stark contrast to the document interaction model, in which there is seldom an attempt to define a finite state machine and a document makes no attempt to determine the next document in a sequence.

In summary, there is a need for a dialog-style interaction model: a lightweight finite state machine in which each presentation determines the next step in the sequence, supported by simple procedural elements rather than a full scripting language.

3. Events

The third area of similarity is how each of these languages has the need for extensible event handling in the markup and, possibly, an event model that is perhaps different from that in the DOM.

3.1 Events in VoiceXML

VoiceXML permits events to be asserted when the user does not respond or does not respond intelligibly, when the user requests help, when semantic errors are found in a document, or when the document itself asserts events (using an element called <throw>).

When a document asserts an event, the event type can either be one of a set of pre-defined event types, or the event type can be application-defined (i.e., on the fly).

The event model is such that when an event is asserted, it is handled by the best qualified element in the current scope (the active dialog). In addition, each occurrence of an event can be handled by a different element -- the event handling mechanism permits different handlers for the first occurrence of an event, the second occurrence, and so on.

Finally, VoiceXML permits partial matching of event types. For example, the "telephone.disconnect" event handler will match the "telephone.disconnect.transfer" event. This permits common event handling behavior to be specified at any level in the dialog, where it applies to all descendants.

Event handlers consist of a set of very lightweight procedures for prompting or interacting with the user, navigating to other dialogs, or changing state information.
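A sketch of these mechanisms, assuming VoiceXML 1.0 syntax (the field, prompts, and event name are hypothetical): <catch> handlers with a count attribute react differently to successive occurrences, and <throw> raises an application-defined event.

```xml
<form id="survey">
  <field name="age">
    <prompt>How old are you?</prompt>
    <!-- first time the user stays silent -->
    <catch event="noinput" count="1">
      <prompt>Please say your age.</prompt>
    </catch>
    <!-- the second occurrence gets a different handler -->
    <catch event="noinput" count="2">
      <prompt>For example, say thirty five.</prompt>
    </catch>
  </field>
  <block>
    <!-- an application-defined event, invented on the fly -->
    <throw event="com.example.survey.done"/>
  </block>
</form>
```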

3.2 Events in WML

WML provides a mechanism for authors to bind event handlers (called tasks) to events. Events are generated by the user agent and by timers that can be created by the authored content. When the event is asserted, the event handler that corresponds to the event is executed.

Unlike VoiceXML, WML restricts the event type to a set of pre-defined (called intrinsic) events, although the intrinsic events do allow for vendor-specific extensions.

The event model is such that when an event is asserted, it is handled by the current scope (the active card). WML also allows authors to create event handling templates, which appear at the top of a deck (outside the document tree of the current card).

Event handlers are generally navigational tasks -- when an event occurs, the user agent navigates from one card to another. Other tasks include refreshing the card or ignoring the event altogether.
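A sketch of a deck-level template and a timer-driven navigation, assuming WML 1.1 syntax (the card names are hypothetical): the <template> handler applies to every card unless a card overrides it, and the timer asserts an event that navigates to the next card.

```xml
<wml>
  <!-- deck-level handlers apply to every card unless overridden -->
  <template>
    <do type="prev" label="Back">
      <prev/>
    </do>
  </template>
  <!-- after 5 seconds (the timer counts in tenths of a second),
       the ontimer event navigates to the second card -->
  <card id="first" ontimer="#second">
    <timer value="50"/>
    <p>Loading...</p>
  </card>
  <card id="second">
    <p>Done.</p>
  </card>
</wml>
```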

3.3 Events in XForms

XForms is expected to use the model-view-controller paradigm for binding the user interface to the application data. The event model has yet to be worked out. In principle, events could be triggered by changes to the data, by the user interface, or from outside the document.

3.4 Common Themes

As seen above, each of these solutions demonstrates an event model that corresponds to their dialog interaction model: the current dialog generally isolates event handling, although the document may provide templates or generic event handlers as well.

In all cases, events can be generated from outside the document (through user interactions, primarily) and from inside the document (through timers or application-defined event types).

We also see that in response to events, there is a need for a set of simple tasks related to navigating between dialogs and managing state information.

In summary, there is a need for markup-based event processing: an extensible mechanism for binding lightweight, declarative handlers to events raised by the user agent, by timers, and by the document itself.

4. Variables

The fourth area of similarity is how each of these languages has the need for variables for storing stateful information.

4.1 Variables in VoiceXML

VoiceXML supports variables that, according to [VXML], are in all respects equivalent to ECMAScript variables. Authors declare and initialize variables using the <var> and <assign> elements.

Variables can be declared in one of five different scopes: session, application, document, dialog, and element.

Variables are used for doing very simple arithmetic computations, conditional control of content presentation, and assignment to fields.
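A sketch of these uses, assuming VoiceXML 1.0 syntax (the form and variable names are hypothetical): a dialog-scoped variable is declared, incremented, and used to condition presentation.

```xml
<form id="counter">
  <!-- declare and initialize a dialog-scoped variable -->
  <var name="count" expr="0"/>
  <block>
    <!-- simple arithmetic via an ECMAScript expression -->
    <assign name="count" expr="count + 1"/>
    <!-- conditional control of content presentation -->
    <if cond="count &gt; 0">
      <prompt>The counter is positive.</prompt>
    </if>
  </block>
</form>
```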

4.2 Variables in WML

WML supports variables that can be used in place of strings and are substituted at run time with their current values. Authors declare and initialize variables using the <setvar> element. The values of variables can be substituted into both text (#PCDATA) and certain attributes.

Variables can be declared in one of three scopes: session, application, and document (through WML's access control facilities).

Variables are used primarily for run-time control of navigation (by using variables in place of static URLs) and presentation. WML does not provide facilities for simple arithmetic computations.
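A sketch of declaration and substitution, assuming WML 1.1 syntax (the card names and variable are hypothetical): <setvar> sets a variable as part of a navigation task, and $(city) is substituted wherever it appears.

```xml
<wml>
  <card id="setup">
    <do type="accept" label="Go">
      <go href="#show">
        <!-- the variable is set as part of the navigation task -->
        <setvar name="city" value="Paris"/>
      </go>
    </do>
    <p>Press Go.</p>
  </card>
  <card id="show">
    <!-- $(city) is replaced with its current value at run time -->
    <p>You chose $(city).</p>
  </card>
</wml>
```

The same $(...) substitution in an href attribute is what gives WML run-time control of navigation without static URLs.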

4.3 Variables in XForms

XForms allows authors to define hierarchically structured data which can be manipulated from scripts as objects. XForms makes it easy to specify computed values in terms of expressions over other data. You can also specify dependencies between data, e.g. that if a particular field has a given value, then other fields need to be filled out by the user. You can also constrain data values e.g. to match string patterns. The ability to specify data types makes it practical for implementations to offer special tools for selecting dates etc, and greatly reduces the need for complex scripting for checking that the form has been correctly filled out.

4.4 Common Themes

As seen above, variables are used for passing run-time data between documents (with session- and application-level scoping) as well as between interactions within a document (see the previous section on dialogs).

Variables are also used to tightly integrate simple, repetitive, procedural programming (and event-driven procedures) without requiring application programming interfaces. In fact, there appears to be a need for procedural elements that operate directly on these variables.

In summary, there is a need for variables and expressions: a lightweight, markup-level mechanism for declaring, scoping, substituting, and computing with run-time values.

5. Other Similar W3C Technology Areas

The Voice Browser working group [VBWG] is working on a dialog markup language for interactive voice response applications, using VoiceXML as a baseline. This group is also looking at requirements for multi-modal systems with a view to catering for these in the dialog markup language.

The HTML working group [HTMLWG] is focusing on completing work on XHTML modularization. XHTML modules provide a basis for marking up common document idioms such as headings, paragraphs and lists.

W3C's work on XForms [XFWG] is focusing on separating the presentation from the application data. The requirements for XForms identify the need to support multiple pages per form, and multiple forms per page. This relates to the need for application context and dialog model.

Work on CC/PP [CCPPWG] aims to provide a means to describe device capabilities and user preferences. This can be used to adapt content to match device capabilities and user preferences.

Work on the document object model [DOM] aims to provide a programming interface for accessing and manipulating markup, style, and other aspects of documents.

6. Our Recommendation

It is clear that there are a variety of languages that wish to bring W3C technologies into new problem domains. Several of these languages have introduced (or are in the process of introducing) facilities not currently found in XHTML and related technologies. We have identified four of these:

  1. application contexts
  2. dialog-style interaction models
  3. markup based event processing
  4. variables and expressions

We recommend that the W3C task the W3C-WAP coordination group to explore these four areas and:

  1. identify whether they can share a common solution to their needs
  2. identify a process or mechanism for arriving at that solution within the W3C organization and process

We recommend that this be done quickly, given the current parallel efforts being undertaken in the W3C, WAP Forum, and VoiceXML forum.

References

[CCPPWG]
W3C work on Composite Capabilities and Personal Preferences. The mission of this working group is to develop an RDF-based framework for the management of device profile information.
[DOM] (Document Object Model)
W3C work on the document object model aims to provide a language neutral programming interface for accessing and manipulating document properties.
[HTMLWG]
The HTML Working Group has reformulated HTML in XML (XHTML) and is currently working on modularization, as the basis for combining XHTML with other document types.
[VBWG]
The Voice Browser Working Group, which works on markup for authoring interactive voice response systems, and multi-modal systems which combine voice interaction with displays, pointing devices and keypads.
[VXML]
VoiceXML 1.0. Developed by the VoiceXML Forum, this is a dialog markup language for interactive voice response systems; it has been submitted to W3C as a starting point for W3C's work on dialog markup languages.
[WML]
Wireless Markup Language, see http://www.wapforum.org
[XFWG]
XForms: W3C's work on next generation Web forms. This aims to separate the presentation from the application data and logic, allowing the latter to be reused for a wide variety of different devices.