The markup way to multimodal toolkits

by June 11th, 2004

Stéphane Sire and Stéphane Chatty
{sire, chatty}

IntuiLab has practical experience in the design and implementation of prototypes of multimodal applications. For instance, we develop intuitive direct manipulation user interfaces for air traffic control, based on gesture interaction on a touch screen and on animated feedback. We are also involved in the car industry and in European Commission-funded projects. Our multimodal stock exchange user interface prototype runs on a Pen Tablet [1]. It illustrates some of the features that we think are going to turn multimodal interaction into a foundation layer for post-WIMP (Windows, Icons, Menus, Pointer) user interfaces:

And of course all these characteristics should be extended to take into account more input modalities, like GPS, and other sensors, like RFID tags or tilt sensors, that allow the design of context-aware user interfaces. It is likely that mobile appliances, with their limited screen real estate, will become the first devices to fully benefit from multimodal interaction techniques. For that reason we are participating in the Use-Me.Gov European project [2], to design user interfaces for public services on smartphones and/or PDAs. This project has already resulted in the definition of several use cases of services, from which it is possible to imagine scenarios that would extend the multimodal interaction use cases [3].

So far, IntuiLab has been developing its own toolkit for prototyping purposes: IntuiKit. As explained in this article, we prefer to develop our own toolkit instead of prototyping in Web browsers, because we address innovative interaction techniques for which markup languages are not necessarily readily available.

Two trends are converging in the programming of user interfaces. The first one is the definition of more and more markup language models, which progressively cover more and more facets of user interface programming. The second one is the progressive integration of markup languages into toolkits based on programming languages. This is the case, for instance, when SVG is used to define interchangeable skins for an application.

Our position is that these trends have not yet reached their full potential. The existing markup language models are going to continue to be extended with more programming-language-like features. The toolkits made with programming languages will be better integrated with markup languages. This will give rise to a new hybrid programming style where multimodal toolkits will offer an API both in a programming language and in XML. For that reason the design of future multimodal toolkits should stay informed by advances in markup languages, and reciprocally the design of markup languages should stay informed by the programming language constructs present in toolkits.

Iterative design process

Our design methodology is iterative. It relies on the ability to quickly build simulations and mockups of user interfaces in order to demonstrate them to end users. The success of this methodology depends on the availability of tools for quick prototyping.

Markup vs. programming language

At first glance, it is feasible to prototype user interfaces with the models of the W3C recommendations for the different input and output modalities, with Javascript for programming the behaviour, and with a compound document model to deploy user interfaces in a browser [4]. This approach is illustrated by the numerous PDA and smartphone simulators in XHTML and Javascript which are available online. This browser-based solution brings the advantage of a solid base of available tools for prototyping. However, it is also a source of limitations when it comes to experimenting with more innovative interaction techniques which have not been standardized yet. For instance, imagine how difficult it would be to test a gestural interaction technique such as marking menus in a browser.

Web browsers and the W3C cannot provide a recommendation for every new interaction technique. This would result in an unmanageable situation with tons of recommendations. So, when it comes to prototyping and testing innovative interaction techniques, we must admit that the browser approach backed by a set of universal recommendations is limited: the support of a programming language is still necessary, beyond scripting in Javascript.

Role of models in team work

Models increase programming efficiency. They can also serve to distribute the work on a user interface between different team members, according to their speciality. The teamwork methodology we have set up over the last few years allows designers (for instance a graphic designer) to share designs with the programmers. On that topic we have reached the conclusion that it is a great advantage to be able to quickly and accurately transfer the designs right into the programmers' code. For the visual modality, the SVG [5] produced by the export function of most drawing applications is a good medium.

The model-based approach puts high stress on the underlying rendering toolkits, which must be powerful enough to follow the evolution of recommendations. For SVG that means developing entirely new graphical libraries, such as TkZinc, which we use in our toolkit [6]. We hope that with models for every modality, it will become possible to develop powerful editors, like Adobe Illustrator for the visual modality, and to ease the programmers' task with the direct integration of the other designers' work (dialog designer, gesture designer, etc.).

SVG is quite helpful for transferring graphical artwork to the mockups. However the SVG model, like most models, does not cover all of the designer's tool palette. For instance the conic gradient and the path gradient are missing. This has reinforced our view that the programming cannot be done solely with the existing models, at least when it comes to testing new interaction techniques.

Multimodal toolkit

Our vision of the future of multimodal user interface programming is that, as more and more models describe each input and output modality (with a modality defined as a channel and a coding of information like a markup language), it will be necessary to develop a core model for integrating these modalities.

This core model should remain modality agnostic: the different modalities are added as required by a particular user interface. The rest of this position paper tries to define the requirements for the core model. Some parts of it have already started to emerge here and there in different W3C recommendations. It seems that the current direction in most Working Groups is to turn markup languages into full-blown application description languages.

The definition of the core model should be oriented towards high usability for programmers and design team members. From an introspective observation of our daily design activity, we have identified the following requirements:

At this point, we do not know whether the core model can be entirely defined with markup languages. This is a question to investigate. Whatever the answer, the previous requirements can be fulfilled with imperative programming languages, and they are partially fulfilled in user interface programming toolkits.


Structuration mechanisms are required to join together the facets (or models) that make up a user interface. This composition defines the user interface architecture. Most W3C models rely on a DOM [7] to describe their internal structure, so the architecture of a multimodal application is a combination of several DOM trees. Currently, as there is no model for the definition of this architecture, it is possible to start with any model, for instance an XHTML tree [8], and to complete it with the other models, such as VoiceXML [9], without a clear rationale about where to put the different parts. This is a domain where the scene graph approach of modern user interface toolkits could leverage the DOM and add an extra control dimension by giving a more precise meaning to the composition order.

The composition of several trees into one user interface tree requires a binding mechanism. Currently, two mechanisms dominate: namespace integration and reference integration. Namespace integration is used in document profiles such as XHTML + Voice [10]. It composes models by encapsulating them in their respective namespaces. This allows for the creation of any type of architecture. Reference integration consists in selecting one tree as the main tree (usually close to a widget tree in classical toolkits), and in expanding parts of it with other sub-trees through a reference mechanism.
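As an illustration of namespace integration, the following minimal sketch parses a small compound document and sorts its elements by namespace, which is how a host could hand each facet to the appropriate processor. The document fragment is a simplified example written for this sketch (it is not a conforming XHTML + Voice document), and the function name split_by_namespace is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

XHTML_NS = "http://www.w3.org/1999/xhtml"
VXML_NS = "http://www.w3.org/2001/vxml"

# Simplified, illustrative compound document: a visual form field
# and a voice prompt living in the same tree, each in its own namespace.
COMPOUND_DOC = f"""
<html xmlns="{XHTML_NS}" xmlns:vxml="{VXML_NS}">
  <head>
    <vxml:form id="say_city">
      <vxml:prompt>Which city?</vxml:prompt>
    </vxml:form>
  </head>
  <body>
    <input id="city" type="text"/>
  </body>
</html>
"""


def split_by_namespace(root):
    """Group element local names by the namespace they belong to."""
    groups = defaultdict(list)
    for element in root.iter():
        # ElementTree spells qualified names as "{namespace}local".
        if element.tag.startswith("{"):
            namespace, local = element.tag[1:].split("}", 1)
        else:
            namespace, local = "", element.tag
        groups[namespace].append(local)
    return dict(groups)


root = ET.fromstring(COMPOUND_DOC)
facets = split_by_namespace(root)
for namespace, names in facets.items():
    print(namespace, names)
```

The sketch shows that the namespaces identify which model each element belongs to, but say nothing about why the voice form sits where it does in the tree.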

For instance, in XBL [11] [12] every element with a class attribute in the application tree can be replaced with a template-based sub-tree that expands this element. The binding is declared through a -moz-binding CSS property attached to that element, which points to the definition of the template with a URI. This mechanism is comparable to the object instantiation and class definition mechanism of object-oriented programming languages. A similar mechanism exists in the SVG + RCC recommendation [13]. It can be applied, for instance, to bind SVG representations to XForms elements.
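The analogy with class instantiation can be sketched as follows. This is not XBL itself: the template registry, the element names and the function expand_bindings are assumptions made for the illustration, and the binding is looked up directly from the class attribute instead of going through a CSS property and a URI.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical template registry, keyed by the value of the class attribute.
# Each template plays the role of a class definition.
TEMPLATES = {
    "slider": ET.fromstring("<g><rect role='track'/><circle role='thumb'/></g>"),
}


def expand_bindings(element, templates):
    """Expand every bound element with a private copy of its template."""
    for child in element:
        expand_bindings(child, templates)
    template = templates.get(element.get("class"))
    if template is not None:
        for node in template:
            # A deep copy gives each element its own instance of the sub-tree.
            element.append(copy.deepcopy(node))


app = ET.fromstring(
    "<window>"
    "<box class='slider' id='volume'/>"
    "<box class='slider' id='zoom'/>"
    "<box id='plain'/>"
    "</window>"
)
expand_bindings(app, TEMPLATES)
print(ET.tostring(app, encoding="unicode"))
```

Both sliders receive the same structure but separate nodes, just as two objects of the same class share a definition but not their state.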


Control structures are the weakest parts of current W3C recommendations. Most of the time control is expressed in a mix of Javascript code and handlers declared with XML Events elements or attributes [14], with several consequences:

The second point is not a limitation in a pizza ordering application based on form entries, because there is only one flow of control that passes focus from one entry to the next one. However it becomes more confusing in a direct manipulation user interface where several objects can be manipulated concurrently, with different modalities in parallel.

Recent W3C recommendations are beginning to introduce explicit declarations of control structures, like the repeat element in XForms [15], which iterates over a set of data to generate the corresponding presentation. Without adding the complexity of a full XSLT transformation [16], this is obviously a step in the direction of programming languages. We should be careful to identify the most common control structures, and to factor them out so that they can be shared between models. For instance, in our toolkit we have introduced explicit finite state machines (FSM) which control which sub-trees of a tree are activated.
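The idea of a finite state machine that selects the active sub-trees can be sketched as below. This is a minimal illustration under our own assumptions, not the IntuiKit API: the Node and SubtreeSwitch classes, the state names and the event names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node; inactive nodes would be skipped by the renderer."""
    name: str
    active: bool = True
    children: list = field(default_factory=list)


class SubtreeSwitch:
    """Finite state machine in which each state activates exactly one sub-tree."""

    def __init__(self, initial, branches, transitions):
        self.branches = branches        # state name -> Node
        self.transitions = transitions  # (state, event) -> next state
        self.state = initial
        self._apply()

    def _apply(self):
        for state, branch in self.branches.items():
            branch.active = (state == self.state)

    def handle(self, event):
        """Follow the transition for this event, if one is declared."""
        next_state = self.transitions.get((self.state, event))
        if next_state is not None:
            self.state = next_state
            self._apply()
        return self.state


idle = Node("idle_feedback")
dragging = Node("drag_feedback")
button = Node("button", children=[idle, dragging])

fsm = SubtreeSwitch(
    "idle",
    {"idle": idle, "dragging": dragging},
    {("idle", "press"): "dragging", ("dragging", "release"): "idle"},
)
```

Because the transitions are plain data, the same table could equally be declared in markup and shared between models.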

Event definition

An extensible event definition mechanism is needed for communication between user interface objects created with present and future models. This mechanism should be extensible so that, when a new input or output channel is created, it is not necessary to rewrite the event model or the models that interact with these events. If, as in the EMMA recommendation [17], a distinction is made between the event data model (interpretation) and the event annotations (rdf:Description), both should be extensible. For instance, in a multi-user interface, it may be necessary to identify the user that triggered an event. In EMMA it is not clear whether this should be added to the event data model or to the annotations. Similarly, an event model should handle situations where a device (software or hardware) is used to simulate another one, such as simulating a keyboard with a virtual keyboard.
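A minimal sketch of such a mechanism, assuming open-ended dictionaries for both the event data and the annotations, could look like this. The Event and EventBus classes and the annotation keys "source" and "user" are hypothetical and are not taken from EMMA; placing the user in the annotations is one possible choice, not a resolution of the question raised above.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Event:
    kind: str
    data: dict = field(default_factory=dict)         # interpretation of the input
    annotations: dict = field(default_factory=dict)  # metadata about the input


class EventBus:
    """Dispatches events by kind; new kinds need no change to this class."""

    def __init__(self):
        self._handlers = {}

    def subscribe(self, kind: str, handler: Callable[[Event], None]) -> None:
        self._handlers.setdefault(kind, []).append(handler)

    def publish(self, event: Event) -> None:
        for handler in self._handlers.get(event.kind, []):
            handler(event)


received = []
bus = EventBus()
# The handler reads only the data, so it cannot tell which device produced it.
bus.subscribe("key", lambda event: received.append(event.data["char"]))

bus.publish(Event("key", {"char": "a"},
                  {"source": "hardware-keyboard", "user": "alice"}))
bus.publish(Event("key", {"char": "b"},
                  {"source": "virtual-keyboard", "user": "bob"}))
print(received)
```

Here the virtual keyboard simulates the hardware one simply by publishing the same kind of event with a different annotation.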


There is no clear place for defining the application data. Most of the time the data is also available from outside of the application in files, databases, or requested on the fly through Web services. Very often data is replicated in the functional core of the application and in one or more widgets that present it.

As explicit data models are appearing in several W3C recommendations like XForms and EMMA, it would be nice to share the same models.

Reusable components

Most user interfaces, even post-WIMP user interfaces, are made of visual components, interactive behaviours and internal control structures that repeat themselves in different parts of the user interface. We have already described the possibility of expanding a given markup element into a sub-tree with a binding mechanism in the section on structuration. In XBL and in other markup languages, this feature comes with the possibility to declare properties which can be associated with attributes of the XML element. It is also possible to declare internal properties which can only be manipulated from a script and cannot be assigned through the XML element; they play the role of private variables in a programming language. With such mechanisms, markup languages are getting closer to the capability of programming languages to define components.
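The distinction between declared and internal properties can be sketched from the programming language side as follows. The SliderComponent class, its property names and its defaults are assumptions made for the illustration and do not reproduce the syntax of XBL or of any other markup language.

```python
import xml.etree.ElementTree as ET


class SliderComponent:
    """Hypothetical component with declared and internal properties."""

    # Declared properties: assignable from attributes of the XML element.
    DECLARED = {"label": "", "value": "0"}

    def __init__(self, element):
        undeclared = set(element.attrib) - set(self.DECLARED)
        if undeclared:
            raise ValueError(f"undeclared properties: {sorted(undeclared)}")
        # Attributes present on the element override the declared defaults.
        self.properties = {**self.DECLARED, **element.attrib}
        # Internal property: reachable from code only, never from markup.
        self._drag_origin = None


slider = SliderComponent(ET.fromstring('<slider label="Volume"/>'))
print(slider.properties)
```

An element that tries to set an undeclared attribute is rejected, which mirrors the role of private variables described above.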

The next evolution would be to declare and implement services for the XML components so defined, which could be triggered from within the control structures previously described. To a limited extent, this is already the case with the script element, which declares some code in a model, and with the XML Events handler mechanism that triggers it. New markup is also sometimes bound to binary components (the concept of code-behind in XAML [18]). This gives a simple bridge between programming languages and markup languages.


Markup language models are changing the way we design and prototype user interfaces. The separation of different modalities into different models simplifies the task of programmers by migrating some of the programming effort into separate tasks that can be supported with specialized editors. However, when it comes to integrating all the facets of a user interface, programming languages still offer services which are not fully covered by markup languages.

We are interested in participating in the Workshop on Multimodal Interaction because we would like to contribute to the definition of the markup languages that will shape the future of multimodal toolkits. In particular, we would like to clarify the way the core model is currently covered in existing models, and the way this coverage could and should be extended.


[1] IntuiLab. The multimodal stock exchange system. Online demonstration (with Flash plug-in). (demo-bourse)

[2] IST Programme 2002- Usability driven open platform for mobile government. European project. (Use-Me.Gov)

[3] W3C. Multimodal Interaction Use Cases. Note 4 December 2002. (Use Cases).

[4] The W3C Workshop on Web Applications and Compound Documents. (papers)

[5] W3C. Scalable Vector Graphics (SVG) 1.2. Working Draft 10 May 2004. (SVG)

[6] Centre d'Etudes de la Navigation Aérienne. TkZinc graphical library. (TkZinc)

[7] Document Object Model. (DOM)

[8] W3C. XHTML 1.0 The Extensible HyperText Markup Language (Second Edition). Recommendation 26 January 2000, revised 1 August 2002. (XHTML)

[9] W3C. Voice Extensible Markup Language (VoiceXML) Version 2.0. Candidate Recommendation 28 January 2003. (VoiceXML)

[10] W3C. XHTML+Voice Profile 1.0. Note 21 December 2001. (XHTML + Voice)

[11] XUL Planet. Introduction to XBL. (XBL)

[12] Mozilla. Gecko Embedding Basics, XUL/XBL. (XUL/XBL)

[13] W3C. Scalable Vector Graphics (SVG) 1.2, section 4 Rendering Custom Content. Working Draft 10 May 2004. (SVG/RCC)

[14] W3C. XML Events, an Events Syntax for XML. Recommendation 14 October 2003. (XML Events)

[15] W3C. XForms 1.0. Recommendation, 14 October 2003. (XForms 1.0)

[16] W3C. XSL Transformations (XSLT) Version 1.0. Recommendation 16 November 1999. (XSLT)

[17] W3C. Requirements for EMMA. Note 13 January 2003. (EMMA)

[18] Microsoft. Longhorn Markup Language (code-named "XAML") Overview. (XAML)

Last updated 11-Jun-2004 17:46:58 CEST