The Rise of the Multimodal Web

Andrew Scott, Telstra

Date: 22nd August, 2000


The convergence of web-browsing, phone browsing, and voice browsing sounds promising. However, care must be taken to ensure that users will have effective and usable experiences. A requirement for this is that developers will be able to produce such applications easily.

The conventional wisdom that content can easily be separated from presentation is misleading, and does not give developers an easy way to produce applications of the required quality. Multi-modal devices are likely to evolve from devices like WAP phones, and adding a more advanced "alternative representation" mechanism to WAP applications could be a step towards what developers need.

This workshop should ensure that developers will be able to create appropriate multi-modal applications with ease.


Telstra is the largest telecommunications carrier in Australia, and participates in diverse businesses such as cable television, Internet service provision, fixed and mobile telephony, white/yellow pages, and mobile data services, including WAP. Our WAP service was publicly launched in December 1999, although we had been trialling WAP and pre-WAP services for many months before that.

Today there is much excitement surrounding phone-based web browsers and speech recognition. In the rush to converge these technologies, alongside the convergence of markets, there is a danger that issues important to the user will be overlooked. It is essential that people find such a convergence usable and effective.

In addition, the standards for enabling this convergence should not be perceived by developers as a "barrier to entry" into content provision businesses. It is the diversity of the web developer community that has made the web into the rich experience that it is.

This document is divided into the three areas of designs, devices, and developers.


"Talking about music is like dancing about architecture" - Various

Sometimes aspects of content cannot be separated from presentation. This may be because the concepts of content and presentation overlap, or it may be because there is no way to have content without presentation. For instance, is the choice of English part of the content or the presentation?

There are many examples of applications on the web today that are designed to take advantage of the characteristics of web browsers (e.g. a high-resolution screen with many colours, a pointing mechanism for input, a text-based input mechanism, and so on). These types of applications cannot be made to work in a compelling way within an environment that lacks those characteristics.

Additionally, there are examples of many businesses that depend on web browser characteristics to keep afloat, and hence continue to provide content to the web community. The most obvious of these are businesses that rely on the revenue from banner ads. Arguably, the ability of web pages to support advertising has created the explosion of content that we now take for granted.

Hence, we should not expect all applications to be "design once, access many ways". Admittedly, we are attempting to bring about convergence of multiple interfaces, such as text navigation (WAP) and speech recognition (VoiceXML), onto a single device. However, we should not assume that the underlying applications will converge into a single design while remaining usable and effective.


Phone-like devices that provide access to web sites are increasingly becoming commonplace. Support for these devices is provided in a large number of countries, and large user bases already exist in some cases. Today, millions of people browse the Internet on a phone.

The phone "form factor" results in a highly constrained application environment. In many ways, these constraints are more significant than the bandwidth to the device or the markup language used to deliver content. They include input via a keypad, a screen capable of legibly displaying only a few lines of text or a couple of pictures, limited memory and processing power (at least at the low end of the market), and the need to make and receive calls. Together, these factors constrain the "user bandwidth": the rate at which information can flow between the user and the device.
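The "user bandwidth" idea can be illustrated with a rough calculation. The words-per-minute figures below are commonly quoted ballpark rates, not measurements, and the averaging assumptions are ours; the point is only the order-of-magnitude gap between input modes:

```python
# Illustrative comparison of "user bandwidth" for different input modes.
# The words-per-minute rates are assumed ballpark figures, not measurements.

AVG_WORD_CHARS = 6  # five letters plus a space, on average

input_rates_wpm = {
    "phone keypad (multi-tap)": 8,
    "desktop keyboard": 40,
    "speech": 150,
}

for mode, wpm in input_rates_wpm.items():
    # Convert words per minute into characters per second.
    chars_per_second = wpm * AVG_WORD_CHARS / 60
    print(f"{mode}: ~{wpm} wpm, about {chars_per_second:.1f} chars/s")
```

Under these assumptions, speech input carries more than an order of magnitude more user bandwidth than a phone keypad, which is one reason to expect multi-modal devices to be attractive.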

It is likely that most of these form-factor constraints will remain with us for at least the next few years. The high end of the market will probably move beyond many of them much sooner; however, it is important to support all customers, and that may involve ensuring backwards compatibility.

We may see the first multi-modal devices in less than six months. These devices would support GPRS and GSM simultaneously, making it possible to browse the web on a laptop while talking on the phone. It will not be long before users of such devices talk and browse "wireless" sites at the same time. Many new applications can be imagined, in which voice control produces visual results, or following links on the screen produces audio content.

In general, speech recognition can improve the usability of applications delivered to phones, because users are often engaged in other activities that occupy their hands and eyes, such as walking, driving, or taking notes. Mobile phone users are often adept at context switching, but they may also have lower rates of literacy than Internet users. It is clear that multi-modal devices will offer improvements over current devices.


Perhaps these types of devices constitute a new form factor, and current speech or WAP applications will need to be re-designed to exploit the devices to their fullest. Are these types of applications closer to voice browser or WAP browser applications?

A good working assumption is that multi-modal devices will be an evolution of the class of devices that WAP phones belong to. According to some estimates, the number of WML pages has already passed four million, and is growing exponentially.

A requirement for the success of multi-modal applications is that there will be enough developers to create enough compelling content so that enough people will want to purchase multi-modal devices. Clearly, it needs to be easy for developers to make content that will be usable for all users. This is a major issue, especially as many slightly differing browsers are appearing.

A candidate tool for this is transcoding. Developers write their content once, in a form that they are familiar with, and it is translated into the appropriate forms for users' devices. However, as mentioned above, a "design once" philosophy is often ineffective or unusable, although it may help to address differences within a single device class.
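To see both the appeal and the limits of transcoding, consider a deliberately naive HTML-to-WML down-converter. The function and its rules below are a hypothetical sketch, not any real transcoding product: it substitutes alt text for images and strips all remaining markup, discarding exactly the presentational characteristics that compelling web applications depend on.

```python
import re

def transcode_html_to_wml(html: str) -> str:
    """Naively down-convert an HTML fragment for a text-only WAP display.

    A real transcoder must also cope with tables, frames, scripts and
    layout; this sketch only shows how much a "design once" translation
    can lose.
    """
    # Replace each image with its alt text, bracketed for visibility.
    text = re.sub(r'<img[^>]*\balt="([^"]*)"[^>]*>', r'[\1]', html)
    # Drop images that provide no alternative text at all.
    text = re.sub(r'<img[^>]*>', '', text)
    # Strip any remaining tags; the presentation they carried is lost.
    text = re.sub(r'<[^>]+>', '', text)
    # Collapse whitespace to suit a screen only a few lines high.
    return ' '.join(text.split())

print(transcode_html_to_wml(
    '<p>Weather: <img src="sun.gif" alt="sunny"> 25C</p>'
))  # prints: Weather: [sunny] 25C
```

Note that the result is usable only because the author supplied alt text; where no alternative was authored, the transcoder can do nothing but delete.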

Another popular mechanism is the provision of alternative representations. Images and tables may have "alt" text to be used when they cannot be displayed, and embedded objects or browser features such as frames can be given alternatives consisting of more complex content. Alternative representations allow content to degrade gracefully on constrained devices, but they are not always easy to use.

However, the current implementations of this mechanism have a number of problems.

Perhaps a more general alternative-representation mechanism could help developers create effective content more easily.
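One shape such a generalised mechanism might take, sketched here with invented names and an invented capability vocabulary, is content negotiation over a set of author-supplied representations: the author lists several renderings of one application, richest first, each tagged with the capabilities it requires, and the gateway serves the best one the requesting device can handle.

```python
# A sketch of a generalised alternative-representation mechanism.
# All names and the capability vocabulary here are hypothetical.

REPRESENTATIONS = [
    # (capabilities required, content to serve), listed richest-first
    ({"colour-screen", "pointer"}, "<full HTML page with image map>"),
    ({"small-screen"}, "<WML card with text menu>"),
    ({"voice"}, "<VoiceXML dialogue>"),
    (set(), "Plain-text fallback"),
]

def select_representation(device_caps: set) -> str:
    """Return the first representation whose requirements the device meets.

    Because representations are ordered richest-first, content degrades
    gracefully, much as alt text does for images, but applied to whole
    applications rather than individual elements.
    """
    for required, content in REPRESENTATIONS:
        if required <= device_caps:
            return content
    return ""

print(select_representation({"small-screen", "keypad"}))
# prints: <WML card with text menu>
```

Unlike today's per-feature alternatives (alt text on images, noframes content for frames), a single ordered list like this would let the developer express the whole degradation path in one place.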


Telstra requires that our customers have access to effective applications with high usability. This workshop should ensure that developers will be able to create appropriate multi-modal applications with ease. Transcoding or other proposals that assume "a single design fits all" are not the whole answer.

Multi-modal applications utilising both voice and text browsing should use WML (or its XHTML equivalent) as a base. Backwards compatibility for the existing body of applications must be considered.