Bennett Marks, Nokia
The natural convergence of audio modalities and visual modalities in the mobile arena has lagged behind the Internet in general. The economic drivers in the mobile marketplace have given us (the mobile manufacturers) the opportunity to allow a number of technologies to mature before committing to a particular path. Nokia sees huge potential benefits that may come from the synergies of a multi-modal interface. But to realize this benefit in the mobile space, a number of issues (some unique to mobile, some not) need to be well researched. These concerns fall into three broad areas, outlined below.
Nokia is hoping to gain some insights into these issues at this workshop.
The mobile space has made strides toward converging with the Internet. At the application level there is consensus that XML is the basis for presentation on mobile devices, as it is on the PC. While the industry has made commitments regarding textual markup, no decision has yet been made committing voice or graphics to a particular markup in the mobile space. Given the work on XHTML modularization and related efforts, we would like to explore whether, and if so how, multiple markups can be combined into a cogent flow. Does it make sense to use the various existing markup languages, each designed around a particular modality, in a combined or mixed syntax, or do we expect to need a new markup better tuned to the issues of multimodal presentation?
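One way such a mixed syntax could look is sketched below, assuming XML namespaces are used to keep the vocabularies distinct. The element names are drawn from XHTML and VoiceXML, but the pairing itself is hypothetical, not a standardized profile:

```xml
<!-- Hypothetical sketch: an XHTML page carrying a VoiceXML fragment,
     with XML namespaces partitioning the two vocabularies. -->
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:vxml="http://www.w3.org/2001/vxml">
  <body>
    <form action="/order">
      <p>Select a city:</p>
      <input type="text" name="city"/>
    </form>
    <!-- The same field, filled by voice instead of the keypad -->
    <vxml:form>
      <vxml:field name="city">
        <vxml:prompt>Which city?</vxml:prompt>
      </vxml:field>
    </vxml:form>
  </body>
</html>
```

A design question such a sketch immediately raises is how the two forms are synchronized, i.e. whether a spoken answer populates the visual field, which is exactly the kind of cross-modality semantics that neither markup defines on its own.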
A number of the technologies required for speech recognition, text-to-speech, and visual presentation management are computationally expensive in a mobile architecture. ETSI and other organizations are looking into the standardization of schemes such as DSR (distributed speech recognition) to divide the computational load effectively. This, combined with the limitations imposed by the current generation of mobile bearers, makes architectural decisions, such as the location of computation elements, execution elements, and presentation transformation, visible to the application. Decisions such as bearer selection and PEP routing may need to be made by applications. There is a strong rationale for creating a canonical model of the architectural choices and standardizing the decision syntax. At present, technologies such as VoiceXML make assumptions about the implementation architecture that may cause interoperability issues in deployment. Nokia is interested in understanding these issues, and in helping to isolate architectural concerns from presentation mechanisms.
A number of markup languages, such as XForms, WML, and VoiceXML, have recognized the value of language elements that help generate a more directed dialog rather than the freeform hypertext approach traditionally found on the Internet. The reasons for this approach, essentially browsing versus a directed-task model, have been discussed widely. Nokia is interested in understanding whether there is commonality in requirements across the domains of the wired and wireless Internet, and what effect multi-modal interfaces might have on this issue.
There is a belief that multi-modal interfaces will lead to increased interaction fidelity. This notion is unproven in the mobile space. Nokia is interested in investigating any attempts at quantifying interaction fidelity. While there are numerous use cases showing that content presented in its "original" format has the highest fidelity, there are few if any use cases that attempt to prove that simultaneous multi-modal presentation increases end-user understanding or measurably improves the end-user's interaction. Examples of use cases that could be discussed are:
Presentation of music to an end-user
Multi-user team gaming
The types of end-user devices in the wireless space are increasing daily. This divergence is having a profound effect on the man-machine interface (MMI). The requirements introduced by multi-modal interfaces (e.g. simultaneous voice and image) change the fundamental interaction that has characterized the mobile phone; it is more than just adding an earphone to the handset. There is a new set of ergonomic conflicts that must be resolved. Nokia is interested in discussing how relevant W3C and WAP standards activities might address some of these issues.
Bearer issues in the mobile space strongly influence the cost equation of delivering multimodal content. While this is a mostly practical concern, Nokia would like to work with WAP Forum and W3C on standardizing interactions with the underlying transports to allow economically sensitive implementations to be brought to market.
A proliferation of devices and capabilities will make content adaptation an economic necessity. Beyond XSL, what role will the work being done to capture semantic intent play in content adaptation? Specifically, how can multi-modal devices be served by the content adaptation techniques currently being developed in W3C?
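To make the XSL case concrete, the sketch below shows one stylesheet adapting the same source content to either a visual or a voice rendering, selected by a parameter. The source vocabulary (a "question" element) and the "modality" parameter are invented for illustration; the XSLT elements themselves are standard:

```xml
<!-- Hypothetical sketch: a single XSLT 1.0 stylesheet that renders a
     source <question> element either as visual markup or as a voice
     prompt, depending on a supplied "modality" parameter. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Supplied by the adaptation server, e.g. from device capabilities -->
  <xsl:param name="modality" select="'visual'"/>

  <xsl:template match="question">
    <xsl:choose>
      <xsl:when test="$modality = 'voice'">
        <prompt><xsl:value-of select="."/></prompt>
      </xsl:when>
      <xsl:otherwise>
        <p><xsl:value-of select="."/></p>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```

What such a transform cannot do on its own is decide which modality, or which combination of modalities, best preserves the author's intent; that is the gap the semantic-intent work would need to fill.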