Device Independence and Multimodal Interaction on the Web

W3C

Bert Bos

Part of: Module Interface Homme-Machine
January 20, 2005
ESSI, Sophia-Antipolis

Stéphane Boyera
<boyera @w3.org>
(W3C, France)
Bert Bos
<bert @w3.org>
(W3C, France)

Context

Because:

Approach

Therefore:

W3C doesn't standardize Human-Computer Interfaces, but tries to define levels of abstraction, modules for various kinds of information & services, and the interfaces between them, so that you can replace the human-computer interface for some information or service with as little impact on the other modules as possible.

One classic example is separation of style and structure in documents. Most documents lend themselves well to that. That's why we have separate technologies (modules) for the structure of the content (HTML) and for the style (CSS and XSL).

It should be possible for the reader to choose his own interface to some data, depending on his preferences and the devices he has access to. But in some cases the best results are obtained if the provider of the data and the consumer negotiate the best interface. Some of the work of the Device Independence working group consists of identifying the situations in which negotiation is necessary (or desirable) and developing the protocols for negotiation.

CC/PP and Media Queries are two examples of such protocols. Usually, such protocols are fairly simple. There is only one round of negotiation: either the server gives a list of what it has and the client selects what it wants, or the client gives a list of abstract preferences and the server sends what best matches that list. A bit of both is also possible.

Kinds of interfaces

But also:

One particular kind of human-computer interface we want to support is intelligent agents. A person may have his own interface to the data in the form of an agent that, on the one hand, knows the user, and on the other hand, knows how to surf the Web. The user can, e.g., talk to the agent in natural language, and the agent will search and aggregate information.

For that to work, either the intelligent agent needs to be as intelligent as a human, or the information on the Web needs to be machine-readable and annotated with sufficient semantics. The former is rather a long-term project, the latter is what we call the Semantic Web.

Separating the modules is nice in principle, but it leads to some new challenges, in particular in multi-modal interfaces. For example, it isn't too hard to take a well-designed HTML page and make it usable on a PC screen, on a mobile phone, on a speech browser and when printed. That can usually be done by adding some (partial) style sheets in CSS, or, if the document is complex, in XSL. The person designing the rendering for a mobile phone doesn't need to be the same as the one doing it for the PC screen. But a multimodal interface typically needs communication between devices, or support from the origin server to coordinate the devices.

In a car, e.g., you may interact with a navigation system. Depending on whether you are driving or parked, more or less of the I/O may be done via a screen and buttons or via voice. Thus, you need not only the original information service and a few style sheets for the different devices, but also an interaction manager that can dynamically change the role played by each of the devices.

Multi-everything – the challenge

Imagine you're in a car

[genie]

Multi-media usually refers to simultaneous output in sound, graphics and text. Multi-modal adds the interactive aspect: simultaneous, coordinated input via multiple devices.

Imagine, e.g., that you have a device with knowledge of the train schedules, that displays a map and accepts both touch screen input and spoken input. It can compute the best route from one place to another. You just have to tell it where you want to go from and where to. The easiest for the user may be to combine modalities: he speaks "from here" and points to the map, instead of selecting a "from here" command from a menu or speaking the name of the place. The device will have to have a “parser” that accepts input from both modalities.
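As a rough sketch of what such combined input could look like once the two modalities have been integrated (this uses the EMMA-style markup introduced later in this talk; the x: element names and the values are made up for illustration):

<interpretation id="combined1" confidence="0.80">
  <!-- "from here" was spoken; the actual position comes from the touch on the map -->
  <x:origin>point 120 45</x:origin>
  <!-- the destination was spoken -->
  <x:destination>Denver</x:destination>
</interpretation>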

Imagine you're in a car

Picture: BMW iDrive

Cars are interesting examples of multi-modality, because they exhibit different behavior depending on the circumstances: normal driving, dangerous situations, or standing still.

This photo shows the screen of an in-car information system, but the screen is only a small part of it. When driving, the screen displays very little information, because the driver can't look at it for more than a fraction of a second anyway.

Instead, the system can speak to the driver. It also can make various sounds, and it controls the radio and CD player as well, so it can lower the volume if it needs to say something, or when a phone call comes in.

Input to the system can be via the touch screen, but while driving, that doesn't work. There is also a knob on the console; the controls of the radio and CD player serve as input; the hands-free set for the telephone doubles as voice input for the system; and there is input from various devices in the car: speed, GPS position, tire pressure, fuel level, outside/inside temperature, traffic information, etc.

The car scenario: output

[genie]

Which information is given when and how depends on various factors. Some information is permanently visible (speed, fuel level), some is visible when the user asks for it (song title, outside temperature), some is visible when the system thinks it is necessary (low-fuel warning, traffic info).

The car scenario: input

[genie]

no keyboard, no mouse

There are various ways to interact with the system, and some are more appropriate than others, based on the circumstances.

The car scenario: sessions

[genie]

On a large screen, a session is typically a window. You can “interrupt” the session by putting another window in front of it. The computer has little management to do. It keeps all sessions open, but it is the user who manages them, thanks to his “external memory,” which is the screen.

When the devices are smaller, you don't have the luxury of using the screen as a visual memory. When the voice synthesizer and the screen in the car are being used for navigation and an important message interrupts the session, or the user changes channels on the radio, it is the user's own memory that needs to keep track of what was happening in each session. In other words, there can be only a small number of parallel sessions, unless the system can help the user recall the history and context of each session.

The system also needs session management to move sessions from one device to another. E.g., when you are driving and using the hands-free set for a telephone call, you may wish to continue the call outside the car using the handset, as soon as you are parked.

When there is one central unit and several fairly unintelligent devices (a screen, a voice synthesizer, some buttons and handles, etc.), that is still manageable, but when the devices themselves are intelligent (the telephone handset is in fact a smartphone, the GPS device can be used stand-alone as well), there is the added complexity of deciding which device is best suited to do which part of the computation and which one controls the less intelligent devices.

The multimodal framework (input)

[Diagram: the multimodal interaction framework, input side. Each input mode (speech with its grammars, handwriting, keyboard, mouse, system-generated events, etc.) is turned into EMMA; after interpretation and integration the EMMA reaches the Interaction Manager, which is connected to the application functions, the session component, the system & environment, and to the output side.]

MMI isn't the same as "input" (the "I" stands for "Interaction"), but it fits nicely into my talk this way.

Device Independence (output)

[Diagram: the framework, output side. The Interaction Manager (connected to the input side, the application functions, the session component, and the system & environment) drives output through styling (CSS, XSL, audio styling) and adaptation (DISelect, Media Queries) to voice, graphics, text, print, etc.]

Under "etc." you can think of media like print, braille, force feedback, and other physical effects, such as movement, heat and coffee…

Where's the network?

Most steps can be either on a server or on a client.

Usually, diagrams of Web technology include a "cloud" that represents the Web. Where is the Web in this schema?

Client-side vs server-side

Case-by-case. Some factors:

The network can be in various places. It is good to offload calculations to clients, to free up the server to handle more connections at the same time. That puts the Web "cloud" on some of the arrows towards the right side of the diagrams. But some client devices are small and slow, and can only handle content that needs very little processing, which puts the Web nearer to the left side.

And there are other reasons for doing more or less of the processing on the client side:

The printer & phone scenario

image: phone-genie

Imagine browsing on a cell phone

But a printer is different from a phone.

("Best viewed with…"!?)

The layout should probably be changed (multiple columns?), the images resized or replaced with other ones, interactive parts such as a tabbed display should now be displayed without requiring interaction.

The phone may already have received all the information the printer needs (text, images, style sheets) and it may thus be a matter of recalculating the rendering in a different context.

On the other hand, the server may have indicated to the phone that, in case of printing, it has alternative content that is better adapted. In that case, the phone, or the printer itself, may request that adapted content from the server.
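One way to prepare for the first case directly in the markup is to send a separate style sheet for printing along with the one used on the phone, in the same spirit as the handheld example later in these slides (the file names are made up):

<link rel="stylesheet" media="handheld" href="pda.css">
<link rel="stylesheet" media="print" href="print.css">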

EMMA

Extensible MultiModal Annotation markup language

http://www.w3.org/TR/emma/

EMMA example

<one-of id="r1">
  <interpretation id="int1" confidence="0.75"> 
    <x:origin>Boston</x:origin>
    <x:destination>Denver</x:destination>
  </interpretation>
  <interpretation id="int2" confidence="0.68">
    <x:origin>Austin</x:origin>
    <x:destination>Denver</x:destination>
  </interpretation>
</one-of>
    

InkML

Ink Markup Language

http://www.w3.org/TR/InkML

InkML example

image: handdrawn h

<ink>
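  <!-- a trace is a sequence of sampled x y pen positions -->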
  <trace>
     10 0 9 14 8 28 7 42 6 56 6 70
     8 84 8 98 8 112 9 126 10 140
     13 154 14 168 17 182 18 188
     23 174 30 160 38 147 49 135
     58 124 72 121 77 135 80 149
     82 163 84 177 87 191 93 205
   </trace>
</ink>
    

SVG

Scalable Vector Graphics

http://www.w3.org/TR/SVG11

SVG example

walking man
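The slide shows an animated "walking man" drawn in SVG; the figure itself is not reproduced here. As a minimal sketch of what SVG markup looks like (the shapes, sizes and animation are made up, not the original figure):

<svg xmlns="http://www.w3.org/2000/svg" width="300" height="200">
  <g stroke="black" fill="none">
    <circle cx="50" cy="40" r="20"/>            <!-- head -->
    <line x1="50" y1="60" x2="50" y2="130"/>    <!-- body -->
    <line x1="50" y1="130" x2="30" y2="180"/>   <!-- legs -->
    <line x1="50" y1="130" x2="70" y2="180"/>
    <!-- declarative animation: slide the figure across the drawing -->
    <animateTransform attributeName="transform" type="translate"
                      from="0 0" to="200 0" dur="5s" repeatCount="indefinite"/>
  </g>
</svg>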

SSML

Speech Synthesis Markup Language

http://www.w3.org/TR/speech-synthesis
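The slides do not include an SSML fragment; as a minimal sketch of the kind of markup SSML defines (the sentence and values are made up):

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  Your train leaves at <emphasis>ten thirty</emphasis>.
  <break time="500ms"/>
  <prosody rate="slow">Please change in Lyon.</prosody>
</speak>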

Demo (film)

A day in the life…(5 min)

The Mobile Web and the Diversification of Devices

What is device independence?

device independence

The user's point of view

Access the Web whoever you are, wherever you are, whenever you want, and with whatever means you choose

user device independence description

The user's point of view

The problem

The problem (2)

[Diagram: the Interaction Manager connected to the application, the session component, and the system & environment.]

The author's point of view

Write a single piece of content that is accessible to all

author device independent description

The problem

Architecture of the Universal Web

Goal: define an architecture that covers both aspects

Central point: the adaptation chain (from the server to the client), where the two points of view meet

Key elements:

The delivery context

Give the adaptation engine information about the delivery context

delivery context presentation

HTTP headers

The problem

CC/PP (1)

Composite Capability/Preference Profiles

A generic framework allowing a device to describe its characteristics, its context, and the user's preferences

[Diagram: device → CC/PP → adaptation (DISelect, Media Query) → application]

Extensible (XML/RDF).

Necessary if adaptation not done by device itself.

Complement to Media Queries and DISelect.

Small file with device characteristics & user's prefs.

May be stored and identified by a URL.
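A minimal sketch of such a profile, in the RDF/XML form that CC/PP uses (the ex: vocabulary and the example URLs are made up; real phones typically use the UAProf vocabulary):

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ccpp="http://www.w3.org/2002/11/08-ccpp-schema#"
         xmlns:ex="http://example.com/schema#">
  <rdf:Description rdf:about="http://example.com/MyPhone">
    <ccpp:component>
      <rdf:Description rdf:about="http://example.com/MyPhone#Hardware">
        <rdf:type rdf:resource="http://example.com/schema#HardwarePlatform"/>
        <ex:displayWidth>176</ex:displayWidth>
        <ex:displayHeight>208</ex:displayHeight>
        <ex:colorCapable>Yes</ex:colorCapable>
      </rdf:Description>
    </ccpp:component>
  </rdf:Description>
</rdf:RDF>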

CC/PP (2)

4 basic building blocks:

CC/PP (3)

Basic scenario:

cc/pp architecture

Using CC/PP in a Web architecture

use of CC/PP for multi-platform adaptation

CC/PP: Adoption

UAProf: How It Works

uaprof processing model

UAProf profiles

example CC/PP profiles

UAProf: Usage

Single Authoring of Content

Principle:

The Universal Content Format Approach

HTML+CSS

HTML+CSS (2)

<html>
<head>
<link rel=stylesheet href="bigcss.css"
  media="screen and (min-device-height: 600px)">
<link rel=stylesheet media="handheld" href="pda.css">
</head>
<body>
<h1 name="title" class="title">This is…</h1>
<p class="summary"> This is a short summary
<p class="extra"> This is very long extra…
<p class="link"><a href="myling.html">More…</a>

Here, the negotiation between server and client is built into the HTML. All clients receive the same HTML, but if the client is a “handheld” (a PDA, a mobile phone), it ignores the first link (bigcss.css) and only uses the second one (pda.css).

The pda.css style sheet will then probably suppress the paragraph labeled “extra”. (Suppress it from the interface only; the paragraph has already been downloaded.)
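For example, pda.css could hide that paragraph with a single rule (a sketch; this rule is not in the original slides):

.extra { display: none }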

A classic PC, on the other hand, takes the style sheet for big screens (screen). In the example there is an additional condition: bigcss.css is only meant for screens that are at least 600 pixels high.

The expression in the MEDIA attribute is called a Media Query.

Media Queries

http://www.w3.org/TR/css3-mediaqueries

Media Queries example

HTML:

<link href="style1.css"
  media="handheld and (color)
         and (min-width: 400px)">
    

CSS:

@import "s1.css" handheld and (color);
@media screen and (max-width: 800px) {…}
    

Pro/Cons

Pro

Pro/Cons

Cons

XForms

A new standard for HTML forms

XForms: Example

<xforms>

<model>
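  <!-- the instance holds the data edited by the form;
       the submission says where and how to send it -->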
  <instance>
  <person>
    <fname/>
    <lname/>
  </person>
  </instance>
  <submission id="form1" method="get"
   action="submit.asp"/>
</model>
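<!-- each control binds to a node in the instance via its ref attribute -->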

<input ref="fname">
<label>First Name</label></input><br />

<input ref="lname">
<label>Last Name</label></input><br /><br />

<submit submission="form1">
<label>Submit</label></submit>

</xforms>

Pro/Cons

Pro

Cons

The Content Adaptation Approach

Application-specific XML languages and XSLT

Defining an XML language appropriate to the application:

<cd>
  <artist>Aatabou, Najat</artist>
  <title>The Voice of the Atlas</title>
  <catalog>CDORBD 069</catalog>
  <time>61.15</time>
  <filed>C05 World</filed>
  <playlist>
    <work>Baghi narajah</work>
    <work>Finetriki</work>
    <work>Shouffi rhirou</work>
    <work>Lila ya s'haba</work>
  </playlist>
</cd>
    

XSLT

how XSLT works

XSLT (2)

<xsl:template match="/">
<html>
 <xsl:for-each select="cd">
  <h1><xsl:value-of select="artist"/></h1>
  <hr />
  <p…

XSLT (3)

The Voice of the Atlas

Label: , Number: CDORBD 069 , Time: 61.15

Stored at: C05 World

Playlist:
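The template on the earlier slide is truncated; a complete stylesheet along these lines (a reconstruction, not the original markup from the slides) would produce output like the above:

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <html>
      <xsl:for-each select="cd">
        <h1><xsl:value-of select="artist"/></h1>
        <hr />
        <h2><xsl:value-of select="title"/></h2>
        <p>Label: <xsl:value-of select="label"/>,
           Number: <xsl:value-of select="catalog"/>,
           Time: <xsl:value-of select="time"/></p>
        <p>Stored at: <xsl:value-of select="filed"/></p>
        <p>Playlist:</p>
        <ul>
          <xsl:for-each select="playlist/work">
            <li><xsl:value-of select="."/></li>
          </xsl:for-each>
        </ul>
      </xsl:for-each>
    </html>
  </xsl:template>
</xsl:stylesheet>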

Implementation

Pro/Cons

Pro

Pro/Cons

Cons

Future: A Mixed Approach

Future: A Mixed Approach (2)

<sel:select>
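 <!-- the first sel:when whose expression matches the delivery context is used;
      sel:otherwise applies when none matches -->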
 <sel:when sel:expr="screen-width &gt; 400px
      and available-colors &gt; 4">
  <img alt="Many people evacuated" src="imagebig"/>
 </sel:when>
 <sel:when sel:expr="screen-width &gt; 100px and
      available-colors &gt; 4">
  <img alt="Many people evacuated" src="imagesmall"/>
 </sel:when>
 <sel:otherwise>
  <p>Many people had to be evacuated.</p>
 </sel:otherwise>
</sel:select>

Future: A Mixed Approach (3)

Conclusion

If this topic interests you, contact us about final-year internships!

References (1)

W3C:

References (2)

General DI framework:

References (3)

Delivery context

References (4)

Single authoring of documents

The end

http://www.w3.org/Talks/2005/0120-MMI+DI-ESSI/all