Multimodal Interaction Activity
Extending the Web to support multiple
modes of interaction.
Introduction
The Mission
The Multimodal Interaction Activity seeks to extend the Web to
allow users to dynamically select the most appropriate mode of
interaction for their current needs, including any disabilities,
whilst enabling developers to provide an effective user interface
for whichever modes the user selects. Depending upon the device,
users will be able to provide input via speech, handwriting, and
keystrokes, with output presented via displays, pre-recorded and
synthetic speech, audio, and tactile mechanisms such as mobile
phone vibrators and Braille strips.
Multimodal interaction offers significant ease of use benefits
over uni-modal interaction, for instance, when hands-free operation
is needed, for mobile devices with limited keypads, and for controlling
other devices when a traditional desktop computer is unavailable to
host the application user interface. This is being driven by advances
in embedded and network-based speech processing that are creating
opportunities for integrated multimodal Web browsers and for solutions
that separate the handling of visual and aural modalities, for example,
by coupling a local XHTML user agent with a remote VoiceXML user agent.
Target Audience
The target audience of the
Multimodal Interaction Working Group
(member only
link)
is vendors and service providers of multimodal
applications, and should include a range of organizations in different industry
sectors, such as:
- Mobile and hand-held devices
- As a result of increasingly capable networks, devices, and speech
recognition technology, the number of existing multimodal applications,
especially mobile applications, is growing rapidly. Multimodal
Voice Search in particular is a relatively new and
compelling use case, and has been implemented in applications by a number
of companies, including Google, Microsoft, Yahoo, Vlingo,
SpeechCycle, Novauris, AT&T, Openstream, Vocalia,
Metaphor Solutions and Sound Hound. Speech offers a
welcome means to interact with smaller devices, allowing one-handed and
hands-free operation. Users benefit from being able to choose which
modalities they find convenient in any situation. The Working Group
should be of interest to companies developing smart phones and personal
digital assistants or who are interested in providing tools and
technology to support the delivery of multimodal services to such
devices.
Please note that a related effort has recently been initiated in the W3C
by the HTML
Speech Incubator Group (HTML Speech XG).
The focus of the XG is developing proposals for accessing speech
recognition and speech synthesis from HTML5 browsers, and Voice
Search and Speech Command Interfaces are
possible use cases for these technologies in the browser. However, the XG
does not attempt to address modalities other than speech, such as
handwriting, emotion, or the wide variety of present and future input
modalities. Similarly, it doesn't attempt to address non-browser
contexts. In contrast, the Multimodal Architecture provides a generic
framework for modality integration and control. Speech in the browser can
be seen as a special case of the kind of modality integration covered by
the MMI Architecture. The Multimodal Interaction Working Group has been
collaboratively working with the XG, and will continue to liaise with
them on topics of common interest. For example, the XG has adopted EMMA
as a speech recognition result format.
- Home appliances, e.g., TV, and home networks
- Multimodal interfaces are expected to add value to remote control of
home entertainment systems, as well as finding a role for other systems
around the home. Companies involved in developing embedded systems and
consumer electronics should be interested in W3C's work on multimodal
interaction.
- Enterprise office applications and devices
- Multimodal interaction benefits desktops, wall-mounted interactive
displays, multi-function copiers and other office equipment by offering a
richer user experience and the chance to add modalities like
speech and pens to existing modalities like keyboards and mice. W3C's
standardization work in this area should be of interest to companies
developing client software and application authoring technologies, and
who wish to ensure that the resulting standards live up to their needs.
- Intelligent IT ready cars
- With the emergence of dashboard integrated high resolution color
displays for navigation, communication and entertainment services, W3C's
work on open standards for multimodal interaction should be of interest
to companies working on developing the next generation of in-car
systems.
- Medical applications
- Mobile healthcare professionals and practitioners of telemedicine will
benefit from multimodal standards for interactions with remote patients
as well as for collaboration with distant colleagues.
Current Situation
The Multimodal Interaction Working Group was launched in 2002
following a joint workshop between the W3C and the WAP Forum.
The Working Group's initial focus was on use cases and requirements. This led to the publication
of the W3C Multimodal Interaction
Framework, and in turn to work on extensible
multi-modal annotations (EMMA), and InkML,
an XML language for ink traces. The Working Group has also worked
on integration of composite multimodal input; dynamic adaptation
to device configurations, user preferences and environmental
conditions (now transferred to the Device
Independence Activity); modality component interfaces; and a
study of current approaches to interaction management. The Working
Group has now been re-chartered through 31 July 2013 under the
terms of the W3C
Patent Policy (5 February 2004 Version). To promote the widest
adoption of Web standards, W3C seeks to issue Recommendations that
can be implemented, according to this policy, on a Royalty-Free
basis. The Working Group is chaired by Deborah Dahl.
The W3C Team Contact is Kazuyuki
Ashimura.
We want to hear from you!
We are very interested in your comments and suggestions.
If you have implemented multimodal interfaces, please share
your experiences with us, as we are particularly interested
in reports on implementations and their usability for both
end-users and application developers. We welcome comments on any
of our published documents. If you have a proposal for
a multimodal authoring language, please let us know. To subscribe
to the discussion list send an email to www-multimodal-request@w3.org
with the word subscribe in the subject header. Previous discussion
can be found in the public archive.
To unsubscribe
send an email to www-multimodal-request@w3.org with the word
unsubscribe in the subject header.
How to join the Working Group
If your organization is already a member of W3C, ask your W3C Advisory Committee Representative (member only
link) to fill out the online registration form to confirm
that your organization is prepared to commit the time and expense
involved in participating in the group. You will be expected to
attend all Working Group meetings (about 3 or 4 times a year) and
to respond in a timely fashion to email requests. Further details
about joining are available on the Working Group (member only
link) page. Requirements for patent disclosures, as well as terms
and conditions for licensing essential IPR are given in the W3C
Patent Policy.
More information about the W3C
is available, as is information about joining W3C.
Patent Disclosures
W3C maintains a public list of
any patent disclosures made in connection with the deliverables of
the group; that page also includes instructions for disclosing a
patent.
Revised publication target dates
Work in Progress
This is intended to give you a brief summary of each of the
major work items under development by the Multimodal Interaction
Working Group. The suite of specifications is known as the W3C
Multimodal Interaction Framework.
- Introduction,
6 May 2003. The Multimodal Interaction
Framework introduces a general framework for multimodal
interaction, and the kinds of markup languages being considered.
- Use cases,
4 December 2002. Multimodal Interaction
Use Cases describes several use cases that are helping us to
better understand the requirements for multimodal interaction.
- Core requirements,
8 January 2003. Multimodal Interaction
Requirements describes fundamental requirements for the
specifications under development in the W3C Multimodal Interaction
Activity.
Current Work
The following indicates current work items. Additional work is
expected on topics described in the Scope section of the charter.
Multimodal Architecture
Main Architecture draft
- Requirements and Capabilities, 10 May 2004
- First Public Working Draft, 22 April 2005
- Second Working Draft, 14 April 2006
- Third Working Draft, 11 December 2006
- Fourth Working Draft, 14 April 2008
- Fifth Working Draft, 16 October 2008
- Sixth Working Draft, 1 December 2009
- Seventh Working Draft, 21 September 2010
- Last Call Working Draft, 25 January 2011
- Second Last Call, 6 September 2011
- Candidate Recommendation, 12 January 2012
A loosely coupled architecture for the Multimodal Interaction Framework
that focuses on providing a general means for components to
communicate with each other, plus basic infrastructure for
application control and platform services. Work is continuing
on how the architecture can be realized in terms of well defined
component interfaces and eventing models.
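As a rough illustration of what "loosely coupled" can mean here, the sketch below has an interaction manager and one modality component exchange events through a shared bus, so neither calls the other directly. It is only a toy model under stated assumptions: the class names (`EventBus`, `InteractionManager`, `SpeechComponent`), the event names, and the payload are all hypothetical and are not taken from the architecture drafts listed above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Event:
    """A message passed between components (hypothetical shape)."""
    name: str    # illustrative event name, e.g. "StartRequest"
    source: str  # identifier of the sender
    target: str  # identifier of the receiver
    data: dict = field(default_factory=dict)


class EventBus:
    """Routes events by target name; components only know the bus."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[Event], None]] = {}
        self.log: List[str] = []

    def register(self, name: str, handler: Callable[[Event], None]) -> None:
        self._handlers[name] = handler

    def send(self, event: Event) -> None:
        self.log.append(f"{event.source}->{event.target}:{event.name}")
        self._handlers[event.target](event)


class SpeechComponent:
    """Stand-in modality component that answers a start request."""

    def __init__(self, bus: EventBus) -> None:
        self.bus = bus
        bus.register("speech", self.on_event)

    def on_event(self, event: Event) -> None:
        if event.name == "StartRequest":
            # Pretend recognition happened and report a canned result.
            reply = Event("DoneNotification", "speech", event.source,
                          {"utterance": "flights to Boston"})
            self.bus.send(reply)


class InteractionManager:
    """Starts components and collects whatever they report back."""

    def __init__(self, bus: EventBus) -> None:
        self.bus = bus
        self.results: List[dict] = []
        bus.register("im", self.on_event)

    def start(self, component: str) -> None:
        self.bus.send(Event("StartRequest", "im", component))

    def on_event(self, event: Event) -> None:
        if event.name == "DoneNotification":
            self.results.append(event.data)


bus = EventBus()
manager = InteractionManager(bus)
SpeechComponent(bus)
manager.start("speech")
print(bus.log)
print(manager.results)
```

Because routing goes through the bus, a pen or keyboard component could be registered under another name without changing the manager, which is the property the paragraph above describes.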
MMI Authoring
MMI Best Practices
Extensible Multi-Modal Annotations (EMMA)
- EMMA 1.0
EMMA has been developed as a data exchange format for the interface
between input processors and interaction management systems. It will
define the means for recognizers to annotate application specific data
with information such as confidence scores, time stamps, input mode
(e.g. key strokes, speech or pen), alternative recognition hypotheses,
and partial recognition results. EMMA is a target data format for
the
semantic interpretation
specification being developed in the Voice
Browser Activity, and which describes annotations to speech
grammars for extracting application specific data as a result
of speech recognition. EMMA supersedes earlier work on the
natural language semantics markup language in the Voice Browser
Activity.
- EMMA 2.0
Since
EMMA 1.0 became a W3C
Recommendation, a number of new possible use cases for the EMMA
language have emerged. These include the use of EMMA to represent
multimodal output, biometrics, emotion, sensor data, multi-stage
dialogs, and interactions with multiple users.
The Working Group therefore decided to work on a document capturing use
cases and issues for a series of possible extensions to EMMA, and
published a Working Group Note to seek feedback on the various
use cases.
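To make the EMMA 1.0 description above more concrete, the following sketch builds a small EMMA-style document holding two alternative recognition hypotheses with confidence scores and an input mode, then picks the most confident one. The namespace URI and the `emma:one-of`, `emma:interpretation`, `emma:confidence`, `emma:medium` and `emma:mode` names are given as recalled from the EMMA 1.0 Recommendation and should be checked against it. The `destination` payload element and both helper functions are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace as recalled from EMMA 1.0; verify against the Recommendation.
EMMA_NS = "http://www.w3.org/2003/04/emma"
ET.register_namespace("emma", EMMA_NS)


def q(name: str) -> str:
    """Return a namespace-qualified name in ElementTree's {uri}local form."""
    return f"{{{EMMA_NS}}}{name}"


def build_emma(hypotheses):
    """Wrap (city, confidence) pairs as alternative interpretations."""
    root = ET.Element(q("emma"), {"version": "1.0"})
    one_of = ET.SubElement(
        root, q("one-of"),
        {"id": "r1", q("medium"): "acoustic", q("mode"): "voice"},
    )
    for index, (city, confidence) in enumerate(hypotheses, start=1):
        interp = ET.SubElement(
            one_of, q("interpretation"),
            {"id": f"int{index}", q("confidence"): f"{confidence:.2f}"},
        )
        # Application-specific payload; the element name is hypothetical.
        ET.SubElement(interp, "destination").text = city
    return root


def best_hypothesis(root):
    """Return the payload of the interpretation with the highest confidence."""
    interpretations = root.iter(q("interpretation"))
    best = max(interpretations, key=lambda e: float(e.get(q("confidence"))))
    return best.findtext("destination")


doc = build_emma([("Boston", 0.75), ("Austin", 0.68)])
print(ET.tostring(doc, encoding="unicode"))
print(best_hypothesis(doc))
```

An interaction manager consuming such a document could apply its own policy instead of simply taking the top score, for example asking the user to confirm when the two confidences are close.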
InkML - an XML language for digital ink traces
This work item sets out to define an XML data exchange format
for ink entered with an electronic pen or stylus as part of a
multimodal system. This will enable the capture and server-side
processing of handwriting, gestures, drawings, and specific
notations for mathematics, music, chemistry and other fields,
as well as supporting further research on this processing. The
Ink subgroup maintains a separate public
page devoted to W3C's work on pen and stylus input.
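As a sketch of the kind of server-side processing mentioned above, the code below reads pen traces from a small InkML-style document and computes their bounding box. It assumes the simplest trace form, where points are separated by commas and each point is written as whitespace-separated X and Y values. Other encodings that InkML permits are not handled. The namespace URI is given as recalled from the InkML specification, and the sample data and function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace as recalled from the InkML specification; verify before relying on it.
INKML_NS = "http://www.w3.org/2003/InkML"

# Two hypothetical pen strokes.
SAMPLE = f"""<ink xmlns="{INKML_NS}">
  <trace>10 0, 9 14, 8 28</trace>
  <trace>25 0, 25 30</trace>
</ink>"""


def parse_traces(inkml_text):
    """Return one list of (x, y) float pairs per trace element."""
    root = ET.fromstring(inkml_text)
    traces = []
    for trace in root.iter(f"{{{INKML_NS}}}trace"):
        points = []
        for chunk in (trace.text or "").split(","):
            values = chunk.split()
            if len(values) >= 2:
                # Only the first two channels are read; extra ones are ignored.
                points.append((float(values[0]), float(values[1])))
        traces.append(points)
    return traces


def bounding_box(traces):
    """Return (min_x, min_y, max_x, max_y) over all points of all traces."""
    xs = [x for trace in traces for x, _ in trace]
    ys = [y for trace in traces for _, y in trace]
    return (min(xs), min(ys), max(xs), max(ys))


strokes = parse_traces(SAMPLE)
print(len(strokes))
print(bounding_box(strokes))
```

A handwriting or gesture recognizer would typically start from this kind of point list, normalizing it by the bounding box before further analysis.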
Emotion Markup Language (EmotionML) 1.0
EmotionML will provide representations of emotions and related states
for technological applications.
As the web is becoming ubiquitous, interactive, and multimodal,
technology needs to deal increasingly with human factors, including
emotions.
The language is conceived as a "plug-in" language suitable for use in
three different areas: (1) manual annotation of data; (2) automatic
recognition of emotion-related states from user behavior; and (3)
generation of emotion-related system behavior.
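The second use area, automatic recognition, might produce annotations along the lines of the sketch below, which writes one emotion category with a confidence value and reads it back. The namespace URI, the `category-set` vocabulary URI, and the `emotionml`, `emotion` and `category` names are given as recalled from the EmotionML drafts and are assumptions to verify against the specification. The two helper functions are hypothetical.

```python
import xml.etree.ElementTree as ET

# Both URIs are recalled from the EmotionML drafts and should be verified.
EMOTIONML_NS = "http://www.w3.org/2009/10/emotionml"
BIG6 = "http://www.w3.org/TR/emotion-voc/xml#big6"
ET.register_namespace("", EMOTIONML_NS)


def tag(name: str) -> str:
    """Return a namespace-qualified element name."""
    return f"{{{EMOTIONML_NS}}}{name}"


def annotate(category: str, confidence: float):
    """Build a document holding a single recognized emotion category."""
    root = ET.Element(tag("emotionml"), {"category-set": BIG6})
    emotion = ET.SubElement(root, tag("emotion"))
    ET.SubElement(
        emotion, tag("category"),
        {"name": category, "confidence": f"{confidence:.2f}"},
    )
    return root


def read_categories(root):
    """Return (name, confidence) pairs for every category in the document."""
    return [
        (c.get("name"), float(c.get("confidence")))
        for c in root.iter(tag("category"))
    ]


doc = annotate("happiness", 0.8)
print(ET.tostring(doc, encoding="unicode"))
print(read_categories(doc))
```

Since the language is conceived as a "plug-in", a fragment like this would normally be embedded in another format rather than exchanged on its own.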
Related Materials
Workshops
MMI related presentations
-
Jerry Carter (Nuance), Rafah Hosn (IBM) and Kaz Ashimura (W3C)
gave talks on
"Multimodal Web to Expand Universal Access"
on 11 May 2007 during W3C Track in WWW2007 Conference, Banff, Canada.
-
InkML slides
were presented
on 24 October 2006 at IWFHR 2006 Conference.
-
W3C Seminar on Multimodal Web Applications for Embedded Systems
was held on 21 June 2005.
-
W3C Workshop on Multimodal Interaction
was held on 19-20 July 2004 in
Sophia Antipolis, France. (schedule, papers.)
-
IST-FP6-001895 "Multimodal Web
Interaction" (MWeb) Project: A W3C initiative funded by the
European Commission in support of the development and adoption
of W3C standards that enable multimodal Web access via mobile
devices. MWeb includes European outreach and the development
of demonstrators.
-
Openstream
Multimodal Interaction use case demo (Macromedia Flash video).
-
The W3C Voice Browser working group
published a set of requirements for
multimodal interaction in July 2000. The working group also
invited participants to demonstrate proof of concept examples of
multimodal applications. A number of such demonstrations were shown
at the working group's face to face meeting held in Paris in May
2000.
-
To get a feeling for future work, the W3C together with the WAP
Forum held a joint workshop
on the Multimodal Web in Hong Kong in late 2000. This workshop
addressed the convergence of W3C and WAP standards, and the
emerging importance of speech recognition and synthesis for the
Mobile Web. The workshop's
recommendations encouraged W3C to set up a multimodal working
group to develop standards for multimodal user interfaces for the
Web.
-
The IETF
Speech Services Control (SpeechSC) working group is developing
protocols to support distributed speech recognition, speech
synthesis and speaker verification services, and expects to take
advantage of W3C's work on the speech recognition grammar
specification (SRGS), the speech
synthesis markup language (SSML),
semantic interpretation (SI)
and extensible multimodal annotations (EMMA). ETSI's STQ Aurora project
is looking at codecs optimized for distributed speech recognition.
See also David Pearce's presentation on
DSR to the W3C VB/MMI working groups on 25th May 2005.
-
ETSI standard ES
202076 defines a generic spoken command vocabulary for
controlling common operations such as calling someone by saying their
name, browsing through a voice mail box, adjusting the volume, muting
the microphone and other device properties. ETSI provide bindings for
the vocabulary to a variety of human languages. This suggests the
possibility of device-based recognition for common spoken commands
together with network based recognition for other vocabularies.
-
Another idea is to couple a local graphical user interface
with a remote voice dialog engine, perhaps based upon VoiceXML. Here the idea is to allow events
to be passed between the device and the remote dialog engine. To
the application developer, these events would look just the same
whether they originated locally or remotely. In this approach,
events can be used to initiate a range of actions, for instance,
changing the focus of interaction, setting the value of a form
field, loading a new page, or altering the current page via the
DOM. W3C work on REX aims to provide an XML
grammar for DOM events with a view to supporting distribution
of events, and in principle, could be used to couple different
modality components.
-
SIP can also be used to synchronize several devices, for
instance to update the display on a PDA, automotive or desktop
system in concert with the much smaller display on a cellphone.
When it comes to setting up a session that potentially involves
multiple devices and servers, SIP looks like it will provide an
effective solution together with server-side scripts. The Voice
Browser working group's work on call
control may prove valuable.
-
ETSI
EG 202 191 - V1.1.1 - Human Factors (HF); Multimodal interaction,
communications and navigation guidelines (PDF). A study of design
principles for multimodal applications with a focus on accessibility.
Published August 2003.
-
InkXML specification
(W3C Members
only) contributed to W3C on 16th August 2002 by IBM, Intel, the
International Unipen Foundation, and Motorola, Inc. InkXML is a markup
language for the exchange of virtual ink, conveying such information
as the kind of pen, the color of the ink and the nature of the medium,
the pressure applied to the pen, its position and speed. InkXML can
be used to exchange virtual ink among devices, such as handhelds,
laptops, desktops, and servers. InkXML is intended to provide the
ink component of Web-based multimodal applications. The working group
consensus process will determine which ideas in InkXML will be taken
up within W3C. W3C Members can view the
contribution letter.
-
Multimodal browser
architecture (PDF) by Stéphane Maes (IBM), dated 20th August
2001. Makes the case for using the model-view-controller paradigm
and presents a variety of architectures for synchronization across
modalities and devices. This is the presentation (T2-010705) that
Stéphane gave to the 3GPP T2 meeting in September 2001.
-
Multimodal access position
paper (PDF) by Nathalie Amann, Laurent Hue and Klaus Lukas
(Siemens), dated 26th November 2001. Describes a possible
architecture for multimodal interaction based upon
coupling a visual client with a VoiceXML interpreter.
-
Towards SMIL as a foundation for
multimodal, multimedia applications (PDF), by Jennifer
Beckham (University of Wisconsin), Giuseppe Di Fabbrizio, and Nils
Klarlund (AT&T Labs), dated 1 October 2001. Shows how SMIL can
provide fine grained synchronization control for multimodal
interaction. The approach combines SMIL with markup for control of
speech engines.
-
XHTML+Voice, W3C Submission
by IBM, Motorola and Opera Software, dated 30th November 2001,
shows how markup for XHTML and VoiceXML can be combined to support
multimodal interaction.
An
updated version (X+V 1.1)
was contributed to the Voice Browser and
Multimodal Interaction working groups on 11th March 2003, see the
Team Comment for
details of associated IPR disclosures. W3C Members can view the
contribution letter.
-
The SALT Forum was
launched on 15th October 2001 with a mission to develop standards
for speech enabling HTML, XHTML and SMIL. More recently, it has
been applied to speech enabling SVG.
The SALT
1.0 specification was contributed to the Multimodal Interaction
and Voice Browser working groups on 31st July 2002, and the
working group consensus process will determine which ideas in SALT
will be taken up within W3C. W3C Members can view the
contribution letter. The SALT+SVG profile was provided as a
subsequent contribution.
-
3GPP is studying different
ways to include speech-enabled services comprising both speech-only
and multimodal services in 3G networks. One option for distributed
speech recognition is based on the ETSI's STQ
Aurora developments. Other options are dependent on the general
study on speech enabled services. 3GPP may be interested in working
on integrating remote access to speech synthesis resources. W3C
should keep a watching brief. There is a possible connection to the
IETF
Speech Services Control Working Group (SpeechSC), which is
developing protocols for distributed access to speech synthesis,
recognition and speaker verification services (MRCP).
For more details on other organizations see the Multimodal Interaction
Charter.
Kazuyuki Ashimura <ashimura@w3.org>
- Multimodal Interaction Activity Lead