W3C Multimodal Interaction Working Group

MMI Architecture

Deutsche Telekom

Telekom Innovation Laboratories (Deutsche Telekom R&D) develops multimodal user interfaces for future telecommunication services and is pleased to see the Multimodal Architecture and Interfaces specification moving forward. We have implemented a distributed multimodal interaction management system that relies entirely on W3C standards: an Interaction Manager based on SCXML (using the Apache Commons SCXML interpreter), a Graphical Modality Component based on HTML, and a Voice Modality Component based on CCXML and VoiceXML. EMMA is used to represent user input throughout the system. All components are integrated using HTTP as described by the Multimodal Architecture and Interfaces specification. These components have been used successfully in the interoperability test activity with other MMI working group members, demonstrating a distributed multimodal application using components delivered by three independent parties.

Implementing multimodal applications requires writing a lot of markup code (e.g. SCXML, HTML, grammars), which has to be kept synchronized. To reduce complexity in application development we developed a Multimodal Application Builder, a graphical tool to model multimodal applications and to generate most parts of the code (stubs).

France Telecom

Orange Labs (France Telecom R&D) researches and prototypes multimodal user interfaces on mobile devices. We use and promote open standards, and as such are particularly interested in the W3C Multimodal Architecture specification, which allows building multimodal applications out of specialized components delivered by multiple vendors. Speech, touch and other modalities can be managed by constituents hosted on local devices or on the web, all orchestrated into a single multimodal interaction through the Multimodal Architecture. Orange Labs has implemented an Interaction Manager based on the Commons SCXML application engine, available to Modality Components as an HTTP server. While this implementation is primarily intended for internal R&D work, it was also used successfully in interoperability tests with W3C partners, demonstrating a distributed multimodal application running on components delivered by three independent parties and integrated through the proposed standard.

JVoiceXML

JVoiceXML is an open source VoiceXML interpreter written entirely in the Java programming language, supporting the VoiceXML 2.1 standard. The strength of JVoiceXML lies in its modular and open architecture. Besides supporting Java APIs such as JSAPI and JTAPI, the platform allows custom speech engines to be integrated easily. Other examples of implementation platforms are the text-based platform and the MRCPv2 platform that are shipped with the voice browser, as well as an integration into the MMI architecture as a modality component. It can be used within a telephony environment but also, without requiring any telephony at all, as a standalone server on a standard desktop PC.

JVoiceXML is available as open source from http://jvoicexml.sourceforge.net under the LGPL license.

Openstream

Openstream, in its ongoing commitment to open and interoperable standards as a W3C Member, leads the development of multimodal platforms and context-aware enablement of speech and other modalities in mobile applications. We are proud to be one of the early and active developers of the W3C Multimodal Interaction Architecture. We have implemented a reference architecture for MMI comprising an Interaction Manager (IM) and several Modality Components (MCs) as part of our context-aware multimodal platform Cue-me. Adopting the MMI architecture and markup languages such as SCXML, InkML and EMMA enabled us to easily incorporate and interoperate with several different vendors' products, so that we can offer our platform and solutions on a broad array of devices and systems.

Telecom ParisTech

This report describes the implementation included in the SOA2M project of the MM Group, TSI Department (Institut Telecom, Telecom ParisTech). This research covers abstract multimodal user interfaces in Service Oriented Architectures for pervasive environments, and particularly in Smart Conference Rooms. Our prototype is an implementation of a multimodal ambient system providing personalized assistance services to different profiles of users. With this prototype we aim to test the intelligent automatic fusion and fission of modalities. We are interested in using semantics with the W3C Multimodal Architecture specification. The group has implemented a Flex/AIR Interaction Manager with an SCXML engine, available to Modality Components as a semantically annotated service (published as a web or Bonjour service), and web-based RIA modality components at different levels: the basic ones (Pointer, Selector and Graphics IN/OUT) and the more complex ones (a voice synthesizer and a Carousel).

EMMA

See also the EMMA 1.0 Implementation Report.

AT&T, USA

AT&T recognizes the crucial role of standards in the creation and deployment of next generation services supporting more natural and effective interaction through spoken and multimodal interfaces, and continues to be a firm supporter of W3C's activities in the area of spoken and multimodal standards. As a participating member of the W3C Multimodal Interaction working group, AT&T welcomes the Extensible Multimodal Annotation (EMMA) 1.0 Candidate Recommendation.

EMMA 1.0 provides a detailed language for capturing the range of possible interpretations of multimodal inputs and their associated metadata through a full range of input processing stages, from recognition, through understanding and integration, to dialog management. The creation of a common standard for the representation of multimodal inputs is critical in enabling rapid prototyping of multimodal applications, facilitating interoperation of components from different vendors, and enabling effective logging and archiving of multimodal interactions.

AT&T is very happy to contribute to the further progress of the emerging EMMA standard by submitting an EMMA 1.0 implementation report. EMMA 1.0 results are already available from an AT&T EMMA server which is currently being used in the development of numerous multimodal prototypes and trial services.

Avaya, USA

Avaya Labs Research has been using EMMA in its prototype multimodal dialogue system and is pleased with the contributions EMMA brings to multimodal interactions.

As a common language for representing multimodal input, EMMA lays the cornerstone on which more advanced architectures and technologies can be developed to enable natural multimodal interactions.

Conversational Technologies, USA

Conversational Technologies strongly supports the Extensible MultiModal Annotation 1.0 (EMMA) standard. By providing a standardized yet extensible and flexible basis for representing user input, we believe EMMA has tremendous potential for making possible a wide variety of innovative multimodal applications. In particular, EMMA provides strong support for interoperable applications based on user inputs in human languages in many modalities, including speech, text and handwriting as well as visual modalities such as sign languages. EMMA also supports composite multimodal interactions in which several user inputs in two or more modalities are integrated to represent a single user intent.

The Conversational Technologies EMMA implementations are used in tutorials on commercial applications of natural language processing and spoken dialog systems. We report on two implementations. The first is an EMMA producer (NLWorkbench) which is used to illustrate statistical and grammar-based semantic analysis of speech and text inputs. The second implementation is an EMMA consumer, specifically a viewer for EMMA documents. The viewer can be used in the classroom to simplify examination of EMMA results as well as potentially in commercial applications for debugging spoken dialog systems. In addition, the viewer could also become the basis of an editor which would support such applications as human annotation of EMMA documents to be used as input to machine learning applications. For most of the EMMA structural elements the viewer simply provides a tree structure mirroring the XML markup. The most useful aspects of the viewer are probably the graphical representation for EMMA lattices, the ability to see timestamps as standard dates and the computed durations from EMMA timestamps. The two implementations have been made available as open source software (http://www.sourceforge.net/projects/NLWorkbench).
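The timestamp handling mentioned above can be illustrated with a minimal Python sketch. It is not the NLWorkbench viewer's code; it only assumes that an interpretation carries the EMMA 1.0 attributes emma:start and emma:end as milliseconds since 1 January 1970 UTC. The sample document, its values and the function name describe_timestamps are made up for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

EMMA_NS = "http://www.w3.org/2003/04/emma"

# Hypothetical sample document; the values are invented for illustration.
SAMPLE = """<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation id="int1"
      emma:start="1087995961542" emma:end="1087995963542"
      emma:medium="acoustic" emma:mode="voice">
    <destination>Boston</destination>
  </emma:interpretation>
</emma:emma>"""


def describe_timestamps(emma_xml):
    """Return (id, start as UTC datetime, duration in ms) per interpretation."""
    root = ET.fromstring(emma_xml)
    rows = []
    for interp in root.iter(f"{{{EMMA_NS}}}interpretation"):
        start = interp.get(f"{{{EMMA_NS}}}start")
        end = interp.get(f"{{{EMMA_NS}}}end")
        if start is None or end is None:
            continue  # timestamps are optional; skip interpretations without them
        start_ms, end_ms = int(start), int(end)
        start_dt = datetime.fromtimestamp(start_ms / 1000, tz=timezone.utc)
        rows.append((interp.get("id"), start_dt, end_ms - start_ms))
    return rows


if __name__ == "__main__":
    for ident, started, duration_ms in describe_timestamps(SAMPLE):
        print(ident, started.isoformat(), f"{duration_ms} ms")
```

Keeping the conversion in one small function makes it easy to reuse for logging or for a display layer; a real viewer would also have to handle relative timestamps and nested containers, which this sketch ignores.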

Deutsche Telekom, Germany

Deutsche Telekom AG is pleased to see the W3C Extensible Multimodal Annotation markup language 1.0 (EMMA) recommendation moving forward and is happy to support the process by providing the following implementation report.

We made use of EMMA within various multimodal prototype applications. Among others, EMMA documents have been generated from within VoiceXML scripts using ECMAScript and sent to a server-based Interaction Manager (see MMI architecture for more details http://www.w3.org/TR/mmi-arch). The Interaction Manager - implemented using an early version of an SCXML interpreter (see http://www.w3.org/TR/SCXML) - acted as an EMMA consumer and integrated input (represented using EMMA) from various modalities. EMMA documents have also been used for communication between various dialog management modules.

From our implementation experiences we note that EMMA has proven to be a valuable specification for representation of (multimodal) user input.

DFKI, Germany

Central topics of the research at the DFKI IUI department are multimodal interaction systems and individualised mobile network services. DFKI is pleased to contribute to the activities of the W3C Multimodal Interaction Working Group. DFKI regards the Extensible MultiModal Annotation (EMMA) 1.0 Candidate Recommendation as a useful instrument for data representation in multimodal interaction systems.

EMMA 1.0 can also be used to represent the data exchanged between the components of dialog systems in a dialog server.

DFKI submits an EMMA 1.0 Implementation Report that accounts for research results in two different systems, SmartWeb (Natural multimodal access to the Semantic Web) and OMDIP (Integration of Dialog Management Components). In both projects results from multimodal recognition have been encoded directly as EMMA structures. EMMA was originally conceived for the representation of user input. In order to represent the output side properly as well, while using a common interface language for communication among all system components, we decided to introduce an extension to the EMMA standard, SW-EMMA (SmartWeb EMMA), with specific output representation constructs. For the communication between the multimodal dialog manager and the application core, which provides semantic access to content from the World Wide Web, the XML-based SW-EMMA constructs have been reproduced within the given ontology language RDFS and incorporated into a dedicated discourse ontology.

KIT, Japan

Kyoto Institute of Technology (KIT) strongly supports the Extensible MultiModal Annotation 1.0 (EMMA) specification. We have been using EMMA within our multimodal human-robot interaction system. EMMA documents are dynamically generated by (1) the Automatic Speech Recognition (ASR) component and (2) the Face Detection/Behavior Recognition component in our implementation.

In addition, the Information Technology Standards Commission of Japan (ITSCJ), which includes KIT as a member, also plans to use EMMA as a data format for its own multimodal interaction architecture specification. ITSCJ believes EMMA is very useful both for uni-modal recognition components, e.g., ASR, and for multimodal integration components, e.g., speech with pointing gesture.
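As an illustration of how a recognition component might emit EMMA dynamically, here is a minimal Python sketch. It is not KIT's implementation; the function name asr_result_to_emma, the id value asr1 and the payload element are hypothetical, and only the standard EMMA 1.0 annotations emma:medium, emma:mode, emma:confidence and emma:tokens are assumed.

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"
ET.register_namespace("emma", EMMA_NS)  # serialize with the usual emma: prefix


def asr_result_to_emma(tokens, confidence, payload_tag, payload_text):
    """Wrap one recognition result in an EMMA document (hypothetical helper)."""
    root = ET.Element(f"{{{EMMA_NS}}}emma", {"version": "1.0"})
    interp = ET.SubElement(root, f"{{{EMMA_NS}}}interpretation", {
        "id": "asr1",  # illustrative identifier
        f"{{{EMMA_NS}}}medium": "acoustic",
        f"{{{EMMA_NS}}}mode": "voice",
        f"{{{EMMA_NS}}}confidence": f"{confidence:.2f}",
        f"{{{EMMA_NS}}}tokens": tokens,
    })
    # Application-specific semantics go inside the interpretation.
    ET.SubElement(interp, payload_tag).text = payload_text
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(asr_result_to_emma("hello robot", 0.87, "greeting", "hello"))
```

A face detection or behavior recognition component could reuse the same shape with different emma:medium and emma:mode values, which is what lets a downstream integrator treat all inputs uniformly.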

Loquendo, Italy

Loquendo is a strong believer in the considerable advantages that speech and multimodal standards can bring to speech markets, and continues to actively support their development and deployment. As a participating member of the W3C Multimodal Interaction working group, Loquendo welcomes the Extensible MultiModal Annotation (EMMA) 1.0 Candidate Recommendation.

EMMA 1.0 allows the creation of rich annotations for inputs of different modalities within a Multimodal Application. For instance, EMMA 1.0 is used as an annotation format for speech and DTMF input within Media Resource Control Protocol version 2 (MRCPv2). However, EMMA can also be used by gesture or pen modalities, and it offers interesting features to represent complex semantic information within an Interaction Manager.

Loquendo is very pleased to be able to contribute by submitting an EMMA 1.0 Implementation Report which covers the relevant features for an EMMA producer of voice and DTMF results. EMMA 1.0 results are already available for the Loquendo MRCP Server (a.k.a. Loquendo Speech Suite) to promote its quick adoption for the benefit of the speech market, especially for the integration of advanced speech technologies by means of the MRCPv2 protocol in present and future platforms, both in speech / DTMF contexts and, more generally, in Multimodal application contexts.

Loquendo is continuing to give its complete and wholehearted support to the work of the W3C Multimodal Interaction and Voice Browser working groups, as well as to the IETF and the VoiceXML Forum, as part of its continuing commitment and participation in the evolution of this and other standards.

Microsoft, USA

Microsoft is committed to incorporating support for natural input into its products. Input methods such as speech and handwriting expand the types of content that can be created and stored in digital form. The usefulness of such information relies on the ability to create accurate interpretations of the data and the ability to persist these interpretations in an openly accessible format. As a member of the W3C Multimodal Interaction working group, Microsoft supports the Extensible Multimodal Annotation (EMMA) 1.0 Candidate Recommendation as an effective approach to representing interpretations of natural input.

The proposed EMMA 1.0 standard provides a rich language for expressing interpretations produced by handwriting recognizers, speech recognizers and natural language understanding algorithms. A key concern for Microsoft as the company moves to support open file and data exchange formats is that the standards which are ultimately adopted allow for the accurate representation of existing data, typically represented in proprietary binary formats. The proposed EMMA standard, along with the related InkML standard, provides the desired compatibility with existing Microsoft formats. As such, Microsoft believes that the adoption of these specifications will make natural input data available to a broader range of clients, ultimately making such data more useful and valuable.

Nuance, USA

Nuance is pleased to see the W3C Extensible MultiModal Annotation markup language (EMMA) moving toward final Recommendation. Nuance has implemented EMMA in both prototype and commercial systems such as our high performance network-based speech recognition engine. We have found EMMA to be a richly expressive specification. Our experience suggests it is broadly applicable for representing user input from a variety of sources and describing the subsequent processing steps. We believe that EMMA has a promising future at Nuance, within our industry, and as a linchpin in the next generation of multimodal systems.

University of Trento, Italy

The Voice Multi Modal Application Framework developed at the University of Trento, Italy (UNITN), supports the Extensible MultiModal Annotation 1.0 (EMMA) standard. UNITN recognizes the crucial role of the EMMA standard in the creation and deployment of multimodal applications. We believe that EMMA covers a wide variety of innovative multimodal applications. The EMMA standard provides strong support for interoperable applications based on user input modalities such as speech, text, mouse clicks and pen gestures.

We believe that EMMA facilitates interoperability of different components (GUI and VUI), and enables effective logging and archiving of multimodal interactions.

InkML

See also the InkML 1.0 Implementation Report.

Microsoft Corporation, USA

Microsoft is committed to incorporating support for natural input into its products. Input methods such as ink handwriting and diagrams expand the types of content that can be created and stored in digital form. The usefulness of such information relies on the ability to create accurate interpretations of the data and the ability to persist these interpretations in an openly accessible format. As a member of the W3C Multimodal Interaction working group, Microsoft supports the Ink Markup Language (InkML) 1.0 Candidate Recommendation as an effective approach to representing interpretations of natural input.

The InkML 1.0 standard provides a rich language for expressing the archiving or streaming of digital ink content. A key concern for Microsoft as the company moves to support open file and data exchange formats is that the standards which are ultimately adopted allow for the accurate representation of existing data, typically represented in proprietary binary formats. The proposed InkML standard, along with the related EMMA standard, provides the desired compatibility with existing Microsoft formats. As such, Microsoft believes that the adoption of these specifications will make natural input data available to a broader range of clients, ultimately making such data more useful and valuable.

University of Western Ontario, Canada

The InkChat application provides a digital canvas that can be shared across platforms for remote collaboration. InkML provides the necessary platform-independent representation of digital ink. While the core language is simple, InkML is rich enough to support a broad range of devices, allowing high-fidelity capture and rendering.

The streaming features of InkML 1.0 allow InkChat to capture a written collaboration as it develops, record it, and play back an animation of the session, synchronized with the recorded conversation. The archival and annotation features allow our application to maintain a large collection of training samples to support our recognition engine.

InkML allows us to support multiple platforms and to future-proof our application.
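To give a sense of how small the core language is, the following Python sketch reads strokes from an InkML document. It is not InkChat code. It assumes the default trace format (X and Y channels) and explicit coordinate values only; the difference prefixes and custom trace formats that InkML also permits are not handled, and the sample ink and the function name read_traces are hypothetical.

```python
import xml.etree.ElementTree as ET

INKML_NS = "http://www.w3.org/2003/InkML"

# Hypothetical sample: two strokes in the default X Y trace format.
SAMPLE = """<ink xmlns="http://www.w3.org/2003/InkML">
  <trace>10 0, 9 14, 8 28, 7 42</trace>
  <trace>130 155, 144 159, 158 160</trace>
</ink>"""


def read_traces(inkml_xml):
    """Return one list of (x, y) tuples per <trace> element."""
    root = ET.fromstring(inkml_xml)
    strokes = []
    for trace in root.iter(f"{{{INKML_NS}}}trace"):
        points = []
        # Points are comma-separated; channel values are whitespace-separated.
        for point in (trace.text or "").split(","):
            coords = point.split()
            if coords:
                points.append(tuple(float(c) for c in coords))
        strokes.append(points)
    return strokes


if __name__ == "__main__":
    for index, stroke in enumerate(read_traces(SAMPLE)):
        print(f"stroke {index}: {len(stroke)} points, starts at {stroke[0]}")
```

Because each trace is an independent element, a streaming application can append traces to a document as they are drawn and replay them in order.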

Openstream Inc., USA

Openstream is committed to making interaction on the move natural and convenient through its Cue-me(tm) Context-aware Multimodal Platform and solutions, and continues to actively support the development and deployment of interoperable products and solutions that are based on open standards.

As a participating member of the W3C Multimodal Interaction working group, Openstream supports the Ink Markup Language (InkML) 1.0 Candidate Recommendation as an effective approach towards providing natural interaction with software applications.

InkML 1.0, together with the related EMMA standard, allows the creation of rich annotations for inputs of different modalities within a Multimodal Application, using an open data format in a platform-independent way.

Openstream is very pleased to be able to contribute to the development and submit an InkML 1.0 Implementation Report which covers the relevant features for natural interaction in multimodal application development. InkML annotations of images are already available through Openstream's Cue-me(tm) Context-aware Multimodal Platform and applications.

Openstream continues its commitment and participation in the evolution of this and other standards.

Hewlett-Packard Labs, India

We found the specification to be sufficiently rich in structure to allow the representation of digital ink data without loss of fidelity. This allows the specification to be used to support a range of applications, from forensic analysis and whiteboard sharing to handwriting recognition and dataset creation. The description and examples in the specification are easy to understand. The core set of elements required for representing digital ink for basic applications is small, allowing novice developers to get started rapidly and use the more advanced aspects only as and when required. The specification includes ways of compressing digital ink payloads for transmission, and simplifies cross-platform uses of digital ink. At HP Labs India, we have used InkML for various applications such as pen-based form filling (http://www.youtube.com/watch?v=A6o702CECGk) and cross-platform collaborative inking (http://www.youtube.com/watch?v=LwigOiZt6jM), in tools for handwriting annotation, and to define UPX (http://unipen.nici.kun.nl/upx/), a proposed XML standard for online handwriting datasets.

EmotionML

See also the EmotionML Implementation Report.

Alexandre Denis, LORIA laboratory, SYNALP team, University of Lorraine, France

The LORIA/SYNALP implementation of EmotionML is a standalone Java library developed in the context of the ITEA2 Empathic Products project (11005) by the LORIA/SYNALP team. It imports Java objects from EmotionML XML files and exports them back to EmotionML. It guarantees standard compliance by performing a two-step validation after all export operations and before all import operations: first the document is validated against the EmotionML schema, then all EmotionML assertions are tested. If either step fails, an error message is produced and the document cannot be imported or exported. The library contains a corpus of badly formatted EmotionML files that makes it possible to check that both the schema and the assertions correctly reject them. The API is hosted on Google Code and is released under the MIT License.
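The two-step idea can be sketched in a few lines of Python. This is not the LORIA/SYNALP library, which is written in Java, and it does not use its API. Step 1 below is only a simplified structural stand-in for real schema validation, and step 2 tests a single EmotionML assertion (a category may only be used when a category-set is declared on the enclosing emotion or on the root). All function names and sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET

EMO_NS = "http://www.w3.org/2009/10/emotionml"


def check_structure(root):
    """Step 1: simplified stand-in for schema validation (assumption)."""
    errors = []
    if root.tag != f"{{{EMO_NS}}}emotionml":
        errors.append("root element is not <emotionml> in the EmotionML namespace")
    for category in root.iter(f"{{{EMO_NS}}}category"):
        if category.get("name") is None:
            errors.append("<category> lacks the required 'name' attribute")
    return errors


def check_assertions(root):
    """Step 2: one example assertion, category-set must be declared."""
    errors = []
    root_set = root.get("category-set")
    for emotion in root.iter(f"{{{EMO_NS}}}emotion"):
        has_category = emotion.find(f"{{{EMO_NS}}}category") is not None
        if has_category and not (emotion.get("category-set") or root_set):
            errors.append("<category> used without a declared category-set")
    return errors


def validate(xml_text):
    """Run both steps in order; assertions are tested only if step 1 passes."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed: {exc}"]
    errors = check_structure(root)
    if errors:
        return errors
    return check_assertions(root)


# Hypothetical sample documents, in the spirit of a corpus of good and bad files.
GOOD = """<emotionml version="1.0" xmlns="http://www.w3.org/2009/10/emotionml"
    category-set="http://www.w3.org/TR/emotion-voc/xml#big6">
  <emotion><category name="happiness"/></emotion>
</emotionml>"""

BAD = """<emotionml version="1.0" xmlns="http://www.w3.org/2009/10/emotionml">
  <emotion><category name="happiness"/></emotion>
</emotionml>"""

if __name__ == "__main__":
    print(validate(GOOD))
    print(validate(BAD))
```

Running the structural step first keeps the assertion checks simple, since they can assume a document that already has the expected shape.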

Gerhard Fobe, EmotionML library for C#, Chemnitz University of Technology, Germany

The EmotionML library for C# is a standalone open source library for EmotionML written in C#. With the help of this library you can parse and create EmotionML. Besides standalone EmotionML documents, the plug-in version for the inclusion of EmotionML in other languages is supported. With the help of the integrated EmotionML parser it is possible to create related object instances automatically. Because it validates against the EmotionML schemata, the library ensures that its output is valid EmotionML.
The library was created as part of the master's thesis of Gerhard Fobe, which used the plug-in version of EmotionML in XMPP with some RDF embedded in the EmotionML.
The EmotionML library for C# is hosted on GitHub and is released under the FreeBSD License.

Deutsche Telekom

Telekom Innovation Laboratories (Deutsche Telekom R&D) welcomes a standard format in its developments concerning emotion processing services. Because we work primarily as a system integrator, standardized interfaces are extremely important for us to plug together components from different providers. We have implemented a workbench (named "Speechalyzer") that makes it possible to manually annotate very large quantities of audio data with emotional categories in a fast and effective manner, train a statistical classifier and evaluate the performance. EmotionML is used as the import and export format. The Speechalyzer is available as open source from https://github.com/dtag-dbu/speechalyzer under the LGPL license.
