Minutes of the MMI workshop
Day 1 (19 July 2004), morning

This document records the question-and-answer sessions that followed the presentations on the morning of day 1.

Philipp Hoschka (W3C) — Overview of W3C

Philipp is the lead for the “Interaction” domain of W3C, under which the MMI (MultiModal Interaction) working group falls. He explains the W3C and the position of MMI within it. See slides.

Deborah Dahl — The MMI working group

Debbie Dahl is chair of the W3C working group on MMI. She explains what the MMI working group is doing and the models and specifications it is working on. The overall model is called the “MMI framework.” The working group has published documents on requirements and frameworks, as well as specifications of formats, among them EMMA and InkML. See slides.

Q: Who is EMMA for, content developers?

A: The “browser” developer, not the content developer. Possibly also the developer of the speech recognizer.

Q: It is also for the modality developer: the developer of the integration component integrates the EMMA inputs from several modalities.

A: Correct.

Q: What does a user write? (Where “user” means the content developer.)

A: In the speech case, e.g., this user writes SRGS. They don't need to know (much) about EMMA; they just need to define the grammar and semantic tags. The speech recognizer generates the EMMA.

Q: Is there a model that describes the relation between the speech grammar and EMMA?

A: The grammar writer needs to annotate the grammar with semantics. There is no single, formal data model.
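
To make the division of labor concrete, here is a minimal, purely illustrative sketch (Python, standard library only, not from the talk) of the kind of EMMA result a speech recognizer might emit for the utterance “from Paris to London.” The element and attribute names (emma:interpretation, emma:confidence, emma:tokens) follow the EMMA drafts; the origin/destination payload is an invented application vocabulary that would come from the semantic tags in the grammar.

    # Illustrative sketch only: the kind of EMMA document a recognizer
    # might emit for "from Paris to London". Element and attribute names
    # follow the EMMA drafts; origin/destination is a made-up vocabulary.
    import xml.etree.ElementTree as ET

    EMMA_NS = "http://www.w3.org/2003/04/emma"   # namespace used in the drafts
    ET.register_namespace("emma", EMMA_NS)

    def emma(tag):
        return "{%s}%s" % (EMMA_NS, tag)

    root = ET.Element(emma("emma"))
    interp = ET.SubElement(root, emma("interpretation"), {
        "id": "int1",
        emma("confidence"): "0.87",               # recognizer confidence
        emma("tokens"): "from paris to london",   # recognized words
    })
    # Application semantics, produced by the semantic tags in the SRGS grammar
    ET.SubElement(interp, "origin").text = "Paris"
    ET.SubElement(interp, "destination").text = "London"

    print(ET.tostring(root, encoding="unicode"))

The integration component mentioned above would consume documents like this from each modality.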

Q: You talk about input; is there a model for the output side as well?

A: We have a picture that shows the various existing formats (SMIL, HTML, SVG, etc.), but it seems that input is where most of the work needs to be done.

Q: Are there plans for supporting EMMA in browsers?

A: We know several people are considering it. The IETF is working on speech recognition and is considering using EMMA. But these are still early ideas.

Q: EMMA is good for a distributed environment. If everything is in a single program, you don't need EMMA.

A (Dave): At Canon, where I work, we are looking at EMMA not as a document, but as internal data structures.

Franck Panaget (France Telecom) — A telecom perspective

Franck presents a talk, prepared by Keith Waters, on the viewpoint of France Telecom and the opportunities offered by agents and multimodal I/O for better adaptation to the user. “Sensors” provide input about the status of the user, such as location. See paper and slides.

Q: Do you want to use the DOM? A hierarchy of sensors? What if a user has several devices?

A: The hierarchy is for the event propagation; the sensors are in a flat space.

Q: This seems to call for a sort of user profile, which says which device is the current main one.

Q: I don't understand the role of the DOM exactly. What programming language does an application use?

A: An application can register for some event or for a change of some property. The DOM doesn't depend on a specific language. The communication isn't specified by the framework; it will probably be some sort of local communication, but it may also be remote access. The specification gives the functionality, not the details of the implementation.
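
As an illustration of the registration pattern described in this answer, here is a small, purely hypothetical sketch in Python. All names (SensorNode, add_event_listener, "propertychange") are invented for this example; the framework itself does not prescribe an API or a programming language.

    # Hypothetical sketch: an application registers for a property change
    # on a sensor through a DOM-like interface. Only the registration and
    # notification pattern matters; the names are invented.
    class SensorNode:
        def __init__(self, name):
            self.name = name
            self._listeners = []
            self._properties = {}

        def add_event_listener(self, event_type, callback):
            # register interest in an event type, e.g. "propertychange"
            self._listeners.append((event_type, callback))

        def set_property(self, key, value):
            self._properties[key] = value
            for event_type, callback in self._listeners:
                if event_type == "propertychange":
                    callback(self, key, value)   # notify the application

    def on_location_change(sensor, key, value):
        print("sensor %s: %s is now %s" % (sensor.name, key, value))

    location = SensorNode("location")
    location.add_event_listener("propertychange", on_location_change)
    location.set_property("city", "Paris")   # the application is notified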

Q: There are legal issues for location. In some countries the information may not be used. What is France Telecom's position?

A: Only if the user agrees to give out the information. We don't see a legal issue currently.

Akos Vetek (Nokia) — Nokia's perspective

Akos presents the perspective from mobile devices, which, due to their increasing power and decreasing size, are now used for many more things than just voice communication. Designing usable interfaces is a challenge. A standard authoring environment could be part of possible future work; it could be a multi-modal mark-up language or some combination of existing languages. A format for pen gestures is also still missing. There is a division of labor between W3C (authoring standards) and OMA (mobile-related issues). See paper and slides.

Q: About gestures: do you have a common set, or is it application dependent? Any usability studies?

A: We don't have any studies, but we think it is possible to define a common set, re-using some ideas from recognition grammars.

Q: Nokia doesn't have pen input, does it?

A: Yes, we do, e.g., for the Chinese market.

Q: Change of environment?

A: When you're moving, speech may become impossible or you may lose network.

Q: Events?

A: Probably not a W3C topic; more a mobile issue. The current W3C DOM is not enough; it needs mobile extensions. This is probably work for OMA.

Q: X+V?

A: That's one way. We are looking at whether we need it. Does it scale up to other modalities? What is easy for developers?

Q (Univ. of Vienna): Does a broker mechanism need to be standardized? At what level?

A: Yes, as part of a profile. Partly in UAProf (which is RDF-based).

[Coffee break]

Jürgen Sienel (Alcatel) — Next-generation networks

The environment, device, network, location, the task at hand, and the user's preferences influence the selection of the most appropriate interaction. There are different approaches: a device with an integrated multimodal browser, separate applications, or separate devices. Approaches may also differ between server-based and client-based systems. There are many opportunities for standardization. See paper and slides.

Q: What binding is missing, exactly?

A: E.g., plug-in components. Web Services (SOAP, etc.) is a possible approach. Maybe not a missing “standard,” but we need some sort of service description.

Q (Max Froumentin, W3C): Client vs. server: is this likely to change in the future, since clients are getting more powerful? On the other hand, so are networks…

A: New devices will appear. Your fridge may not have a speech recognizer itself, but it may use some other device that happens to be in the house.

Q: How much multimodality do you need, from the point of view of the user?

A: We haven't investigated how much of it users will actually use.

Q (Philipp Hoschka, W3C): You mentioned modality-independent applications; how does that work?

A: We think it is needed, but we have no proposals yet for how to do it.

Q (Johnston, AT&T): There is so much hand-tailoring; that is not likely to change for a long time.

A: Agree, but it's something to aim for.

Rainer Simon (FTW, Univ. Vienna) — The MONA Project

Rainer talks about multimodality and device independence and the goals of the MONA project: a specification for a multimodal user-interface description language, using XML, and a prototype implementation. The UI description contains tasks, widgets for speech, graphics, etc., as well as the actual content. The method for authoring MONA needs to be adapted to the designers, who often use use cases, scenarios and sketches on paper. See paper and slides.

Q: Is it an abstract or a general description? Also, your example in the task units is different for voice and graphics: some information is not given to the user in one example.

A: It is abstract, with hints for various devices. The difference is indeed there, but the user can additionally ask for help.

Q (Stéphane Boyera, W3C): Did you look at Content Selection from the DI working group?

A: There is currently no content selection; we have some preferences that the author can specify. We are looking at more.

Q (Stéphane Boyera, W3C): The author wants to specify layouts for different device characteristics, how does that work in MONA?

A: That is currently not specifiable. There is a flexible layout. We are not sure that the number of devices isn't too great for the author to specify the layout for each of them.

Q (Stéphane Boyera, W3C): How do you specify device characteristics, using some standard?

A: No, our own experimental system, mostly based on “browser sniffing.” CC/PP is the way for the future, though.

Q (Michael Johnston, AT&T): Any examples of speech only?

A: Currently we make everything graphical, or graphical plus speech.

Q: Is it abstract or concrete? You seem to have both. What is the logic behind the hierarchy of the format?

A: Yes, it is a mixture: a concrete model with an abstract overlay. That's why we have the five levels in our UI description. Typically, you start with concrete elements and work your way up to more task-oriented structures.

Q: According to the designers you worked with, how do the graphics change when you add speech?

A: The graphics didn't actually change that much. The user can take more or different steps.

[Lunch break]

Bert Bos
Created 19 July 2004
Last updated 20 July 2004, 17:10:25 GMT