Workshop meeting notes taken by Dave Raggett

Q&A

* Nuance - Steve Ehrlich (slides)

Mike Robin - asks whether WAP makes use of networked speech. Nuance is less interested in this approach than the others.

* Digital Channel Partners - Daniel Appelquist (no slides)

What kinds of multimodal services do informational and transactional service providers want to provide? Platforms such as WAP, Voice, DTV, Desktop etc. Talks about how content is generated, starting with a database and flowing into different channels, being tailored as appropriate to each platform.

What are the issues? Contextual clues are needed for each modality. An authoring step is required to provide these.

Andrew Scott - how do you see the markup language choice affecting this? Daniel replies that he is more interested in working back from the applications to understand the needs.

Do you want one language for all modalities? No in the short term; sending info for all modalities would waste bandwidth.

Stephane Maes - you can transform to distill down to what is needed for delivery, hence conserving bandwidth (see the transform sketch just before the 10am session below).

* Philips - Eric Hsi (slides)

Trends. Multi-modal content adaptation for universal access. Would like discussion on the following points:
- Do we need a new standard or are existing ones sufficient?
- Synchronization across modalities: what level of granularity is needed?
- Separating presentation from content is hard if not impossible.

* SEC, Sachiko Yoshihama (slides)

SEC is a software vendor from Japan with WAP related products. HTML is more popular than HDML/WML in Japan: 9.8 million users for iMode vs 3 million for EZWeb, which is based on HDML and WML 1.0. More content, tools, knowledge. SEC encourages convergence of WAP and W3C. SEC thinks convergence of WML and VoiceXML makes sense, since speech is a natural usage of cell phones and complements their weak input capabilities.

Jim Larson - asks if HTML is more popular than WAP because it appeared earlier? Yes, this is one reason.

Charles McCathieNevile - asks if you think content providers will be comfortable with authoring in WML and HTML.

Scott McGlashan - are there services which are only available in iMode and not via WAP?

Sachiko - notes that WML has context management features which are missing in Compact HTML.

Andrew Scott - thinks the data on relative popularity doesn't show the full picture.

* Conversa, Mike Robin (slides)

Inventing new languages is generally the wrong thing to do. Conversa focusses on adding voice features into the client platform; in other words, powerful clients.

Multimodal sits between the two extremes:

  IVR (voice + DTMF) <--------|--------> WML/HTML (graphics + clicks)

Talks about architectural choices for distributing work between client and server. One issue is security: the need for nested security contexts and a trust network. Another is recovering from connection failures.

Ted Wugofski - cell phone with an intermittent connection: does Mike think consumers will tolerate a reduced vocabulary when the phone can't reach the server (e.g. via Aurora)?

Dave Bevis - asks Mike to expand on the security problems. Mike - applications crossing many servers.
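As an aside on Stephane Maes's point about distilling content down per channel: the usual approach is one transform per delivery platform over a shared source format. The sketch below is purely illustrative; the source vocabulary (flight, number, departs, status) is made up for the example, and a real service would pair this WML stylesheet with siblings producing VoiceXML, HTML, etc.

  <?xml version="1.0"?>
  <!-- Illustrative only: distill a shared content record (hypothetical
       <flight> vocabulary) down to the few fields a small WML display
       can show. A sibling stylesheet would spell the same record out
       in full sentences for a VoiceXML prompt. -->
  <xsl:stylesheet version="1.0"
                  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml"/>
    <xsl:template match="/flight">
      <wml>
        <card id="status" title="Flight">
          <p>
            <xsl:value-of select="number"/>
            <xsl:text> dep </xsl:text>
            <xsl:value-of select="departs/@time"/>
            <xsl:text>: </xsl:text>
            <xsl:value-of select="status"/>
          </p>
        </card>
      </wml>
    </xsl:template>
  </xsl:stylesheet>

Applied to a record such as <flight><number>CX250</number><departs time="23:55"/><status>On time</status></flight> this yields a single terse WML card, while the voice stylesheet for the same record would expand abbreviations into full words for the prompt.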
-----

10am Brainstorming Session

Jim lists issues identified from the position papers and asks for additional issues:
- Integration of WML and VoiceXML
- Single authoring for multiple modalities (multiple use)
- re-use of existing content
- dialog authoring language
- dialog differences and synchronization
- Multiple authoring
- Authoring systems
- Architecture Convergence of GUI markup and VoiceXML
- variations of how to deal with ASR
- difficulties in separating content from presentation
- Ergonomics of switching between listening and watching (moving the phone from ear to eye); end user oriented issues
- feature bloat for a combined language
- Multi-modal interaction issues
- synchronization issues
- dialog management
- events
- Push/Pull issues
- Contextual cues specific to particular modalities
- knowing where you are and where you can go
- Semantic Binding (where and how)
- Other modes, e.g. video, handwriting ...

Jim labels the groups and invites people to write their initials against the following set of topics:

1. Synchronization and Multimodal interaction/dialog management
2. Ergonomic issues
3. Push-pull issues
4. Contextual clues
5. Multiple authoring
6. Authoring systems

We break for 30 mins.

* HP Labs, Marianne Hickey (slides)

Marianne talks about the W3C Voice Browser multi-modal dialog requirements (in her role as leader of this work in the Voice Browser working group). Multi-modal: 3 main approaches:
- modes are used in parallel and you choose whether to use speech or keypad at any point
- complementary use of modalities, with different info presented via different modalities
- coordinated input from multiple modalities is seen as lower priority (an area for future study)

Jim Larson: how does the multimodal language you have defined so far relate to WML; what is the relationship? Marianne says her examples place the voice dialog in control, with WML or other GUI markup in a subservient role. Her example uses HTML.

* PipeBeach, Scott McGlashan (slides)

Scott's presentation covers the following topics:
- voice browser dialog requirements
- work on transcoding WML to VoiceXML
- language integration issues

W3C is basing its work on dialog markup on the VoiceXML submission from the VoiceXML Forum (May 2000). Transcoding arbitrary HTML into VoiceXML is practically impossible; transcoding WML into VoiceXML is more tractable. WML apps often use abbreviations on account of the small display size. Another issue is support for free text input in WML, something that is problematic for VoiceXML. Rather than going directly for an integrated approach, Scott is interested in extending WML to support speech interaction.

Charles: why not start from a service description? Scott: starting from the service data is much the easiest, but it doesn't address existing content.

Charles: what about XForms? Scott: the Voice Browser WG is looking at this, but we anticipate it affecting later versions of the dialog markup language, since XForms is still at a very early stage.

* Telstra, Andrew Scott (slides)

Developers shouldn't be burdened; move the burden back up the chain. (What does he mean by this?) Authors need to control presentation for different modalities. Andrew cites "Wednesday" as an example: "Wed" on WAP, and pronounced as "wendsday" when spoken. A simple mechanism is needed for authors to express alternatives for different media/platforms (see the sketch below).

Daniel: authors are unwilling to give the necessary info for modalities other than the one they are focussing on right now.
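A hypothetical illustration of the kind of mechanism Andrew is asking for; the <when> element and its media attribute are invented for this sketch and are not part of WML, VoiceXML or any current specification:

  <!-- hypothetical markup: one source item, per-modality renderings -->
  <when>
    <alt media="wap">Wed 9am</alt>
    <alt media="voice">Wednesday at nine a.m.</alt>
    <alt media="html">Wednesday, 9:00 am</alt>
  </when>

A renderer would pick the alternative that matches its delivery channel, falling back to a default when none matches.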
Dave Raggett - asks what Andrew meant by "chain". Ans: users -> content developers -> infrastructure providers, where up is to the right.

Lunch Break. Summer Palace took 1 hour 25 minutes!

* NTT DoCoMo, Kazunari Kubota (slides?)

* IBM, Stephane Maes (slides)

* NEC, Naoko Ito, "XML document navigation language" (slides)

XDNL defines a document flow but not a full dialog model. It uses the same syntax as XSLT for ease of understanding. Ted: asks how XLink is being used to break things up. The example shows every two titles in a list of titles being shown as a separate "document"; this takes advantage of the for-each mechanism and the counter-size attribute.

2:30 Brainstorming on single authoring (1 hour)

* PipeBeach, Scott McGlashan (slides)

Scott talks about the convergence of WAP and VoiceXML architectures: a VoiceXML browser in the network and a WML browser on the mobile device, where the browsers synchronize via a control document. Push notifications are used for synchronization; this requires WML 1.2 with a push gateway.

Pros: independent browsers, no change to the markup languages; reusable standalone; simple to create.
Cons: no tight synchronization on local transactions, timing ...; concurrent voice and data requires GPRS/3G; completely separate services/content.

* Motorola, David Pearce, "DSR ETSI/STQ-Aurora"

Distributed speech recognition. Current ETSI spec (Feb 2000) for the Mel-Cepstrum front end. Ongoing work on an advanced front end to halve the error rate in the presence of noise. DSR works particularly well at weak signal strengths when compared to server-based speech recognition using GSM to convey the audio.

Merge of WML (HTML) and VoiceXML. Complete control in the terminal. Parallel for voice and visual. Thin clients: all processing handed off to the server. Fat clients do all the work locally. Intermediate architectures are also possible. ETSI is keen to promote the merged language approach, with speech handled as data and transferred in parallel with markup.

Stephane: the voice channel is cheaper than the data channel, bit for bit. Alastair: in volume, the data channel is cheap.

Reports from break-out sessions:

Charles McCathieNevile - Integration of WAP/VoiceXML (html)

Dave Raggett - Single Authoring (text)
Note: some authors don't care about modalities, and the authoring system should be able to fill in for the roles of the graphic designer etc. for the different modalities.

??? - Architecture Convergence (slides)

Volker Steinbiss - Dialog management, Synchronization, and Multimodal interaction issues (html or word)
Question about SMIL as a good starting point. SMIL doesn't really deal with dialog. Another question ...
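On the SMIL question: SMIL 1.0 covers the temporal side of synchronization, e.g. playing an audio prompt in parallel with a graphic, but it has no notion of fields, grammars or dialog turns. A minimal illustration (file names are placeholders):

  <smil>
    <body>
      <par>
        <!-- the prompt and the image are presented at the same time -->
        <audio src="prompt.wav"/>
        <img src="map.png" dur="10s"/>
      </par>
    </body>
  </smil>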
Jim Larson - Ergonomic and End user oriented issues

1. When is multimedia really useful versus monomedia? Handsets are not ideal for multimedia. Handsfree or distant microphone operation would be helpful.
2. Social interaction issues:
   - Privacy
     * Eavesdroppers
     * Appearing stupid in front of other people
   - No cell phones allowed in theatres
   - Not structured for a social role
3. Voice is ephemeral (it doesn't hang around); a record of what was said could be useful.
4. Is the device a phone, a handheld, or a new thing? What is the migration path to this new thing?
5. Input device problems:
   - buttons hard to use with big fingers
   - device is inadvertently turned on
   - thumb fatigue
6. Solutions:
   - encourage research on when multimodal interfaces are really useful
   - research on minimizing the number of user actions/gestures needed to achieve particular input actions
   - involve social scientists to look at what people like to do and don't like to do, etc., to get a better feel for appropriate usage models and constraints
   - design and implement social protocols for privacy and social interaction
   - design and implement a persistent visual channel (kind of like a short term memory) as an aide-memoire
   - capture and publish use-cases for multimodal devices/apps
   - a roadmap for getting there from here
   - ability to keep going, or at least to suspend and recover, when access to the server is temporarily removed
   - cultural sensitivities to the use of color etc.

Volker: you may need to change how you interact, e.g. when you walk into or out of a meeting, as this may affect whether or not you want to use aural interaction.

Dave Bevis - Authoring and rendering systems; semantic binding (slides)

What is an example where the synchronization points would change depending on the user preferences/profile? Ans: someone with cerebral palsy would take much longer to respond (Charles). Another is when you are driving. The profile might need to take the transport mechanisms into account, so that applications can take these into consideration.