This is the agenda of the W3C Workshop on Smart Voice Agents.

The workshop is free. To help us prepare and run the event effectively, please see the Call for Participation.
There will be three half-day sessions on February 25, 26, and 27, 2026.

Agenda Overview

  1. Session 1: February 25, 2026, 12:00 GMT-5
  2. Session 2: February 26, 2026, 10:00 GMT+1
  3. Session 3: February 27, 2026, 12:00 GMT-5

Session 1: February 25, 2026, 12:00 GMT-5

Start times in various timezones:

Agenda

1. Scene setting (10min)
Goals and expectations for the workshop
2. Governance and Greenlights: Leveraging the '3 Ps' to Standardize Trust, Scale, and Usability in Voice Agent Web Integration (20min)
Talk: Patricia Lee (10min)
Bio: Growing AI Engineer with IT Executive Management experience
Abstract: The expansion of Voice Agent technologies across diverse platforms, from smart speakers to mobile browsers, is currently blocked by issues in usability, security, and system interoperability. This presentation will leverage an extensive background in Product, Program, and Project Management (the '3 Ps') combined with Lean Six Sigma expertise to present a framework for driving effective Web standardization. The talk will focus on identifying the critical stakeholder needs that must anchor any standardization effort, moving beyond technical specifications to address real-world governance, risk, and compliance (GRC) challenges. It will explore two major barriers: (1) establishing a cyber-resilient and compliant trust model that satisfies both user privacy concerns and evolving regulatory requirements, and (2) applying process optimization principles to define measurable requirements for supporting various dependencies and capabilities. This data-driven, methodological approach provides the necessary structure to clarify reasonable applications for Voice Agents and deliver the technical clarity needed for developers, standards bodies, and regulators to align and move the Web Voice Agent ecosystem toward trusted, scalable adoption.
Discussion & questions (10min)
3. Solving Lead vs. Lead: Consistent Pronunciation for Web Content (20min)
Talk: Sarah Wood (10min)
Bio: Expertise in read-aloud/TTS, dyslexia, and standards development
Abstract: Pronunciation in web content suffers from the lack of a standardized way to specify Speech Synthesis Markup Language (SSML) in HTML. A standard would benefit assistive technologies, voice agents, and AI systems learning from web content. We present a recent solution from the EdTech standards community and seek discussion of approaches to more broadly ensure reliable, interoperable pronunciation for inclusive, accessible voice experiences across the web.
Discussion & questions (10min)
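To make the title's ambiguity concrete, the following minimal sketch builds an SSML fragment that distinguishes the two pronunciations of "lead" with the standard <phoneme> element; the helper functions and the idea of attaching such markup to HTML via a data-ssml attribute are assumptions for illustration, not the EdTech-community solution the talk describes (TypeScript):

  // Minimal sketch: disambiguating "lead" (the metal) from "lead" (to guide)
  // with SSML <phoneme> tags. The helpers and the idea of storing the markup
  // in a data-ssml attribute are illustrative assumptions, not the
  // EdTech-community solution the talk describes.
  // IPA pronunciations: /lɛd/ (the metal) vs. /liːd/ (to guide).
  function phoneme(word: string, ipa: string): string {
    return `<phoneme alphabet="ipa" ph="${ipa}">${word}</phoneme>`;
  }

  function buildSsml(text: string): string {
    return `<speak>${text}</speak>`;
  }

  const sentence = buildSsml(
    `The plumber replaced the ${phoneme("lead", "lɛd")} pipe, ` +
    `then asked who would ${phoneme("lead", "liːd")} the inspection.`
  );

  // An SSML-capable TTS engine (or a read-aloud tool consuming a data-ssml
  // attribute) could render the two words differently from this markup.
  console.log(sentence);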
4. Hallucination in Automatic Speech Recognition Systems (20min)
Talk: Bhiksha Raj (10min)
Bio: Speech recognition and related areas
Abstract: We discuss the problem of hallucination in ASR systems, including its definition, description, quantification, and mitigation.
Discussion & questions (10min)
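Because the abstract mentions quantification, here is a minimal sketch of a common starting point: aligning an ASR hypothesis against a reference transcript and counting substitutions, deletions, and insertions, with hallucinated content typically surfacing as insertions. This is standard word-error-rate bookkeeping, not a metric proposed by the speaker (TypeScript):

  // Minimal sketch: word-level alignment of an ASR hypothesis against a
  // reference, counting substitutions (sub), deletions (del) and insertions
  // (ins). Hallucinated content typically surfaces as insertions or long
  // substituted spans; this is standard WER bookkeeping only.
  type Counts = { sub: number; del: number; ins: number };

  function alignCounts(ref: string[], hyp: string[]): Counts {
    const rows = ref.length + 1, cols = hyp.length + 1;
    // cost[i][j] = minimal edit cost aligning ref[0..i) with hyp[0..j)
    const cost: number[][] = Array.from({ length: rows }, () => new Array(cols).fill(0));
    for (let i = 1; i < rows; i++) cost[i][0] = i;
    for (let j = 1; j < cols; j++) cost[0][j] = j;
    for (let i = 1; i < rows; i++) {
      for (let j = 1; j < cols; j++) {
        const diag = cost[i - 1][j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
        cost[i][j] = Math.min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1);
      }
    }
    // Backtrace and classify each edit.
    const counts: Counts = { sub: 0, del: 0, ins: 0 };
    let i = ref.length, j = hyp.length;
    while (i > 0 || j > 0) {
      if (i > 0 && j > 0 &&
          cost[i][j] === cost[i - 1][j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1)) {
        if (ref[i - 1] !== hyp[j - 1]) counts.sub++;
        i--; j--;
      } else if (j > 0 && cost[i][j] === cost[i][j - 1] + 1) {
        counts.ins++; j--;
      } else {
        counts.del++; i--;
      }
    }
    return counts;
  }

  const reference = "please call the clinic tomorrow".split(" ");
  const hypothesis = "please call the cardiology clinic tomorrow morning".split(" ");
  console.log(alignCounts(reference, hypothesis)); // { sub: 0, del: 0, ins: 2 }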
5. Multi-Agent Conversational Methodology (25min)
Talk: Emmett Coin (10min)
Bio: https://www.linkedin.com/in/emmettcoin/
Abstract: As part of the Open Floor Protocol (OFP) standards group with the Linux Foundation, I am working on a common protocol for multi-agent conversational systems. We have demonstrated multi-agent coordination with a human participant and are working toward multi-agent and human conversational support that is fully collaborative across all participants. I will show how OFP works and why it is important.
Demo (5min)
Discussion & questions (10min)
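To give a rough feel for what a common multi-agent conversational protocol involves, the sketch below routes a conversation envelope between a human participant and several agents; the field names and the tiny convener are assumptions for illustration and do not reproduce the actual Open Floor Protocol schema (TypeScript):

  // Illustrative only: a generic "conversation envelope" that a convener
  // could route between a human participant and several agents. Field names
  // are assumptions for this sketch, not the actual Open Floor Protocol schema.
  interface Envelope {
    conversationId: string;
    sender: string;                        // e.g. "user", "weather-agent"
    to?: string;                           // omitted = broadcast to the floor
    events: Array<
      | { kind: "utterance"; text: string }
      | { kind: "invite"; agent: string }  // ask another agent to join
      | { kind: "bye" }                    // leave the conversation
    >;
  }

  type Handler = (e: Envelope) => void;

  // A trivial convener: fan each envelope out to every other participant.
  class Floor {
    private participants = new Map<string, Handler>();
    join(name: string, handler: Handler) { this.participants.set(name, handler); }
    post(envelope: Envelope) {
      for (const [name, handler] of this.participants) {
        if (name !== envelope.sender && (!envelope.to || envelope.to === name)) {
          handler(envelope);
        }
      }
    }
  }

  const floor = new Floor();
  floor.join("weather-agent", (e) => console.log("weather-agent received:", e.events));
  floor.join("calendar-agent", (e) => console.log("calendar-agent received:", e.events));
  floor.post({
    conversationId: "c-1",
    sender: "user",
    events: [{ kind: "utterance", text: "Can I cycle to my 3pm meeting?" }],
  });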
6. Reimagining Standards for Voice AI: Interoperability Without Sacrificing Innovation (20min)
Talk: RJ Burnham (10min)
Bio: RJ Burnham has spent more than two decades shaping the evolution of voice and conversational technology. He was a member of the W3C Voice Browser Working Group and later chaired the Call Control Working Group, contributing directly to the creation of VoiceXML and CCXML. As CTO at Voxeo, he helped drive the adoption of open standards and large-scale voice platforms. His later work in computer vision and neural networks gave him an early view into the foundations that eventually became modern large language models, which pulled him back into voice as President and CTO of Plum, where he helped build one of the first generative AI voice platforms. RJ is now the founder of Consig AI, where he’s applying the latest generation of voice AI and decades of experience to solve real problems in healthcare communication.
Abstract: Proprietary voice AI platforms emerged largely because the industry moved far beyond the directed-dialog world where VoiceXML excelled. As we shifted from structured dialogs to intent-based systems and now to LLM-driven agents, vendors built closed stacks to move quickly, but the result is fragmentation and lock-in. With voice systems becoming more capable and more central to real workflows, it’s worth asking whether the pendulum should swing back. What standards, if any, make sense in this new landscape? Can we design a shared foundation that restores the interoperability and portability we once had, without slowing innovation? This talk explores whether the time is right for a new generation of voice AI standards and what principles should guide them.
Discussion & questions (10min)
Break (10min)
7. Introduction to Breakout Groups (10min)
Work mode, goals, and expectations for the breakout groups
8. Breakout Groups (45min)
Working in breakout groups
9. Breakout Group Results (50min)
Presentation of breakout group results (10min per group)
10. Wrap-up (10min)
Summary and conclusion

Session 2: February 26, 2026, 10:00 GMT+1

Start times in various timezones:

Agenda

1. Scene setting (10min)
Goals and expectations for the workshop
2. Towards Smarter Voice Interfaces: Using Grounding and Knowledge (20min)
Talk: Kristiina Jokinen (10min)
Bio: https://blogs.helsinki.fi/kjokinen/
Abstract: In this talk I will discuss the design and development of trustworthy GenAI-based applications, and in particular, focus on grounding (i.e. anchoring conversations, perceptions, and knowledge in a shared context) as a key principle of spoken interaction management. I will review challenges, opportunities, and lessons learnt in creating accountable reasoning and fluent natural dialogue between Smart AI agents and users, to support long-term interactions and trust for responsible and safe AI agents.
Discussion & questions (10min)
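As a small, purely illustrative sketch of grounding in the sense used above, the code below keeps a common-ground store in which facts count as shared context only after explicit confirmation; it is a toy example, not the speaker's architecture (TypeScript):

  // Toy sketch of a common-ground store: an interpretation becomes part of
  // the shared context only after it has been confirmed, and the agent checks
  // the store before committing to an action. Illustrative only.
  interface GroundedFact { key: string; value: string; confirmed: boolean }

  class CommonGround {
    private facts = new Map<string, GroundedFact>();
    propose(key: string, value: string) {
      this.facts.set(key, { key, value, confirmed: false });
    }
    confirm(key: string) {
      const f = this.facts.get(key);
      if (f) f.confirmed = true;
    }
    get(key: string): string | undefined {
      const f = this.facts.get(key);
      return f && f.confirmed ? f.value : undefined;
    }
  }

  const ground = new CommonGround();
  ground.propose("destination", "Helsinki airport");  // system's interpretation
  // System: "To Helsinki airport, is that right?"  User: "Yes."
  ground.confirm("destination");

  const destination = ground.get("destination");
  console.log(destination
    ? `Booking a taxi to ${destination}.`
    : "I still need to confirm the destination before booking.");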
3. Accessibility of 3D and Immersive Content via Voice Interaction (25min)
Talk: Zohar Gan (10min)
Bio: Solutions architect, developer, and accessibility researcher. Presented at the W3C Workshop on Inclusive Design for Immersive Web Standards in 2019 and, as a follow-up, later published the MIT-licensed semantic-xr GitHub repository. Recently built a web-based voice-powered prototype demonstrating 3D content accessibility for people with disabilities.
Abstract: Focused on people with disabilities, assistive-tech developers, and content creators, this talk envisions more voice-accessible 3D web content (videos, XR, etc.) using semantic 3D metadata. Beyond AI-only descriptions, rich metadata such as names, hierarchy, hidden elements, and instructions can greatly improve access. The talk features a voice-powered demo based on https://github.com/techscouter/semantic-xr and proposes standardizing a semantic 3D metadata schema, its embedding in media, and web spatial voice.
Demo (5min)
Discussion & questions (10min)
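A rough sketch of what a semantic 3D metadata record along these lines might look like; the field names are assumptions for illustration and are not taken from the semantic-xr repository (TypeScript):

  // Illustrative only: one possible shape for semantic metadata attached to a
  // 3D/XR scene node, so a voice agent or screen reader can name, describe
  // and operate objects. Field names are assumptions, not the semantic-xr schema.
  interface SemanticNode {
    id: string;
    name: string;                 // human-readable label for TTS
    description?: string;         // longer description on request
    role?: "button" | "door" | "container" | "decoration";
    hidden?: boolean;             // present in the scene but not visible
    instructions?: string;        // how the object can be used
    children?: SemanticNode[];    // hierarchy for spatial navigation
  }

  const scene: SemanticNode = {
    id: "room-1",
    name: "Exhibition room",
    children: [
      { id: "door-east", name: "East door", role: "door",
        instructions: "Say 'open the east door' to continue." },
      { id: "case-3", name: "Display case 3", role: "container",
        description: "Contains a scale model of the lunar lander." },
    ],
  };

  // A voice agent could walk the hierarchy to answer "what is in this room?"
  function listContents(node: SemanticNode): string[] {
    return (node.children ?? []).map((c) => c.name);
  }
  console.log(listContents(scene)); // [ "East door", "Display case 3" ]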
4. Gaze-Aware Dialog Systems (20min)
Talk: Fares Abawi (10min)
Bio: Multimodal social cue integration for robot gaze control
Abstract: Gaze provides informative cues for barge-in detection, turn-taking coordination, and reference resolution in dialog systems, yet commonly remains underutilized in current dialog system implementations. This talk examines how webcam-based gaze tracking can ground deictic expressions and anticipate conversational intent in web interfaces, and discusses practical integration methods using neural architectures that fuse gaze features with acoustic and language representations for multimodal dialog systems.
Discussion & questions (10min)
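The abstract describes neural architectures that fuse gaze with acoustic and language representations; as a far simpler stand-in, the sketch below combines per-modality scores in a late-fusion barge-in decision, with arbitrary illustrative weights and threshold (TypeScript):

  // Illustrative late-fusion sketch (not the neural architectures discussed
  // in the talk): combine per-modality evidence into a barge-in decision.
  // The weights and threshold are arbitrary assumptions for the example.
  interface ModalityScores {
    gazeOnAgent: number;          // 0..1, e.g. from webcam-based gaze tracking
    voiceActivity: number;        // 0..1, from an acoustic voice activity detector
    addresseeLikelihood: number;  // 0..1, from a language model over the partial text
  }

  function shouldBargeIn(s: ModalityScores, threshold = 0.6): boolean {
    const fused = 0.3 * s.gazeOnAgent + 0.4 * s.voiceActivity + 0.3 * s.addresseeLikelihood;
    return fused >= threshold;
  }

  console.log(shouldBargeIn({ gazeOnAgent: 0.9, voiceActivity: 0.8, addresseeLikelihood: 0.7 })); // true
  console.log(shouldBargeIn({ gazeOnAgent: 0.1, voiceActivity: 0.8, addresseeLikelihood: 0.2 })); // false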
5. Transition of Use Cases for Voice to LLM-based RAG or Agent setups in difficult scenarios (20min)
Talk: Ulrike Stiefelhagen (10min)
Bio: Project manager for voice projects
Abstract: Handling hallucinations in voice agents might be even trickier than in textual chatbots. Use cases from industry (Workers Daily Summary) and health (Patient Chat) show that changed requirements in LLM-based systems may be a chance for voice to be included even in settings that are usually difficult for voice (privacy and noise, respectively).
Discussion & questions (10min)
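Since the abstract contrasts hallucination handling with LLM-based RAG setups, here is a minimal retrieve-then-answer sketch in which the agent declines to speak an answer it cannot ground in a retrieved passage; the in-memory index and crude overlap score are placeholders, not the speaker's systems (TypeScript):

  // Minimal retrieve-then-answer sketch: the voice agent only speaks an
  // answer it can ground in a retrieved passage, otherwise it says so. The
  // in-memory "index" and the crude overlap score are placeholders.
  const passages = [
    { id: "shift-report-114", text: "Line 3 was stopped for 40 minutes due to a conveyor fault." },
    { id: "shift-report-115", text: "Two packaging machines were recalibrated during the night shift." },
  ];

  function overlapScore(query: string, text: string): number {
    const q = new Set(query.toLowerCase().split(/\W+/).filter(Boolean));
    const t = new Set(text.toLowerCase().split(/\W+/).filter(Boolean));
    let hits = 0;
    for (const w of q) if (t.has(w)) hits++;
    return hits / Math.max(q.size, 1);
  }

  function answer(query: string): string {
    const best = passages
      .map((p) => ({ p, score: overlapScore(query, p.text) }))
      .sort((a, b) => b.score - a.score)[0];
    // Refuse rather than guess when nothing in the index supports an answer.
    if (!best || best.score < 0.3) {
      return "I don't have that in today's reports.";
    }
    return `According to ${best.p.id}: ${best.p.text}`;
  }

  console.log(answer("Why was line 3 stopped?"));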
6. Towards Web Standards for Configurable Naturally Responsive Voice Interaction for AI Agents (25min)
Talk: Paola Di Maio (10min)
Bio: Paola Di Maio, PhD is a systems engineer with research experience spanning neurosymbolic AI, knowledge representation, and human-AI co-evolution. Chair of the W3C AI Knowledge Representation Community Group, she participates in the development of web standards for AI systems. Her research focuses on the intersection of cognitive science and intelligent systems design.
Abstract: Voice agents are in place, but current designs fundamentally misunderstand how humans think and communicate. We have the technology, but we lack usability standards that would make voice interfaces truly work for users. Current voice AI systems suffer from critical UX failures that break natural conversation flow and violate fundamental principles of cognitive respect. Through systematic documentation of real-time voice interactions with state-of-the-art conversational AI, I have identified critical gaps that prevent voice agents from supporting how humans actually think and communicate. This talk presents empirical findings based on real use cases, will attempt to include a recorded demo, and aims to gather input toward the development of a web standard for user-friendly voice AI.
Demo (5min)
Discussion & questions (10min)
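One way to read "configurable" in the title is as a set of user-adjustable interaction parameters that a standard could name; the settings below are guesses for the sake of illustration, not a proposal from the talk (TypeScript):

  // Illustrative only: hypothetical user-facing settings a "configurable,
  // naturally responsive" voice interface might expose. Field names and
  // defaults are assumptions, not a proposal from the talk.
  interface VoiceInteractionSettings {
    allowBargeIn: boolean;          // user may interrupt the agent mid-utterance
    maxResponseDelayMs: number;     // silence budget before the agent must react
    endOfTurnSilenceMs: number;     // pause length treated as end of the user's turn
    confirmBeforeActions: boolean;  // read back before executing anything
    speakingRate: number;           // 1.0 = default TTS rate
  }

  const defaults: VoiceInteractionSettings = {
    allowBargeIn: true,
    maxResponseDelayMs: 700,
    endOfTurnSilenceMs: 900,
    confirmBeforeActions: true,
    speakingRate: 1.0,
  };

  console.log(defaults);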

Session 3: February 27, 2026, 12:00 GMT-5

Start times in various timezones:

Agenda

1. Scene setting (10min)
Goals and expectations for the workshop
2. Do we need real-time processing capabilities on voice agents? (20min)
Talk: Casey Kennington (10min)
Bio: Dialogue systems research since 2011.
Abstract: Alexa, Siri, and other voice agents play a game of verbal ping-pong with their users: the human utters a wake word, speaks, and then the agent processes the utterance and responds. Humans don't play verbal ping-pong with each other; they are constantly listening and updating their understanding in real time. "Incremental" agents process inputs and outputs in real time, at the word level, instead of waiting for an utterance to finish before responding. This introduces technical challenges, but it also makes agents more natural. In my talk, I will explore the need and requirements for building real-time processing into agents.
Discussion & questions (10min)
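In browsers, one readily available way to experiment with the word-level, incremental behaviour described above is the Web Speech API's interim results; the sketch below logs partial hypotheses as they arrive. Browser support varies, and this illustrates the idea rather than the speaker's agents (TypeScript):

  // Browser sketch: consume partial (interim) recognition results as they
  // arrive instead of waiting for a final utterance. Web Speech API support
  // varies by browser; this illustrates the idea only.
  const SpeechRecognitionImpl =
    (window as any).SpeechRecognition || (window as any).webkitSpeechRecognition;

  const recognition = new SpeechRecognitionImpl();
  recognition.continuous = true;
  recognition.interimResults = true;   // deliver hypotheses before the turn ends

  recognition.onresult = (event: any) => {
    for (let i = event.resultIndex; i < event.results.length; i++) {
      const result = event.results[i];
      const transcript = result[0].transcript;
      if (result.isFinal) {
        console.log("final:", transcript);
      } else {
        // An incremental agent could already start interpreting / planning here.
        console.log("partial:", transcript);
      }
    }
  };

  recognition.start();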
3. Voice Agents for In-Vehicle Interaction (20min)
Talk: Frankie James (10min)
Bio: Automotive industry consultant/HCI Researcher
Abstract: The in-cabin experience for passenger vehicles continues to increase in complexity as auto manufacturers add more features. However, most people only use a small fraction of their vehicle’s functionality, either because they do not know how to access certain features or are not aware of what features are available. This talk will discuss the use of voice agents to unlock hidden vehicle features, along with potential pitfalls and issues related to the technology.
Discussion & questions (10min)
4. Trust & Empathy with Multimodal Assistants (20min)
Talk: Raj Tumuluri (10min)
Bio: https://dblp.org/pid/154/2852.html
Abstract: This talk will explore the gap between functional performance and emotional resonance in modern AI. We will delve into key design principles for engineering cognitive empathy (the perceived understanding of user state/intent) and trustworthiness in systems handling complex inputs. Key topics include: tone consistency across modalities, handling ambiguity and error states gracefully, and visual/aural design choices that signal reliability and care. The presentation will conclude with a practical framework for designing multimodal assistants that users experience as trustworthy and empathetic.
Discussion & questions (10min)
5. Beyond Screen Readers: Standardizing Embeddable Voice Agents for Universal Web Accessibility (20min)
Talk: Bryan Vuong (10min)
Bio: Bryan Vuong is the CTO of InnoSearch AI, a company dedicated to developing technologies that make the Web more accessible and inclusive. He leads the technical strategy for CoBrowse AI, focusing on bridging the gap between visual web interfaces and non-visual access. Bryan holds a Ph.D. in Computer Science from the University of Wisconsin-Madison. He brings extensive industry experience to the discussion on standardization, having previously held engineering and product leadership roles at Google, Meta, and Walmart e-commerce.
Abstract: While screen readers have long been the standard for non-visual web access, they often present a steep learning curve and lack conversational context. This talk presents a case study of CoBrowse AI, an embeddable voice agent designed to layer voice interaction over existing websites. We will demonstrate how LLM-driven voice agents can bridge the gap between visual interfaces and blind/low-vision users, and critically, we will identify the specific Web Standard gaps that currently hinder the seamless integration of such third-party agents into the DOM.
Discussion & questions (10min)
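To make concrete what layering voice interaction over an existing website can involve today, the sketch below enumerates labelled interactive elements from the DOM and announces them with the browser's built-in speech synthesis; it is a toy illustration of the integration surface, not CoBrowse AI (TypeScript):

  // Toy sketch of an embeddable voice layer: collect labelled interactive
  // elements from the page and announce them with the browser's built-in TTS.
  // This illustrates the integration surface only, not CoBrowse AI.
  function describeInteractiveElements(): string[] {
    const selector = "a[href], button, input, select, textarea, [role='button'], [role='link']";
    const labels: string[] = [];
    document.querySelectorAll<HTMLElement>(selector).forEach((el) => {
      const label =
        el.getAttribute("aria-label") ||
        el.textContent?.trim() ||
        el.getAttribute("placeholder") ||
        "";
      if (label) labels.push(label);
    });
    return labels;
  }

  function speak(text: string): void {
    const utterance = new SpeechSynthesisUtterance(text);
    window.speechSynthesis.speak(utterance);
  }

  const labels = describeInteractiveElements();
  speak(
    labels.length
      ? `This page has ${labels.length} controls, including ${labels.slice(0, 3).join(", ")}.`
      : "I could not find any labelled controls on this page."
  );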