Workshop Report
Executive Summary
The Smart Voice Agents workshop brought together voice platform providers, agent developers, privacy experts, accessibility advocates, and standards professionals to advance interoperability and user empowerment in voice-enabled systems.
Workshop participants acknowledged the growing ubiquity of voice agents across devices and platforms, and identified key challenges in achieving seamless, secure, and privacy-respecting interactions across different voice ecosystems. They highlighted the need for standardized protocols for agent-to-agent communication, mechanisms for user consent and delegation, and frameworks for ensuring transparency in multi-agent conversations.
Key discussion areas included:
- Agent discovery and invocation mechanisms that respect user privacy and choice
- Protocols for delegating conversation control between agents (conversation handoff)
- Privacy-preserving authentication and user identification across agents
- Accessibility requirements for voice interfaces and multi-modal experiences
- Technical standards for voice agent interoperability
Beyond these individual topics, a suggested next step is to explore the possible creation of a voice agents activity at W3C to coordinate input from the voice community, pursue broader discussions on interoperability and privacy, and track progress on the needs identified during the workshop.
The following quote from RJ Burnham summarizes a central issue:
"Proprietary voice AI platforms can move quickly, but the result is fragmentation and lock-in. The key question is whether we can restore portability and interoperability without slowing innovation."
— RJ Burnham, Session 1
Editorial summary: Across all sessions and breakouts, participants repeatedly converged on eight cross-cutting issues that now define the practical standards agenda: pronunciation and language representation, hallucination reliability, real-time interaction quality, interoperability scope, privacy and delegation boundaries, multimodal synchronization, immersive accessibility integration, and cultural/persona adaptation. These priorities are consolidated in the Top 8 Cross-cutting Issues section and can serve as a roadmap for near-term Community Group and other W3C discussions.
Introduction
W3C holds workshops to gain insights from different perspectives, identify needs that could warrant standardization at W3C or elsewhere, and assess support and priorities among affected communities. The goal for the Workshop on Smart Voice Agents was to bring together voice platform providers, agent developers, privacy experts, accessibility advocates, and standards professionals to advance interoperability and user empowerment in voice-enabled systems.
The workshop convened a diverse group of participants representing different facets of voice technology. Discussions covered a wide range of topics, from high-level architectural questions about agent discovery and delegation to specific technical issues such as real-time interaction, multimodal grounding, accessibility, trust, and interoperability across platforms.
Sessions
Session 1
Focus
Focused on trust, governance, and interoperability, including talks by Patricia Lee, Sarah Wood, Bhiksha Raj, Emmett Coin, and RJ Burnham.
Highlights
- Governance frameworks for standardization
- Pronunciation consistency for web voice systems
- ASR hallucination mitigation
- Open Floor Protocol multi-agent collaboration
Session 2
Focus
Focused on grounded interaction design and multimodal intelligence, with talks by Kristiina Jokinen, Zohar Gan, Fares Abawi, Ulrike Stiefelhagen, and Paola Di Maio.
Highlights
- Context-grounded trustworthy voice agents
- Accessibility for immersive/3D content
- Gaze-aware dialog
- Configurable standards for naturally responsive voice interaction
Session 3
Focus
Focused on deployment realism and user trust, with talks by Casey Kennington, Frankie James, Raj Tumuluri, and Bryan Vuong.
Highlights
- Real-time incremental processing
- In-vehicle voice interaction
- Multimodal trust and empathy design
- Embeddable voice agents for web accessibility
Topics
Agent Interoperability
Interoperability was a core thread throughout Session 1. Emmett Coin ("Multi-Agent Conversational Methodology") presented work in the Open Floor Protocol (OFP) community on cross-agent collaboration between multiple agents and a human participant, with emphasis on coordinated turn-taking and shared conversational state.
RJ Burnham ("Reimagining Standards for Voice AI") argued that current closed ecosystems create portability and integration friction, and proposed revisiting standards strategy for modern LLM-driven voice systems. Patricia Lee ("Governance and Greenlights") added a process and governance lens, focusing on how trust and compliance requirements must be built into interoperability planning from the start.
Breakout discussions reinforced the need for interoperable interfaces that preserve innovation while reducing lock-in, especially for multi-agent orchestration and cross-vendor integration.
Security and Privacy
Security and privacy were addressed from governance and deployment angles. Patricia Lee highlighted cyber-resilient and compliant trust models as prerequisites for broad adoption. Ulrike Stiefelhagen discussed difficult scenarios in industry and health, noting that hallucination and reliability risks are amplified in voice-first experiences where users may infer higher confidence than warranted.
Session discussions also surfaced practical needs for secure defaults, transparent data handling, and clearer boundaries for what agents can do on behalf of users in sensitive domains.
User Control and Transparency
Several talks converged on user agency as a design requirement. Kristiina Jokinen ("Towards Smarter Voice Interfaces: Using Grounding and Knowledge") emphasized accountable reasoning grounded in shared context, while Raj Tumuluri ("Trust & Empathy with Multimodal Assistants") focused on explainable behavior during ambiguity and error states.
Casey Kennington's talk on real-time processing framed responsiveness as part of transparency: users better understand and trust systems that react incrementally and predictably rather than in opaque turn-by-turn blocks.
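The contrast between incremental and turn-by-turn processing can be sketched as below. This is a purely illustrative Python sketch, not code from any of the presented systems: the `streaming_asr` generator merely stands in for a streaming recognizer feed, and `incremental_intent` shows how a partial hypothesis can be re-evaluated after every word.

```python
from typing import Iterator, List

def streaming_asr(utterance: str) -> Iterator[str]:
    """Stand-in for a streaming recognizer that emits one word at a time."""
    for word in utterance.split():
        yield word

def incremental_intent(words: Iterator[str]) -> List[str]:
    """Update the running hypothesis after every word instead of waiting
    for the end of the turn. Each entry records what the agent 'believed'
    at that point, which is what makes its behavior inspectable."""
    hypothesis: List[str] = []
    beliefs: List[str] = []
    for word in words:
        hypothesis.append(word)
        beliefs.append(" ".join(hypothesis))
    return beliefs

beliefs = incremental_intent(streaming_asr("turn on the kitchen lights"))
for b in beliefs:
    print(b)
# An incremental agent can start reacting (e.g., highlighting "kitchen")
# as soon as enough words arrive, rather than only after the turn ends.
```

Exposing these intermediate beliefs, rather than only the final answer, is one way responsiveness becomes a form of transparency.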
Multi-Agent Conversations
Multi-agent collaboration was directly addressed by Emmett Coin's OFP presentation, which demonstrated coordinated participation among multiple agents and a human in one conversation. The workshop treated delegation and handoff not only as routing problems but as interaction design problems involving turn ownership, context transfer, and participant awareness.
Complementary inputs from Fares Abawi ("Gaze-Aware Dialog Systems") suggested that multimodal signals can improve turn-taking and intent resolution in multi-party interactions, especially when voice alone is ambiguous.
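To make the handoff discussion concrete, the sketch below models a minimal conversation-handoff message in Python. The schema is an illustrative assumption for this report, not the Open Floor Protocol's actual message format or any published specification; field names such as `from_agent`, `context`, and `user_consent` simply mirror the turn-ownership, context-transfer, and participant-awareness concerns raised above.

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class HandoffEnvelope:
    """Illustrative handoff message passed between cooperating voice agents."""
    conversation_id: str
    from_agent: str
    to_agent: str
    reason: str                                   # why control is being delegated
    context: dict = field(default_factory=dict)   # shared conversational state
    user_consent: bool = False                    # explicit user approval for the transfer

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

# A travel agent hands the floor to a payment agent, carrying shared context.
msg = HandoffEnvelope(
    conversation_id="conv-001",
    from_agent="travel-assistant",
    to_agent="payment-agent",
    reason="user requested checkout",
    context={"booking_ref": "XYZ123", "locale": "en-US"},
    user_consent=True,
)
print(msg.to_json())
```

Treating consent as an explicit field of the envelope, rather than an out-of-band assumption, reflects the workshop's framing of handoff as an interaction design problem and not just a routing one.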
Accessibility and Inclusivity
Accessibility was one of the strongest themes across sessions. Sarah Wood ("Solving Lead vs. Lead") discussed pronunciation consistency and argued for standardized speech markup support in web content to improve assistive and agent outcomes. Zohar Gan presented voice accessibility for 3D and immersive content using semantic metadata, with concrete proposals for standardizing metadata representation and integration.
Bryan Vuong ("Beyond Screen Readers") described embeddable voice agents for blind and low-vision users, identifying web platform gaps that currently limit seamless integration. Together, these talks emphasized that inclusive voice interaction depends on both semantic content standards and platform-level integration points.
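The pronunciation problem behind "lead vs. lead" maps naturally onto SSML's existing `phoneme` element. The helper below simply assembles such markup as a string; it is an illustration of author-controlled pronunciation, not a reference to any particular synthesizer's API, and the proposals discussed at the workshop go beyond what SSML alone provides.

```python
def with_pronunciation(word: str, ipa: str) -> str:
    """Wrap a word in an SSML <phoneme> element carrying an explicit
    IPA transcription, so a synthesizer need not guess the reading."""
    return f'<phoneme alphabet="ipa" ph="{ipa}">{word}</phoneme>'

# Homograph disambiguation: "lead" the metal vs. "lead" the verb.
metal = with_pronunciation("lead", "lɛd")
verb = with_pronunciation("lead", "liːd")
ssml = f"<speak>The pipe is made of {metal}; please {verb} the way.</speak>"
print(ssml)
```

The open question raised in the session is how such author intent should be carried in ordinary web content, where SSML is not natively supported, so that assistive tools and voice agents can all benefit from the same markup.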
Voice Biometrics and Authentication
While no dedicated biometrics talk was scheduled, identity, trust, and authorization concerns surfaced repeatedly in adjacent discussions. Patricia Lee's trust-and-compliance framing, together with the healthcare and industry scenarios discussed by Ulrike Stiefelhagen, highlighted the need for stronger authentication and verification models in high-stakes voice workflows.
A recurring takeaway was to scope future standards work around privacy-preserving identity assertions, explicit consent and delegation boundaries, and auditable decision trails for agent actions taken on behalf of users.
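These three ingredients (explicit scope, time-bounded consent, and an audit trail) can be sketched together in a few lines. The record below is a hypothetical illustration of the shape such a delegation grant might take; none of the field names or the `authorize` check come from an existing specification.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List

@dataclass
class DelegationGrant:
    """Illustrative consent record bounding what an agent may do for a user."""
    user_id: str
    agent_id: str
    allowed_actions: List[str]            # explicit scope; nothing is implied
    expires_at: datetime                  # delegation is always time-bounded
    audit_log: List[str] = field(default_factory=list)

    def authorize(self, action: str, now: datetime) -> bool:
        """Check an action against scope and expiry, logging every decision."""
        ok = action in self.allowed_actions and now < self.expires_at
        self.audit_log.append(f"{now.isoformat()} {action} -> {'ALLOW' if ok else 'DENY'}")
        return ok

now = datetime(2026, 2, 27, tzinfo=timezone.utc)
grant = DelegationGrant(
    user_id="user-42",
    agent_id="health-agent",
    allowed_actions=["read_schedule"],
    expires_at=now + timedelta(hours=1),
)
print(grant.authorize("read_schedule", now))   # within scope and not expired
print(grant.authorize("share_records", now))   # outside the granted scope
```

Logging denials as well as approvals is what makes the trail auditable: a user (or regulator) can later see not only what an agent did, but what it attempted.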
Issues Identified
The following issues were repeatedly raised in plenary and breakout minutes and remain open for follow-up in Community Groups, breakouts, or formal standardization tracks.
Top 8 Cross-cutting Issues
- Pronunciation and language representation: unresolved standards for phonetic markup, dialect variation, proper names, abbreviations, and author control.
- Reliability and hallucination control: missing shared benchmarks and evaluation methods for ASR and ASR+LLM error modes across multilingual and noisy settings.
- Real-time interaction quality: open problems in incremental processing, response timing, interruption behavior, and low-latency turn-taking.
- Interoperability scope and architecture: no clear consensus yet on where to standardize first (protocol, API, dialog model, or integration profile).
- Privacy, trust, and delegation boundaries: unresolved requirements for consent, identity assertions, redaction, verification, and auditable agent actions.
- Multimodal coordination and synchronization: open questions around gaze/speech fusion, speaker diarization, intent inference, and multi-stream time alignment.
- Accessibility in immersive and web contexts: gaps in semantic metadata, timing annotations, and practical integration hooks for assistive voice interaction.
- Cultural, emotional, and persona adaptation: lack of interoperable models and guardrails for culturally aware behavior, emotion signaling, and safe agent personas.
25 February (Session 1)
Sources: Session 1 minutes and Multi-Agent Conversational Methodology breakout minutes.
- Pronunciation and localization are unresolved: no one-size-fits-all approach across languages, dialects, proper names, and abbreviations; author control and shared representation mechanisms need further definition.
- ASR hallucination handling lacks common metrics: participants asked for formal benchmarking and clearer definitions, with unresolved challenges for multilingual cases and real-time chunk-based processing.
- Multi-agent standardization scope is still unclear: open questions remain on whether to prioritize protocol, API, or dialog-management layers, and how to align with related external efforts.
- Cross-cultural and multilingual coordination is underspecified: breakout discussion raised unresolved needs for language mediation, culturally aware interaction styles, and shared dialog-management models for multi-agent plus human conversation.
- Agent persona and intent signaling need guardrails: participants raised trust concerns about inconsistent or potentially malicious agent behavior and the lack of standard ways to describe agent purpose and attitude.
- Privacy and trust boundaries need refinement: unresolved concerns include user-agent vs. agent-agent data protection, redaction, verification, and compliance in delegated workflows.
- Interoperability consensus is incomplete: participants noted persistent complexity around input/output/context exchange and highlighted the need for champions and exploratory work before convergence.
26 February (Session 2)
Sources: Session 2 minutes and Breakout 2 minutes (Accessibility of 3D and Immersive Content via Voice Interaction).
- Accessibility for immersive environments needs standard hooks: discussion highlighted missing shared approaches for semantic metadata and integration of voice access in 3D and digital-twin contexts.
- Real-time timing control remains difficult: stricter timing and interruption behavior for natural dialog, especially in industrial use cases, remains an open technical and standards problem.
- Grounding formalization is not yet mature: participants called for more computable and reusable grounding structures so systems can exchange and apply contextual knowledge consistently.
- Configurable responsive UX lacks clear standards: unresolved questions include speech end-point detection, interaction pacing, and adaptation to changing model/user variables.
- Latency, privacy, and deployment architecture trade-offs are unresolved: breakout participants identified open design choices around cloud vs. local AI and hybrid execution for disability-focused response-time requirements.
- Interaction timing metadata is not standardized: participants discussed dynamic timing and annotation needs for special audio behavior, but no shared format or integration profile was identified.
- Gaze + multimodal fusion raises open questions: issues include how to infer intent reliably, measure usefulness, synchronize multiple data streams, and avoid fragile calibration assumptions.
27 February (Session 3)
Sources: Session 3 main minutes and Breakout minutes.
- Real-time processing for voice agents: Demonstrations and discussion on incremental, word-by-word speech processing, turn-taking, timing models, and the importance of modular, incremental dialogue processing (Casey Kennington, retico-team).
- In-vehicle voice interaction: Challenges and research on voice agents in automotive environments, safety trade-offs, multimodal feedback, the evolution from physical controls to voice, and the need for on-board vs. off-board processing (Frankie James).
- Trust & empathy in multimodal assistants: Engineering empathy, reliability models, "sentient" digital twins, and the importance of human-centric design for more natural interaction (Raj Tumuluri).
- Web accessibility and embeddable voice agents: Standardizing voice agents for universal web accessibility, with a focus on blind and low-vision users, fast conversational interfaces, and DOM-based optimization (Bryan Vuong).
- Breakout discussions: Explored incremental error handling, multimodal fusion, teachable moments in vehicles, privacy and voice fakes, on-board vs. off-board processing, collaborative approaches to agent interaction, and the need for robust speaker diarization and emotion/cultural adaptation models.
- Standards and next steps: Emphasis on the need for W3C standardization in areas such as LLM APIs, multimodal fusion, timing, privacy, and streaming architectures. Noted the relevance of Community Groups and upcoming W3C events (Breakouts Day, TPAC 2026, possible journal special issue).
Next Steps
The conversation does not end with the workshop. We encourage continued collaboration through Community Groups, upcoming W3C events, and publication opportunities that can carry these discussions into concrete standards and implementation work.
We Value Your Feedback
We are committed to learning from this experience. Whether you have praise or constructive criticism about how the workshop was organized and experienced, please share your feedback.
- Email your thoughts to: group-voiceagents-ws-pc@w3.org
Recordings & Materials
For those who wish to revisit a session or catch up on what they missed, recordings are available on the workshop agenda.
What to Watch (Technologies)
Based on workshop talks and breakout discussions, the following technology areas are likely to shape near-term progress for smart voice agents on the Web:
- Multi-agent interoperability protocols: watch for practical protocol definitions and reference implementations that support handoff, shared context, and cross-vendor coordination.
- Real-time incremental voice processing: watch for approaches that reduce turn latency and enable more natural, continuous interaction patterns.
- Grounding and accountable reasoning: watch for methods that anchor responses in shared context and verifiable knowledge, especially in high-stakes domains.
- Hallucination detection and mitigation: watch for evaluation frameworks and safeguards specific to ASR + LLM voice pipelines.
- Multimodal interaction signals: watch integration of gaze and other non-verbal cues to improve turn-taking, intent resolution, and conversational robustness.
- Accessibility-oriented semantics: watch progress on pronunciation markup, semantic 3D metadata, and embeddable voice interfaces for non-visual access.
- Trust, consent, and governance models: watch for patterns that define clear user delegation boundaries, auditable agent actions, and privacy-preserving identity assertions.
Relevant W3C Working Groups and Interest Groups
- Web Accessibility Initiative (WAI): Standards and best practices for accessible web technologies, including voice interfaces.
- Web Platform Working Group: Responsible for core web standards that may impact agent integration.
- Privacy Interest Group (PING): Focuses on privacy issues and best practices for web technologies, including those affecting voice agents and user data.
- Web Applications Working Group: Develops APIs and technologies for web applications, relevant for agent interoperability.
- Web Audio Working Group: Develops standards for audio processing and synthesis on the web, essential for voice agent technologies.
- Web of Things (WoT) Working Group: Develops standards for integrating physical devices and sensors with the web, relevant for voice agents interacting with smart environments.
- Smart Cities Interest Group: Focuses on web standards and best practices for smart city applications, including interoperability and integration of IoT and voice technologies in urban environments.
Relevant W3C Community Groups
As highlighted in the workshop follow-up, the following W3C Community Groups are particularly relevant for continuing Smart Voice Agents work before formal standardization.
- Voice Interaction CG: focusing on the interface between users and voice-based applications.
- Autonomous Agents on the Web CG: exploring how agents navigate and act upon web content.
- AI Agent Protocol CG: defining how agents communicate with one another.
- Semantic 3D Content Accessibility CG: bridging voice and spatial computing.
- Start a new CG: participants with a new idea are encouraged to initiate new incubation efforts.
W3C Breakouts Day 2026
Breakouts Day is an annual virtual unconference where the global web community proposes and explores focused problems. Although the deadline for proposals has passed, participants can still join discussions and contribute to ongoing projects.
- Date: 25-26 March 2026.
- Event information: W3C Events.
TPAC 2026
TPAC is W3C's premier annual event where Working Groups and Community Groups coordinate and solve cross-cutting web platform challenges.
- When: 26 October – 30 October 2026.
- Details: TPAC 2026.
Journal Publication
The organizers are exploring a Special Issue of an academic journal based on workshop themes. A formal Call for Papers is expected, and workshop speakers will receive a dedicated invitation when details are finalized.
- Proposed theme focus: interoperable, real-time, multimodal, and inclusive smart voice agents.
- Status: planning in progress; formal announcement to follow.
Thank You!
Thank you for your interest in the W3C Workshop on Smart Voice Agents. Over three half-day sessions, the workshop explored a rapidly evolving landscape and benefited from strong engagement across talks, plenary exchanges, and breakout discussions.
From Foundations and Interoperability on Day 1, to Smarter and More Inclusive Interactions on Day 2, and Real-time, Contextual, and Applied Voice Agents on Day 3, the breadth of insights shared was remarkable.
On behalf of the organizing committee, thanks to the speakers for contributing their expertise and to all attendees for thoughtful and energetic discussions. We look forward to building a smarter, more interoperable voice ecosystem.
W3C is proud to be an open and inclusive organization, focused on productive discussions and actions. Our Code of Conduct ensures that all voices can be heard.
Suggestions for improving this workshop page, such as fixing typos or adding specific topics, can be made by opening a pull request on GitHub.