W3C Workshop on Smart Voice Agents

Virtual, 25-27 February 2026

This workshop was organized as a fully virtual W3C event with talks, online discussions, and interactive sessions.

See also the workshop agenda for session details and recordings.

Workshop Report

Executive Summary

The Smart Voice Agents workshop brought together voice platform providers, agent developers, privacy experts, accessibility advocates, and standards professionals to advance interoperability and user empowerment in voice-enabled systems.

Workshop participants acknowledged the growing ubiquity of voice agents across devices and platforms, and identified key challenges in achieving seamless, secure, and privacy-respecting interactions across different voice ecosystems. They highlighted the need for standardized protocols for agent-to-agent communication, mechanisms for user consent and delegation, and frameworks for ensuring transparency in multi-agent conversations.

Key discussion areas included trust and governance, interoperability, grounded and multimodal interaction, accessibility, and real-world deployment.

On top of individual topics, one of the suggested next steps is to explore the possible creation of a voice agents activity at W3C to coordinate inputs from the voice community, pursue broader discussions on interoperability and privacy, and track progress on needs identified during the workshop.

The following quote from RJ Burnham summarizes a central issue:

"Proprietary voice AI platforms can move quickly, but the result is fragmentation and lock-in. The key question is whether we can restore portability and interoperability without slowing innovation."

— RJ Burnham, Session 1

Editorial summary: Across all sessions and breakouts, participants repeatedly converged on eight cross-cutting issues that now define the practical standards agenda: pronunciation and language representation, hallucination reliability, real-time interaction quality, interoperability scope, privacy and delegation boundaries, multimodal synchronization, immersive accessibility integration, and cultural/persona adaptation. These priorities are consolidated in the Top 8 Cross-cutting Issues section and can serve as a roadmap for near-term Community Group and other W3C discussions.

Introduction

W3C holds workshops to gain insights from different perspectives, identify needs that could warrant standardization at W3C or elsewhere, and assess support and priorities among affected communities. The goal for the Workshop on Smart Voice Agents was to bring together voice platform providers, agent developers, privacy experts, accessibility advocates, and standards professionals to advance interoperability and user empowerment in voice-enabled systems.

The workshop convened a diverse group of participants representing different facets of voice technology. Discussions covered a wide range of topics, from high-level architectural questions about agent discovery and delegation to specific technical issues such as real-time interaction, multimodal grounding, accessibility, trust, and interoperability across platforms.

Sessions

Session 1

Focus

Focused on trust, governance, and interoperability, including talks by Patricia Lee, Sarah Wood, Bhiksha Raj, Emmett Coin, and RJ Burnham.


Session 2

Focus

Focused on grounded interaction design and multimodal intelligence, with talks by Kristiina Jokinen, Zohar Gan, Fares Abawi, Ulrike Stiefelhagen, and Paola Di Maio.


Session 3

Focus

Focused on deployment realism and user trust, with talks by Casey Kennington, Frankie James, Raj Tumuluri, and Bryan Vuong.


Topics

Agent Interoperability

Interoperability was a core thread throughout Session 1. Emmett Coin ("Multi-Agent Conversational Methodology") presented work in the Open Floor Protocol (OFP) community on cross-agent collaboration between multiple agents and a human participant, with emphasis on coordinated turn-taking and shared conversational state.

RJ Burnham ("Reimagining Standards for Voice AI") argued that current closed ecosystems create portability and integration friction, and proposed revisiting standards strategy for modern LLM-driven voice systems. Patricia Lee ("Governance and Greenlights") added a process and governance lens, focusing on how trust and compliance requirements must be built into interoperability planning from the start.

Breakout discussions reinforced the need for interoperable interfaces that preserve innovation while reducing lock-in, especially for multi-agent orchestration and cross-vendor integration.

Privacy and Security

Security and privacy were addressed from governance and deployment angles. Patricia Lee highlighted cyber-resilient and compliant trust models as prerequisites for broad adoption. Ulrike Stiefelhagen discussed difficult scenarios in industry and health, noting that hallucination and reliability risks are amplified in voice-first experiences where users may infer higher confidence than warranted.

Session discussions also surfaced practical needs for secure defaults, transparent data handling, and clearer boundaries for what agents can do on behalf of users in sensitive domains.

User Control and Transparency

Several talks converged on user agency as a design requirement. Kristiina Jokinen ("Towards Smarter Voice Interfaces: Using Grounding and Knowledge") emphasized accountable reasoning grounded in shared context, while Raj Tumuluri ("Trust & Empathy with Multimodal Assistants") focused on explainable behavior during ambiguity and error states.

Casey Kennington's talk on real-time processing framed responsiveness as part of transparency: users better understand and trust systems that react incrementally and predictably rather than in opaque turn-by-turn blocks.
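The incremental pattern described above can be sketched in a few lines of Python. This is an illustrative toy, not the retico framework's actual API: partial word-level hypotheses are exposed to downstream modules as soon as they arrive, earlier words can be committed, and later revisions replace only the uncommitted tail rather than waiting for the end of the turn.

```python
from dataclasses import dataclass

@dataclass
class IncrementalUnit:
    """A word-level hypothesis that may later be revised or committed."""
    text: str
    committed: bool = False

class IncrementalTranscript:
    """Toy sketch of incremental (word-by-word) speech processing."""

    def __init__(self):
        self.units: list[IncrementalUnit] = []

    def update(self, hypothesis: list[str]) -> None:
        # Keep the committed prefix; replace the revisable tail with the new hypothesis.
        committed = [u for u in self.units if u.committed]
        self.units = committed + [IncrementalUnit(w) for w in hypothesis[len(committed):]]

    def commit(self, n: int) -> None:
        # Mark the first n units as stable; revisions can no longer change them.
        for u in self.units[:n]:
            u.committed = True

    def text(self) -> str:
        return " ".join(u.text for u in self.units)

t = IncrementalTranscript()
t.update(["turn", "on"])                    # partial hypothesis arrives early
t.update(["turn", "on", "the"])             # transcript grows word by word
t.commit(2)                                 # first two words are now stable
t.update(["turn", "on", "that", "light"])   # revision only touches uncommitted words
print(t.text())  # -> turn on that light
```

The point of the sketch is the contract, not the data structure: downstream dialog modules can react to `update` events immediately and predictably, which is exactly the transparency benefit the talk described.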

Multi-Agent Conversations

Multi-agent collaboration was directly addressed by Emmett Coin's OFP presentation, which demonstrated coordinated participation among multiple agents and a human in one conversation. The workshop treated delegation and handoff not only as routing problems but as interaction design problems involving turn ownership, context transfer, and participant awareness.
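To make the three interaction-design concerns concrete, the sketch below shows a hypothetical handoff envelope. The field names are illustrative assumptions, not the Open Floor Protocol's actual schema: `to_agent` makes turn ownership explicit, `context` carries shared conversational state across the handoff, and `participants` preserves participant awareness.

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass
class HandoffEnvelope:
    """Hypothetical agent-to-agent handoff message (field names are illustrative)."""
    conversation_id: str
    from_agent: str
    to_agent: str                                  # explicit transfer of turn ownership
    participants: list                             # who remains in the conversation
    context: dict = field(default_factory=dict)    # shared state carried across the handoff

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)

msg = HandoffEnvelope(
    conversation_id="conv-42",
    from_agent="travel-agent",
    to_agent="payment-agent",
    participants=["user", "travel-agent", "payment-agent"],
    context={"intent": "book_flight", "locale": "en-US"},
)
print(msg.to_json())
```

Framing handoff as a typed message rather than a routing rule is what lets all participants, including the human, observe who holds the turn and what context was transferred.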

Complementary inputs from Fares Abawi ("Gaze-Aware Dialog Systems") suggested that multimodal signals can improve turn-taking and intent resolution in multi-party interactions, especially when voice alone is ambiguous.

Accessibility and Inclusivity

Accessibility was one of the strongest themes across sessions. Sarah Wood ("Solving Lead vs. Lead") discussed pronunciation consistency and argued for standardized speech markup support in web content to improve assistive and agent outcomes. Zohar Gan presented voice accessibility for 3D and immersive content using semantic metadata, with concrete proposals for standardizing metadata representation and integration.
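W3C's existing Speech Synthesis Markup Language (SSML) already offers the kind of author control discussed here: its `phoneme` element pins down pronunciation for homographs such as "lead". The snippet below builds two such fragments as plain strings to illustrate the mechanism; the open question raised in the talk is how to make equivalent markup available in ordinary web content.

```python
# SSML's <phoneme> element disambiguates homographs via an explicit IPA transcription.
verb = (
    '<speak>Please <phoneme alphabet="ipa" ph="li\u02d0d">lead</phoneme> '
    'the way.</speak>'
)
metal = (
    '<speak>The pipe is made of '
    '<phoneme alphabet="ipa" ph="l\u025bd">lead</phoneme>.</speak>'
)
print(verb)   # verb "lead", pronounced /li:d/
print(metal)  # metal "lead", pronounced /led/
```

Without such markup, a screen reader or voice agent must guess from context, which is precisely the inconsistency the talk's title highlights.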

Bryan Vuong ("Beyond Screen Readers") described embeddable voice agents for blind and low-vision users, identifying web platform gaps that currently limit seamless integration. Together, these talks emphasized that inclusive voice interaction depends on both semantic content standards and platform-level integration points.

Voice Biometrics and Authentication

While no dedicated biometrics talk was scheduled, identity, trust, and authorization concerns surfaced repeatedly in adjacent discussions. Patricia Lee's trust-and-compliance framing, together with the healthcare and industry scenarios discussed by Ulrike Stiefelhagen, highlighted the need for stronger authentication and verification models in high-stakes voice workflows.

A recurring takeaway was to scope future standards work around privacy-preserving identity assertions, explicit consent and delegation boundaries, and auditable decision trails for agent actions taken on behalf of users.

Issues Identified

The following issues were repeatedly raised in plenary and breakout minutes and remain open for follow-up in Community Groups, breakouts, or formal standardization tracks.

Top 8 Cross-cutting Issues

  • Pronunciation and language representation: unresolved standards for phonetic markup, dialect variation, proper names, abbreviations, and author control.
  • Reliability and hallucination control: missing shared benchmarks and evaluation methods for ASR and ASR+LLM error modes across multilingual and noisy settings.
  • Real-time interaction quality: open problems in incremental processing, response timing, interruption behavior, and low-latency turn-taking.
  • Interoperability scope and architecture: no clear consensus yet on where to standardize first (protocol, API, dialog model, or integration profile).
  • Privacy, trust, and delegation boundaries: unresolved requirements for consent, identity assertions, redaction, verification, and auditable agent actions.
  • Multimodal coordination and synchronization: open questions around gaze/speech fusion, speaker diarization, intent inference, and multi-stream time alignment.
  • Accessibility in immersive and web contexts: gaps in semantic metadata, timing annotations, and practical integration hooks for assistive voice interaction.
  • Cultural, emotional, and persona adaptation: lack of interoperable models and guardrails for culturally aware behavior, emotion signaling, and safe agent personas.

25 February (Session 1)

Sources: Session 1 minutes and Multi-Agent Conversational Methodology breakout minutes.

  • Pronunciation and localization are unresolved: no one-size-fits-all approach across languages, dialects, proper names, and abbreviations; author control and shared representation mechanisms need further definition.
  • ASR hallucination handling lacks common metrics: participants asked for formal benchmarking and clearer definitions, with unresolved challenges for multilingual cases and real-time chunk-based processing.
  • Multi-agent standardization scope is still unclear: open questions remain on whether to prioritize protocol, API, or dialog-management layers, and how to align with related external efforts.
  • Cross-cultural and multilingual coordination is underspecified: breakout discussion raised unresolved needs for language mediation, culturally aware interaction styles, and shared dialog-management models for multi-agent plus human conversation.
  • Agent persona and intent signaling need guardrails: participants raised trust concerns about inconsistent or potentially malicious agent behavior and the lack of standard ways to describe agent purpose and attitude.
  • Privacy and trust boundaries need refinement: unresolved concerns include user-agent vs. agent-agent data protection, redaction, verification, and compliance in delegated workflows.
  • Interoperability consensus is incomplete: participants noted persistent complexity around input/output/context exchange and highlighted the need for champions and exploratory work before convergence.

26 February (Session 2)

Sources: Session 2 minutes and Breakout 2 minutes (Accessibility of 3D and Immersive Content via Voice Interaction).

  • Accessibility for immersive environments needs standard hooks: discussion highlighted missing shared approaches for semantic metadata and integration of voice access in 3D and digital-twin contexts.
  • Real-time timing control remains difficult: stricter timing and interruption behavior for natural dialog, especially in industrial use cases, remains an open technical and standards problem.
  • Grounding formalization is not yet mature: participants called for more computable and reusable grounding structures so systems can exchange and apply contextual knowledge consistently.
  • Configurable responsive UX lacks clear standards: unresolved questions include speech end-point detection, interaction pacing, and adaptation to changing model/user variables.
  • Latency, privacy, and deployment architecture trade-offs are unresolved: breakout participants identified open design choices around cloud vs. local AI and hybrid execution for disability-focused response-time requirements.
  • Interaction timing metadata is not standardized: participants discussed dynamic timing and annotation needs for special audio behavior, but no shared format or integration profile was identified.
  • Gaze + multimodal fusion raises open questions: issues include how to infer intent reliably, measure usefulness, synchronize multiple data streams, and avoid fragile calibration assumptions.
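The multi-stream time-alignment problem in the last bullet can be illustrated with a toy nearest-timestamp join. Timestamps, tolerance, and stream contents below are invented for illustration; real fusion would also need clock synchronization and confidence handling.

```python
import bisect

def align(speech, gaze, tolerance=0.15):
    """Pair each speech token with the nearest gaze sample by timestamp.

    speech: [(t, word)] and gaze: [(t, target)], both sorted by time t (seconds).
    A token gets no gaze target if the nearest sample is farther than `tolerance`.
    """
    gaze_times = [t for t, _ in gaze]
    aligned = []
    for t, word in speech:
        i = bisect.bisect_left(gaze_times, t)
        # Candidates: the gaze samples just before and just after the word.
        best = min(
            (g for g in gaze[max(i - 1, 0):i + 1]),
            key=lambda g: abs(g[0] - t),
            default=None,
        )
        target = best[1] if best and abs(best[0] - t) <= tolerance else None
        aligned.append((word, target))
    return aligned

speech = [(0.10, "turn"), (0.40, "that"), (0.90, "off")]
gaze = [(0.05, "lamp"), (0.45, "lamp"), (2.00, "window")]
print(align(speech, gaze))  # -> [('turn', 'lamp'), ('that', 'lamp'), ('off', None)]
```

Even this toy exposes the open questions from the breakout: the tolerance window, what to do when no sample qualifies, and how fragile the result is to calibration drift between streams.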

27 February (Session 3)

Sources: Session 3 main minutes and Breakout minutes.

  • Real-time processing for voice agents: Demonstrations and discussion on incremental, word-by-word speech processing, turn-taking, timing models, and the importance of modular, incremental dialogue processing (Casey Kennington, retico-team).
  • In-vehicle voice interaction: Challenges and research on voice agents in automotive environments, safety trade-offs, multimodal feedback, the evolution from physical controls to voice, and the need for on-board vs. off-board processing (Frankie James).
  • Trust & empathy in multimodal assistants: Engineering empathy, reliability models, "sentient" digital twins, and the importance of human-centric design for more natural interaction (Raj Tumuluri).
  • Web accessibility and embeddable voice agents: Standardizing voice agents for universal web accessibility, with a focus on blind and low-vision users, fast conversational interfaces, and DOM-based optimization (Bryan Vuong).
  • Breakout discussions: Explored incremental error handling, multimodal fusion, teachable moments in vehicles, privacy and voice fakes, on-board vs. off-board processing, collaborative approaches to agent interaction, and the need for robust speaker diarization and emotion/cultural adaptation models.
  • Standards and next steps: Emphasis on the need for W3C standardization in areas such as LLM APIs, multimodal fusion, timing, privacy, and streaming architectures. Noted the relevance of Community Groups and upcoming W3C events (Breakouts Day, TPAC 2026, possible journal special issue).

Next Steps

The conversation does not end with the workshop. We encourage continued collaboration through Community Groups, upcoming W3C events, and publication opportunities that can carry these discussions into concrete standards and implementation work.

We Value Your Feedback

We are committed to learning from this experience. Whether you have positive highlights or constructive criticism regarding how the workshop was perceived or organized, please share your feedback.

Recordings & Materials

For those who wish to revisit a session or catch up on what they missed, recordings are available on the workshop agenda.

What to Watch (Technologies)

Based on workshop talks and breakout discussions, the following technology areas are likely to shape near-term progress for smart voice agents on the Web:

  • Multi-agent interoperability protocols: watch for practical protocol definitions and reference implementations that support handoff, shared context, and cross-vendor coordination.
  • Real-time incremental voice processing: watch for approaches that reduce turn latency and enable more natural, continuous interaction patterns.
  • Grounding and accountable reasoning: watch for methods that anchor responses in shared context and verifiable knowledge, especially in high-stakes domains.
  • Hallucination detection and mitigation: watch for evaluation frameworks and safeguards specific to ASR + LLM voice pipelines.
  • Multimodal interaction signals: watch integration of gaze and other non-verbal cues to improve turn-taking, intent resolution, and conversational robustness.
  • Accessibility-oriented semantics: watch progress on pronunciation markup, semantic 3D metadata, and embeddable voice interfaces for non-visual access.
  • Trust, consent, and governance models: watch for patterns that define clear user delegation boundaries, auditable agent actions, and privacy-preserving identity assertions.

Relevant W3C Community Groups

As highlighted in the workshop follow-up, several W3C Community Groups are particularly relevant for continuing Smart Voice Agents work before formal standardization.

W3C Breakouts Day 2026

Breakouts Day is an annual virtual unconference where the global web community proposes and explores focused problems. Although the deadline for proposals has passed, participants can still join discussions and contribute to ongoing projects.

  • Date: 25-26 March 2026.
  • Event information: W3C Events.

TPAC 2026

TPAC is W3C's premier annual event where Working Groups and Community Groups coordinate and solve cross-cutting web platform challenges.

  • When: 26 October – 30 October 2026.
  • Details: TPAC 2026.

Journal Publication

The organizers are exploring a Special Issue of an academic journal based on workshop themes. A formal Call for Papers is expected, and workshop speakers will receive a dedicated invitation when details are finalized.

  • Proposed theme focus: interoperable, real-time, multimodal, and inclusive smart voice agents.
  • Status: planning in progress; formal announcement to follow.

Thank You!

Thank you for your interest in the W3C Workshop on Smart Voice Agents. Over three half-day sessions, the workshop explored a rapidly evolving landscape and benefited from strong engagement across talks, plenary exchanges, and breakout discussions.

From Foundations and Interoperability on Day 1, to Smarter and More Inclusive Interactions on Day 2, and Real-time, Contextual, and Applied Voice Agents on Day 3, the breadth of insights shared was remarkable.

On behalf of the organizing committee, thanks to the speakers for contributing their expertise and to all attendees for thoughtful and energetic discussions. We look forward to building a smarter, more interoperable voice ecosystem.

W3C is proud to be an open and inclusive organization, focused on productive discussions and actions. Our Code of Conduct ensures that all voices can be heard.

Suggestions for improving this workshop page, such as fixing typos or adding specific topics, can be made by opening a pull request on GitHub.