Meeting minutes
3. Solving Lead vs. Lead: Consistent Pronunciation for Web Content - Sarah Wood
<dirk> Re-sent instructions via email to Patricia Lee to check whether she received them
<Frankie> should we submit questions here?
sarah: (describes the importance of standardized way to specify pronunciation in Web contents for assistive purposes)
<plh> Frankie, raise your hand on zoom
<Frankie> I'm not finding the button for that - sorry, more of a Google Meet user
<Frankie> okay got it
br: cost for that purpose?
sw: can provide examples by email
fj: question about localization
… e.g., navigation in automotive systems
… how to handle that?
… several possible pronunciations
sw: could see some algorithm
… based on geographical areas
… selecting the local pronunciation accordingly
jl: simple solution using dictionaries
… for applications
… each application can resolve the pronunciation
sw: sounds a reasonable solution
jl: don't think one single solution would fit all the possible cases
sw: need a mechanism for author control
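The dictionary-based idea discussed above can be sketched as a small resolver: a per-application lexicon that maps a heteronym plus a sense to an IPA pronunciation and emits an SSML `<phoneme>` tag, leaving author control in the application. A minimal sketch; the lexicon contents and function names are illustrative, not from the discussion.

```python
# Minimal sketch of a per-application pronunciation dictionary, as suggested
# in the discussion. The lexicon maps (word, sense) pairs to IPA strings and
# emits SSML <phoneme> markup; all identifiers here are hypothetical.

LEXICON = {
    ("lead", "metal"): "lɛd",   # the element Pb
    ("lead", "guide"): "liːd",  # to lead someone
}

def to_ssml(word: str, sense: str) -> str:
    """Wrap a word in an SSML <phoneme> tag if the lexicon knows it."""
    ipa = LEXICON.get((word.lower(), sense))
    if ipa is None:
        return word  # fall back to the synthesizer's default pronunciation
    return f'<phoneme alphabet="ipa" ph="{ipa}">{word}</phoneme>'
```

A real deployment would also need the geographical/dialect key mentioned earlier, but the per-application lookup is the core of the suggestion.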
rj: long-term problem there
… how to represent using phonemes
… how modern speech synthesis engines handle it
… this is the real major problem
… the problem is specifications to represent it
… my point is not so much the specification itself; we need to find a way to specify that
… to train models
sw: completely agree
… if you have any suggestions, happy to hear that
dd: question from @@@
sw: not sure about ideal answers
<plh> Prabhu: What about derived languages like Hinglish? e.g., suffer / safar
<plh> kaz: we used to have several workshops on i18n for SSML
<plh> ... i.e., SSML for various languages. my impression is that we should think about culture, location, background, and context for proper pronunciation.
<plh> Brian (on the chat): I will just note that the W3C Web Audio group has been taking up Web Speech again and it’s currently only dealing with STT, which it is achieving via a biasing dictionary. It will get around to TTS again, and that would be a good place to align some of these comments and feedback, to make sure that possible solutions in one area work well in another
Hallucination in Automatic Speech Recognition Systems - Bhiksha Raj
br: (describes problems around hallucination in ASR systems)
… (then summarizes the history of ASR technologies)
… (then lists current work on mitigation for hallucinations in ASR)
gc: related to noise of speech
br: rather related to mis-learning patterns
… could show details later
… specific patterns in the training data
… there are "noises" for training. that's right
<plh> kaz: from a tech/research viewpoint, it's interesting. but what is expected from the standards side? do we need a standardized way to identify those kinds of errors?
<plh> br: yes, we need some benchmarking. if I give the same output to 10 individuals, maybe 8 will say it's hallucinated. we need to formalize
<plh> ec: this is great. when using an llm, we could limit its scope, ie giving it a model. prompts augment llm, a prompt grammar can help.
<plh> br: those tend to become very large
<plh> ... having to feed that to an llm would be beyond the capacity of modern llms
<plh> ec: giving a sketch/output might be enough
<plh> ... would limit mistakes
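The "prompt grammar" idea above can be illustrated with a tiny output validator: rather than feeding a large grammar to the LLM, the application checks the model's output against a small expected structure and rejects anything outside it as a likely hallucination. Purely illustrative; the grammar and function names are assumptions, not from the talk.

```python
import re

# Illustrative sketch of constraining an LLM's output with a small grammar:
# here the expected output is "<command> <integer>" with a fixed command
# vocabulary. Anything else is rejected rather than acted upon.

GRAMMAR = re.compile(r"^(play|pause|seek) (\d+)$")

def validate(output: str):
    """Return (command, arg) if the output matches the grammar, else None."""
    m = GRAMMAR.match(output.strip())
    if m is None:
        return None
    return m.group(1), int(m.group(2))
```

The point of the exchange is the trade-off: a full grammar may exceed what an LLM can consume, but a sketch of the allowed output may already limit mistakes.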
<plh> Matt: Whisper has multilingual capabilities; did you find issues with multilingual input?
<plh> br: yes
<plh> dirk: can we detect in realtime?
<plh> br: can take a second to process. realtime tends to bring issue because you're dealing with chunks.
<plh> ... and chunks might not have complete words or phrases.
<plh> fares: on hallucination errors, you used another language and you assessed the output.
<plh> br: we tested against human annotated data
<plh> br: in the 3 way classifications, 2 humans agree most of the time.
<plh> fares: ok, makes sense, like a hello score.
Multi-Agent Conversational Methodology - Emmett Coin
<LisaNMichaud> This was interesting to me as I've had a lot of experience lately with TTS hallucination, and here we have ASR hallucination. When you think about it - an E2E system has so many opportunities for things to take an interesting direction in a single conversational turn!
ec: (describes mechanisms around common protocol for multi-agent conversational systems)
… (then shows a demo with multiple agents and a participant)
<plh> JimLarson: How does Open Floor Protocol relate to google’s A2A protocol? Are they complementary or competitive?
<plh> ec: they are different.
<plh> ... this is live interactive, communal, time saving approach.
<plh> jl: if I use the Google system, does it prohibit me from using the other system?
<plh> ec: if it's an agent, you can connect it to anything you want.
<plh> YashGelana: The ‘floor’ can essentially live at the operating system level, where each OS is “infused” with many intelligent agents (each specializing in certain actions and equipped with the required tools) and the user is interacting with agents (either one at a time, or commanding multiple agents at a time if a task requires multi-agent co-ordination)
<plh> ec: the floor is also a server that lives on the internet
<plh> ... you can invite agents
<plh> ... it would be like a hosting service
<plh> ... if it's in the OS, it becomes similar to the Google approach. this is open, like web pages.
<plh> kaz (in the chat): (can wait) very exciting ideas/approach. wondering which part should be standardized for the Web: the protocol, the API, or maybe the dialog management model?
<plh> mn: security/privacy of the data, between me and the agent, or between agents
<plh> ec: using https as a start
<plh> ... there is a way in the protocol to send a private msg
<plh> ... we have the idea of obfuscating things.
<plh> ... so that the agents can still understand some of the context
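The private-message and obfuscation idea could be sketched as a small envelope: the payload is visible only to the named recipient, while other agents on the floor see a redacted placeholder that still preserves some context. The field names and logic below are hypothetical, not taken from the Open Floor Protocol.

```python
# Hypothetical sketch of a "private message" envelope on a multi-agent
# floor: only the named recipient sees the payload; other agents get a
# redacted view that keeps sender and topic for context.

def deliver(envelope: dict, agent_id: str) -> dict:
    """Return the view of the envelope a given agent is allowed to see."""
    if envelope.get("to") == agent_id:
        return envelope
    redacted = dict(envelope)
    redacted["payload"] = "[redacted]"  # hide the content, keep the context
    return redacted

msg = {"from": "travel-agent", "to": "hotel-agent",
       "topic": "booking", "payload": "card ending 1234"}
```

Transport-level protection (HTTPS, as mentioned) would sit underneath this; the envelope only addresses visibility between agents.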
<plh> rj: have you looked at the work within the ietf, vcon
<plh> ec: nope
https://
<plh> rj: vcon is meant to capture the entire context of multiple conversations
<plh> ... they solve different problems but may be worth looking at it
<plh> ... privacy, redaction, verification of data, compliance, etc.
<plh> ec: interesting indeed. summarizing the context
Reimagining Standards for Voice AI: Interoperability Without Sacrificing Innovation - RJ Burnham
rj: (describes the recent paradigm shift from directed-dialog world to LLM-driven agents)
… (then summarizes the current practices like MCP, A2A, and AGENTS.md but they're proprietary technologies)
… (missing the flow layer, which covers portable interchange representation for voice agents)
… (what standards could help for that?)
… (shows examples)
… (need to evangelize the industries)
dd: question around knowledge graphs
gc: some experience for dialog systems
dd: questions from Yash on the Chat
… I find that thinking of LLM applications as POMDPs is a better mental model when designing conversational agents (voice or otherwise)
… A mental model I like to adopt: Tell voice agents what needs to be achieved, not how to go about achieving it
rj: what the cost would be
… humans do that all the time
… we have to handle that carefully
… should be prescriptive
Governance and Greenlights: Leveraging the '3 Ps' to Standardize Trust, Scale, and Usability in Voice Agent Web Integration - Patricia Lee
<plh> (recording)
dd: unfortunately, we don't have the speaker here but can get questions on chat
… then 10-minute break
[10 minutes break]
Breakout room assignment
Solving Lead vs. Lead: Consistent Pronunciation for Web Content: Zoom Room 2
Hallucination in Automatic Speech Recognition Systems: Zoom Room 3
Multi-Agent Conversational Methodology: Zoom Room 4
Reimagining Standards for Voice AI: Interoperability Without Sacrificing Innovation: Zoom Room 5
Breakout discussions
Plenary discussion again
dd: we can give 15 mins for each breakout group to make report
Solving Lead vs. Lead: Consistent Pronunciation for Web Content - Sarah Wood
sw: didn't generate an official slide deck but had a discussion
… we're not solving the same problem for all the languages
fj: some workarounds exist, like a pronunciation dictionary
sw: issues about misspelling
ec: demonstration with a paragraph
… you can still read it even if some of the characters are exchanged with each other
sw: W3C had a Pronunciation Task Force before
… we don't have a clear view at the moment
dd: it was a good summary by the tf
… what to do next
… maybe there should be at least a CG
sw: AI has changed across devices
ec: notation about pronunciation
… TTS used to just "read" the text
… wondering if we could think about simple solution
sw: AI is pretty good at handling phonetics, so one possible solution
dirk: forum within W3C to continue the discussion?
bk: there is a Web Speech CG
… completely aligned with the universal SSML support
bk: also support from multiple implementers: MS, Google, Igalia
… taking up TTS once the group gets chartered, probably next year
… different kinds of support, like HTML, ARIA, etc.
dd: how long do we have?
dirk: 10 more minutes?
dd: wondering how widespread pronunciation issues are
… a lot of languages have hints
… so wondering whether different languages have different issues
sw: also issues on proper names and abbreviations
dd: right
… and Spanish has a lot of dialects
sw: local languages even within dialects
jl: how words are pronounced in France, etc.
… lexicon should have information about that
sw: right
jl: different pronunciation by different people
sw: right
… teachers restrict the variations
jl: customization setting
kaz: so we need to think about who from which area is speaking as well as the language itself
Reimagining Standards for Voice AI: Interoperability Without Sacrificing Innovation - RJ Burnham
rj: how people support pronunciation and SSML
… 80% of the interfaces of the AI agents are the same as each other
… there is no architectural mismatch
… but we would like to remove the remaining pain
… the bigger question is the value
… and complexity of dialogues
… (reflects the previous issues at the time of VoiceXML)
… what is the input and output
… a lot of context there
… this is one of the pain points
… there is no consensus yet
… probably more exploratory work is needed
… would be challenging to manage the complexity
sw: exactly; we want this to work across different browsers
… any suggestions?
rj: good question
… this type of problem is hard to solve in a standardized manner
… engineers from key stakeholders to discuss with each other in a common way
… very hard to get consensus
… need a champion
ec: agree we need the next step of VoiceXML
… how to define the situation for generalization?
rj: very much the case
ec: what is the simple way to specify this?
dd: there is a mechanism of CG
… also breakout session during TPAC
plh: yeah
… fyi, there is the breakouts day in March too
… don't have to wait until TPAC
https://
Multi-Agent Conversational Methodology - Emmett Coin
ec: nice discussion about cultural differences
… automatically differing speakers
… standardizing the language level, like ESL levels
… wide range of age, culture, etc.
… also ideas of interaction layer of behavior
se: conversation patterns
… would get expertise on legality, etc., from agents
ec: excellent
… in a simple model, you only speak to one agent
… but someone can talk with one agent, and another agent can join the conversation
… simple and rule-based approach is possible
… simple interaction
… we talked about a layer of mentality also
… for various generations
… we could add those points to make the conversation smoother
dd: very different kind of agents?
ec: do you want every agent to be differentiated?
kaz: for that purpose, we might want to think about some dialogue management model
… also we could explicitly characterize each agent one by one, e.g., a funny agent and a diligent agent
ec: agree
… maybe we could have some manifest and persona for that purpose
… in some cases, we need a serious agent without any jokes
dd: we used to have persona designers
ec: advertise it and other people can see that
… could have prototype design in some way
us: would rather see one agent for a specific viewpoint
… but can't see the usefulness of having 5 different personalities one by one
ec: there could be a personality for each agent
fj: probably good to have one from psychological viewpoint
ec: different services could be provided by different agents one by one, train, hotel, etc.
us: it's about "trust", I think
… the aspect of the task
… if you go back to the user
… I trust one from some specific service
ec: could think about various possibilities
dd: one use case to consider is this
… would it be strange to ask one specific agent to handle different tasks?
ec: from an implementation viewpoint, it's much easier to let one agent handle one task
Tomorrow
dd: will have the next session tomorrow
… 10 mins for each presentation
ec: the content on Zoom chat is useful
dd: could be recorded, I think
[Session 1 adjourned]