Meeting minutes
3. Solving Lead vs. Lead: Consistent Pronunciation for Web Content - Sarah Wood
<dirk> Re-sent instructions via email to Patricia Lee to check whether she received them
<Frankie> should we submit questions here?
sarah: (describes the importance of standardized way to specify pronunciation in Web contents for assistive purposes)
<plh> Frankie, raise your hand on zoom
<Frankie> I'm not finding the button for that - sorry, more of a Google Meet user
<Frankie> okay got it
br: cost for that purpose?
sw: can provide examples by email
fj: question about localization
… e.g., navigation in automotive systems
… how to handle that?
… several possible pronunciations
sw: could see some algorithm
… based on geographical areas
… selecting the local pronunciation accordingly
jl: simple solution using dictionaries
… for applications
… each application can resolve the pronunciation
sw: sounds a reasonable solution
jl: don't think one single solution would fit all the possible cases
sw: need a mechanism for author control
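The dictionary-based idea discussed above can be sketched as a small resolver: a per-application lexicon that maps a heteronym plus a sense to an IPA pronunciation and emits an SSML `<phoneme>` tag, leaving author control in the application. A minimal sketch; the lexicon contents and function names are illustrative, not from the discussion.

```python
# Minimal sketch of a per-application pronunciation dictionary, as suggested
# in the discussion. The lexicon maps (word, sense) pairs to IPA strings and
# emits SSML <phoneme> markup; all identifiers here are hypothetical.

LEXICON = {
    ("lead", "metal"): "lɛd",   # the element Pb
    ("lead", "guide"): "liːd",  # to lead someone
}

def to_ssml(word: str, sense: str) -> str:
    """Wrap a word in an SSML <phoneme> tag if the lexicon knows it."""
    ipa = LEXICON.get((word.lower(), sense))
    if ipa is None:
        return word  # fall back to the synthesizer's default pronunciation
    return f'<phoneme alphabet="ipa" ph="{ipa}">{word}</phoneme>'
```

A real deployment would also need the geographical/dialect key mentioned earlier, but the per-application lookup is the core of the suggestion.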
rj: long-term problem there
… how to represent using phonemes
… how modern speech synthesis engines handle it
… this is the real major problem
… the problem is specifications to represent it
… my point is not so much the specification itself; we need to find a way to specify that
… to train models
sw: completely agree
… if you have any suggestions, happy to hear that
dd: question from @@@
sw: not sure about ideal answers
<plh> Prabhu: What about derived languages like Hinglish? e.g., suffer / safar
<plh> kaz: we used to have several workshops on i18n for SSML
<plh> ... i.e., SSML for various languages. my impression is that we should think about culture, location, background, and context for proper pronunciation.
<plh> Brian (on the chat): I will just note that the W3C Web Audio group has been taking up Web Speech again and it’s currently only dealing with STT, which it is achieving via a biasing dictionary. It will get around to TTS again, and that would be a good place to align some of these comments and feedback, to make sure that possible solutions in one area work well in another
Hallucination in Automatic Speech Recognition Systems - Bhiksha Raj
br: (describes problems around hallucination in ASR systems)
… (then summarizes the history of ASR technologies)
… (then lists current work on mitigation for hallucinations in ASR)
gc: related to noise of speech
br: rather related to mis-learning patterns
… could show details later
… specific patterns in the training data
… there are "noises" for training. that's right
<plh> kaz: from a tech/research viewpoint, it's interesting. but what is expected from the standards side? do we need a standardized way to identify those kinds of errors?
<plh> br: yes, we need some benchmarking. if I give the same output to 10 individuals, maybe 8 will say it's hallucinated. we need to formalize
<plh> ec: this is great. when using an llm, we could limit its scope, ie giving it a model. prompts augment llm, a prompt grammar can help.
<plh> br: those tend to become very large
<plh> ... having to feed that to an llm would be beyond the capacity of modern llms
<plh> ec: giving a sketch/output might be enough
<plh> ... would limit mistakes
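The "prompt grammar" idea above can be illustrated with a tiny output validator: rather than feeding a large grammar to the LLM, the application checks the model's output against a small expected structure and rejects anything outside it as a likely hallucination. Purely illustrative; the grammar and function names are assumptions, not from the talk.

```python
import re

# Illustrative sketch of constraining an LLM's output with a small grammar:
# here the expected output is "<command> <integer>" with a fixed command
# vocabulary. Anything else is rejected rather than acted upon.

GRAMMAR = re.compile(r"^(play|pause|seek) (\d+)$")

def validate(output: str):
    """Return (command, arg) if the output matches the grammar, else None."""
    m = GRAMMAR.match(output.strip())
    if m is None:
        return None
    return m.group(1), int(m.group(2))
```

The point of the exchange is the trade-off: a full grammar may exceed what an LLM can consume, but a sketch of the allowed output may already limit mistakes.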
<plh> Matt: Whisper has multilingual capabilities; did you find issues with multilingual input?
<plh> br: yes
<plh> dirk: can we detect in realtime?
<plh> br: can take a second to process. realtime tends to bring issue because you're dealing with chunks.
<plh> ... and chunks might not have complete words or phrases.
<plh> fares: on hallucination errors, you used another language and you assessed the output.
<plh> br: we tested against human annotated data
<plh> br: in the 3 way classifications, 2 humans agree most of the time.
<plh> fares: ok, makes sense, like a hello score.
Multi-Agent Conversational Methodology - Emmett Coin
<LisaNMichaud> This was interesting to me as I've had a lot of experience lately with TTS hallucination, and here we have ASR hallucination. When you think about it - an E2E system has so many opportunities for things to take an interesting direction in a single conversational turn!
ec: (describes mechanisms around common protocol for multi-agent conversational systems)
… (then shows a demo with multiple agents and a participant)
<plh> JimLarson: How does Open Floor Protocol relate to google’s A2A protocol? Are they complementary or competitive?
<plh> ec: they are different.
<plh> ... this is live interactive, communal, time saving approach.
<plh> jl: if I use the Google system, does it prohibit me from using the other system?
<plh> ec: if it's an agent, you can connect it to anything you want.
<plh> YashGelana: The ‘floor’ can essentially live at the operating system level, where each OS is “infused” with many intelligent agents (each specializing in certain actions and equipped with the required tools) and the user is interacting with agents (either one at a time, or commanding multiple agents at a time if a task requires multi-agent co-ordination)
<plh> ec: the floor is also a server that lives on the internet
<plh> ... you can invite agents
<plh> ... it would be like a hosting service
<plh> ... if it's in the OS, it becomes similar to the Google approach. this is open, like web pages.
<plh> kaz (in the chat): (can wait) very exciting ideas/approach. wondering which part should be standardized for the Web: the protocol, the API, or maybe the dialog management model?
<plh> mn: security/privacy of the data, between me and the agent, or between agents
<plh> ec: using https as a start
<plh> ... there is a way in the protocol to send a private msg
<plh> ... we have the idea of obfuscating things.
<plh> ... so that the agents can still understand some of the context
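The private-message and obfuscation idea could be sketched as a small envelope: the payload is visible only to the named recipient, while other agents on the floor see a redacted placeholder that still preserves some context. The field names and logic below are hypothetical, not taken from the Open Floor Protocol.

```python
# Hypothetical sketch of a "private message" envelope on a multi-agent
# floor: only the named recipient sees the payload; other agents get a
# redacted view that keeps sender and topic for context.

def deliver(envelope: dict, agent_id: str) -> dict:
    """Return the view of the envelope a given agent is allowed to see."""
    if envelope.get("to") == agent_id:
        return envelope
    redacted = dict(envelope)
    redacted["payload"] = "[redacted]"  # hide the content, keep the context
    return redacted

msg = {"from": "travel-agent", "to": "hotel-agent",
       "topic": "booking", "payload": "card ending 1234"}
```

Transport-level protection (HTTPS, as mentioned) would sit underneath this; the envelope only addresses visibility between agents.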
<plh> rj: have you looked at the work within the ietf, vcon
<plh> ec: nope
https://
<plh> rj: vcon is meant to capture the entire context of multiple conversations
<plh> ... they solve different problems but may be worth looking at it
<plh> ... privacy, redaction, verification of data, compliance, etc.
<plh> ec: interesting indeed. summarizing the context
Reimagining Standards for Voice AI: Interoperability Without Sacrificing Innovation - RJ Burnham
rj: (describes the recent paradigm shift from directed-dialog world to LLM-driven agents)
… (then summarizes the current practices like MCP, A2A, and AGENTS.md but they're proprietary technologies)
… (missing the flow layer, which covers portable interchange representation for voice agents)
… (what standards could help for that?)
… (shows examples)
… (need to evangelize the industries)
dd: question around knowledge graphs
gc: some experience for dialog systems
dd: questions from Yash on the Chat
… I find that thinking of LLM applications as POMDPs is a better mental model when designing conversational agents (voice or otherwise)
… A mental model I like to adopt: Tell voice agents what needs to be achieved, not how to go about achieving it
rj: what the cost would be
… humans do that all the time
… we have to handle that carefully
… should be prescriptive
Governance and Greenlights: Leveraging the '3 Ps' to Standardize Trust, Scale, and Usability in Voice Agent Web Integration - Patricia Lee
<plh> (recording)
dd: unfortunately, we don't have the speaker here but can get questions on chat
… then 10-minute break
[10 minutes break]
Breakout room assignment
Solving Lead vs. Lead: Consistent Pronunciation for Web Content: Zoom Room 2
Hallucination in Automatic Speech Recognition Systems: Zoom Room 3
Multi-Agent Conversational Methodology: Zoom Room 4
Reimagining Standards for Voice AI: Interoperability Without Sacrificing Innovation: Zoom Room 5
Breakout discussions
Plenary discussion again
dd: we can give 15 mins for each breakout group to make report
Solving Lead vs. Lead: Consistent Pronunciation for Web Content - Sarah Wood
sw: didn't generate an official slide deck but had a discussion
… we're not solving the same problem for all the languages
fj: some workarounds exist, like a pronunciation dictionary
sw: issues about misspelling
ec: demonstration with a paragraph
… you can still read it even if some of the characters are exchanged with each other
sw: W3C had a Pronunciation Task Force before
… we don't have a clear view at the moment
dd: it was a good summary by the tf
… what to do next
… maybe there should be at least a CG
sw: AI has changed across devices
ec: notation about pronunciation
… TTS used to just "read" the text
… wondering if we could think about simple solution
sw: AI is pretty good at handling phonetics, so one possible solution
dirk: forum within W3C to continue the discussion?
bk: there is a Web Speech CG
… completely aligned with the universal SSML support
bk: also support from multiple implementers: MS, Google, Igalia
… taking up TTS once the group gets chartered, probably next year
… different kinds of support, like HTML, ARIA, etc.
dd: how long do we have?
dirk: 10 more minutes?
dd: wondering how widespread pronunciation issues are
… a lot of languages have hints
… so wondering whether different languages have different issues
sw: also issues on proper names and abbreviations
dd: right
… and Spanish has a lot of dialects
sw: local languages even within dialects
jl: how words are pronounced in France, etc.
… lexicon should have information about that
sw: right
jl: different pronunciation by different people
sw: right
… teachers restrict the variations
jl: customization setting
kaz: so we need to think about who from which area is speaking as well as the language itself
Reimagining Standards for Voice AI: Interoperability Without Sacrificing Innovation - RJ Burnham
rj: how people support pronunciation and SSML
… 80% of the interfaces of the AI agents are the same as each other
… there is no architectural mismatch
… but we would like to remove the remaining pain
… the bigger question is the value
… and complexity of dialogues
… (reflects the previous issues at the time of VoiceXML)
… what is the input and output
… a lot of context there
… this is one of the pain points
… there is no consensus yet
… probably more exploratory work is needed
… would be challenging to manage the complexity
sw: exactly; we want this to work across different browsers
… any suggestions?
rj: good question
… this type of problem is hard to solve in a standardized manner
… engineers from key stakeholders to discuss with each other in a common way
… very hard to get consensus
… need a champion
ec: agree we need the next step of VoiceXML
… how to define the situation for generalization?
rj: very much the case
ec: what is the simple way to specify this?
dd: there is a mechanism of CG
… also breakout session during TPAC
plh: yeah
… fyi, there is the breakouts day in March too
… don't have to wait until TPAC
https://
Multi-Agent Conversational Methodology - Emmett Coin
ec: nice discussion about cultural differences
… automatically differing speakers
… standardizing the language level, like ESL levels
… wide range of age, culture, etc.
… also ideas of interaction layer of behavior
se: conversation patterns
… would get expertise on legality, etc., from agents
ec: excellent
… in a simple model, you only speak to one agent
… but someone can talk with one agent, and another agent can join the conversation
… simple and rule-based approach is possible
… simple interaction
… we talked about a layer of mentality also
… for various generations
… we could add those points to make the conversation smoother
dd: very different kind of agents?
ec: do you want every agent to be differentiated?
kaz: for that purpose, we might want to think about some dialogue management model
… also we could explicitly characterize each agent one by one, e.g., a funny agent and a diligent agent
ec: agree
… maybe we could have some manifest and persona for that purpose
… in some cases, we need a serious agent without any jokes
dd: we used to have persona designers
ec: advertise it and other people can see that
… could have prototype design in some way
us: would rather see one agent for a specific viewpoint
… but can't see the usefulness of having 5 different personalities one by one
ec: there could be a personality for each agent
fj: probably good to have one from psychological viewpoint
ec: different services could be provided by different agents one by one, train, hotel, etc.
us: it's about "trust", I think
… the aspect of the task
… if you go back to the user
… I trust one from some specific service
ec: could think about various possibilities
dd: one use case to consider is this
… would it be strange to ask one specific agent to handle different tasks?
ec: from an implementation viewpoint, it's much easier to let one agent handle one task
Tomorrow
dd: will have the next session tomorrow
… 10 mins for each presentation
ec: the content on Zoom chat is useful
dd: could be recorded, I think
[Session 1 adjourned]