Meeting minutes
Smart Voice Agents - Session 3
Scene setting
dd: (gives a summary of the previous sessions)
… (and also gives instructions about the logistics)
… (asks people to put their full name on the Zoom participants list)
Do we need real-time processing capabilities on voice agents? - Casey Kennington
ck: (starts with a demo of a voice agent)
… (what about speech?)
… (challenges)
… (spoken interaction, turn-taking, clarification requests, humans process language incrementally...)
… (fast, word-level speech setting)
… (importance of incremental, word-by-word speech processing)
… (where can I start - incremental dialogue processing)
… (retico-team)
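[Ed. note: retico is the incremental dialogue framework referenced above. Below is a minimal sketch of how retico modules are chained, following the style of the retico-core README; exact module and function names may differ across versions.]

```python
# Minimal retico pipeline: stream microphone audio straight to the speaker.
# Every module exchanges incremental units (IUs) as they are produced,
# which is the word-by-word processing idea from the talk.
# Sketch only (pip install retico-core); module names may differ by version.
import retico_core
from retico_core.audio import MicrophoneModule, SpeakerModule

microphone = MicrophoneModule()  # emits audio IUs incrementally
speaker = SpeakerModule()        # consumes audio IUs as they arrive

microphone.subscribe(speaker)    # wire the modules into a pipeline

retico_core.network.run(microphone)
input("Network running; press Enter to stop.\n")
retico_core.network.stop(microphone)
```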
rt: how does this help with turn-taking?
ck: there is a model
… it uses two microphone channels
… a duplex model by Koji Inoue
ec: incremental recognizer results
… some sort of engine for trajectory?
ck: Google ASR is incremental
ec: methodology?
… what are you doing methodology-wise?
ck: we're not doing our own STT processing
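[Ed. note: "incremental" here means the recognizer emits interim hypotheses before the final result. A sketch of consuming such interim results with the Google Cloud Speech-to-Text streaming API; `audio_chunks` is a hypothetical source of raw audio buffers you would supply yourself.]

```python
# Sketch of consuming incremental (interim) results from a streaming
# recognizer, here Google Cloud Speech-to-Text (pip install google-cloud-speech).
# `audio_chunks` is a hypothetical iterable of raw 16 kHz LINEAR16 buffers,
# e.g. read from a microphone.
from google.cloud import speech

def print_incremental(audio_chunks):
    client = speech.SpeechClient()
    streaming_config = speech.StreamingRecognitionConfig(
        config=speech.RecognitionConfig(
            encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
            sample_rate_hertz=16000,
            language_code="en-US",
        ),
        interim_results=True,  # emit partial hypotheses word by word
    )
    requests = (
        speech.StreamingRecognizeRequest(audio_content=chunk)
        for chunk in audio_chunks
    )
    for response in client.streaming_recognize(streaming_config, requests):
        for result in response.results:
            # Interim results (is_final == False) may still be revised,
            # but a dialogue manager can already act on them.
            tag = "final" if result.is_final else "interim"
            print(f"[{tag}] {result.alternatives[0].transcript}")
```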
kaz: interested in the proposed timing-handling model
dirk: anything beyond actions?
… e.g., the user completing the input
ck: verbal feedback for English/Japanese
… there is a model for that purpose
… you can use retico for that
… but need to be careful
… sometimes people stop speaking
… anyway there is a model proposed by Koji Inoue
gc: training model for dialog?
ck: it's modular
… if you're interested in complex systems, you still can use retico
… time alignment for multimodal systems
Voice Agents for In-Vehicle Interaction - Frankie James
fj: (describes her background in the automotive industry)
… (modern vehicle infotainment with touch screens)
… (but how usable?)
… (example of a Chevrolet)
… (how to lock the door using the GUI)
… (6 screens to go through)
… (navigation control is not allowed)
… (touchscreens/buttons can't be the final word in vehicle HMI)
… (that's why voice agents!)
… (can gain information without distraction)
… (open research issues)
… (difficulty with recognition in vehicle)
… (focus on driving task)
… (limited attention for secondary tasks)
sw: Why did we not just go from the older control setup with physical buttons directly to voice interaction and skip the touchscreen-only stage? Were the voice channel limitations the primary reason?
<Casey> When should systems not speak when driving? https://
fj: (describes the history)
sw: safety trade-off
fj: it's an ongoing question
<Casey> Another one: https://
ms: are there processing limitations too?
fj: getting more and more onboard computing
ms: interesting
… are there ways to use phones to operate the vehicle?
… so we don't really need a third-party device?
fj: good question
… actually being looked at
… smartphone vendors would like to take over more and more capabilities
… questions around onboard vs. offboard processing
rt: can still be paired
… also you can handle multimodal cases
gc: regarding autonomous vehicles, there are various cases
fj: think they're doing a good job
gc: we can have microphone arrays in the vehicle
… it's much improved these days
fj: yeah
… but it costs a lot
ck: I was working on research on this
… if the driver is driving on a straight road...
… if someone is sitting next to you, they can stop talking
… I put several resources on IRC
… think the answer is incremental processing
… the system may stop talking depending on the situation
fj: glad to know
kaz: what about multimodality?
fj: right way to go
… tactile feedback, like vibration, could be used
ec: when I use Google Maps, there is a button using a different recognizer
fj: good point
Trust & Empathy with Multimodal Assistants - Raj Tumuluri
rt: (engineering empathy in multimodal AI)
… ("cold" capability gap)
… (e-TRICE: human-centric reliability model)
… ("warm" handshaking)
… ("sentient" agent)
… (shows examples)
… (creating digital twins for humans)
dd: a couple minutes for questions?
… 5 mins for demo
rt: (shows a live demo)
<plh> sw: How does this work with people who move around a lot at baseline, like kids, for example? Some kids in classrooms have a hard time standing still.
Beyond Screen Readers: Standardizing Embeddable Voice Agents for Universal Web Accessibility - Bryan Vuong
bv: (gives short self intro)
… (accessibility gap)
… (introducing CoBrowse AI)
… (describes how it works)
… (intelligent navigation)
… (contextual Q&A and search)
rt: how to detect which product is being referred to?
dd: questions to be handled later
bv: (action & automation)
… (shows a demo)
… (CoBrowse AI Chat with text and voice)
<plh> sw: How did you engage the blind community in the product development?
bv: we started with the problem of understanding
… what the pain points are
bk: local model?
bv: cloud service
… local component is quite light
bk: 2 more questions
… what information is used?
… how do you do things in the proper places?
… and do you support other mechanisms, like Android touch?
bv: the agent can provide information to the user
… we focus on summarization
… on the 2nd question, a chat interface is used
… users can type into it
… the voice interface is useful for blind people
sw: Are there links I could read more about the user research results with the blind community?
bv: we don't really document it, but can share other pointers
kp: In the demo, what would have happened if the user hadn’t thought to ask if there were any errors?
bv: if there is an error, the agent can detect it
… and then get back to the user
… so users don't have to ask about that every time
gc: your speech output is very quick
bv: for blind people, very fast speech output is common
… user can interact with the agent
gc: any tips on the speed-up?
… about intelligibility
bv: users can change the speed
… if it's too fast, the user can't understand
kaz: what about the internal data model and standardization?
bv: using DOM structure
… with some optimization
kaz: I asked because there are several relevant W3C standards
… we can talk about the details later
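[Ed. note: a hedged sketch of what "using the DOM structure with some optimization" could look like; this is not CoBrowse AI's actual implementation. It reduces a page to the headings, landmarks, and links a voice agent could announce, using BeautifulSoup.]

```python
# Hypothetical sketch of DOM-based page summarization for a voice agent;
# not CoBrowse AI's actual implementation (pip install beautifulsoup4).
from bs4 import BeautifulSoup

def summarize_page(html: str) -> list[str]:
    """Reduce a page's DOM to elements a voice agent might announce."""
    soup = BeautifulSoup(html, "html.parser")
    summary = []
    # Sectioning elements give the coarse landmark structure.
    for landmark in soup.find_all(["header", "nav", "main", "footer"]):
        summary.append(f"landmark: {landmark.name}")
    # Headings give the page outline.
    for heading in soup.find_all(["h1", "h2", "h3"]):
        summary.append(f"{heading.name}: {heading.get_text(strip=True)}")
    # Links are the primary actionable elements.
    for link in soup.find_all("a", href=True):
        summary.append(f"link: {link.get_text(strip=True)}")
    return summary

html = "<main><h1>Checkout</h1><a href='/cart'>Back to cart</a></main>"
for line in summarize_page(html):
    print(line)
```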
dd: 10 min break and then breakout sessions
[break till 45 mins past the hour]
Breakouts
dd: (gives instructions)
plh: we have 4 rooms for breakouts
dd: Bryan Vuong has left
dirk: so 3 breakouts
plh: when to come back?
dirk: half past the hour
dd: Philippe, you'll join all the rooms?
plh: yes
dd: ok. see you in 45 mins
[breakouts]
Breakout Results
dd: we have 40 mins for breakout results
Breakout 1
ck: Kaz reminded me of W3C standards
… we talked about when the model makes mistakes
… how a language model can handle that incrementally
… general applications using robots
… also talked about emotions
Breakout 2
fj: talked about various topics
… the concept of teaching people
… do we put in the concept of "car"?
… also the concept of a teachable moment
… then, distraction
… if I get a speech interface in the vehicle
… it may misunderstand what I want
… looking at when the speech agent fails, and teaching the agent
… the speech command may not be recognized in a noisy environment
… then, what should the car be responsible for?
… the car taking over functionality from the phone
… when the phone actually knows the content
… and we had a really interesting discussion on how to collaborate with voice agents
… a more collaborative approach
… working in parallel
sw: a general question is privacy
… worries about voice fakes
ec: interested in your own voice?
dd: wondering about car voice recognition
fj: there are good reasons for onboard processing
… due to latency
ec: is that onboard recognition?
fj: a small model on board
dd: we can probably move to a broader discussion
… what have we learned from the whole 3 sessions?
ec: we talked about multiple agents
… the incremental approach
… how to use them in vehicles
… various aspects there
… we all understand it better now
fj: one thing we discussed during our breakout:
… questions from when voice recognition was new are still relevant
dd: my undergraduate major was psychology
… how people figure out how things work
… a lot of study was done there
ec: basically, many complex models come from simple ones
fj: appreciate you saying so, Debbie
dd: also observed Kaz's points about what W3C should do for standardization
… then the majority of the presentations were about what can be done with LLMs, etc.
… practical use cases
… there is still much to be done
… we should be thinking about what to standardize at W3C
kp: using gaze systems
… and speech
… that's also a cool thing to handle
… users are doing a lot with that
dd: another point about playing with LLMs
… maybe we should have a standard API for LLMs
ec: all the recognizers had different interfaces years ago
… but pretty good now
… much improved
dd: tx to W3C :)
… that's my impression
ec: remember old browsers, e.g., Mosaic, IE, ...
bk: UA compatibility
… any of the browsers
… they're converging
dd: also a lot of discussion about timing
… very interesting discussion
ec: timing about events?
dd: no, speech timing
… using incremental recognition
ck: big question about multimodal fusion
ec: we have that problem with humans as well
… some of the signals are significantly delayed
kaz: W3C was working on a multimodal fusion standard
… also the state chart model (SCXML) as a concrete handler
… would be nice to revisit those mechanisms based on advanced use cases
… like the ones Casey mentioned
ec: @@@
dd: multimodal fusion
… EMMA was a data model for that purpose
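[Ed. note: EMMA (Extensible MultiModal Annotation) is the W3C data model mentioned here; it annotates interpretations of user input with modality, confidence, and timing. The real format is XML; below is a simplified, illustrative Python rendering of the kind of information one EMMA interpretation carries.]

```python
# Simplified, illustrative rendering of one EMMA interpretation as a
# Python dict; real EMMA is XML, and these keys only paraphrase its
# attributes (emma:mode, emma:medium, emma:confidence, emma:start/end).
interpretation = {
    "id": "int1",
    "mode": "voice",         # input modality (voice, ink, gesture, ...)
    "medium": "acoustic",
    "confidence": 0.85,      # recognizer confidence in [0, 1]
    "start": 1701000000000,  # start/end timestamps (ms) are what a fusion
    "end": 1701000000800,    # component aligns across modalities
    "interpretation": {"intent": "lock_doors", "object": "car"},
}

# A fusion component can merge interpretations from different modes by
# comparing their time intervals and confidences.
print(interpretation["mode"], interpretation["interpretation"])
```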
ec: would make sense to have a slot?
dd: yeah
… we don't handle innovation itself, though
… what should the standard be for technologies people are playing around with?
… some of the research areas might be ready to be standardized
… a couple of things before closing
… what should be done as the next step?
Next Steps
dd: we got various key takeaways
… you can also send feedback to the mailing list of the workshop PC
… the one you used for paper submission
… it's on the workshop page too
… recordings will also be available
… then, what's next?
… there are at least 4 CGs relevant to the topics discussed during the workshop
… voice interaction, autonomous agents on the Web, AI agent protocol, and semantic 3D content accessibility
… we can also start a new CG if needed
… the process is very lightweight
… also
… Philippe mentioned the W3C Breakouts Day in March
… deadline for proposals is March 10
<plh> [[ 25 March, 13:00-15:00 UTC (two 1-hour slots), 26 March, 21:00-23:00 UTC (two 1-hour slots) ]]
dd: then
… W3C TPAC 2026 in October
… hybrid meeting (F2F in Dublin and remote by Zoom)
… then
… a possible special issue of the Journal on Multimodal User Interfaces
… the last slide is for thanking all the PC members
… speakers and attendees!
… the archived recordings will be available at some point on YouTube
plh: yes
dirk: thanks from me too
plh: thanks, Debbie and Dirk, for chairing
dd: btw, if you have a template for the workshop report, that would be nice
plh: can refer to Brian's brief report :)
Report: we had a workshop.
It was good.
There are recordings.
:)
[workshop adjourned]