Meeting minutes
Smart Voice Agents - Session 3
Scene setting
dd: (gives a summary of the previous sessions)
… (and also gives instructions about the logistics)
… (asks people to put their full name on the Zoom participants list)
Do we need real-time processing capabilities on voice agents? - Casey Kennington
ck: (starts with a demo of a voice agent)
… (what about speech?)
… (challenges)
… (spoken interaction, turn-taking, clarification requests, humans process language incrementally...)
… (fast, word-level speech setting)
… (importance of incremental, word-by-word speech processing)
… (where can I start - incremental dialogue processing)
… (retico-team)
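[Ed. note: retico is the incremental dialogue framework referenced above. Below is a minimal sketch of how retico modules are chained, following the style of the retico-core README; exact module and function names may differ across versions.]

```python
# Minimal retico pipeline: stream microphone audio straight to the speaker.
# Every module exchanges incremental units (IUs) as they are produced,
# which is the word-by-word processing idea from the talk.
# Sketch only (pip install retico-core); module names may differ by version.
import retico_core
from retico_core.audio import MicrophoneModule, SpeakerModule

microphone = MicrophoneModule()  # emits audio IUs incrementally
speaker = SpeakerModule()        # consumes audio IUs as they arrive

microphone.subscribe(speaker)    # wire the modules into a pipeline

retico_core.network.run(microphone)
input("Network running; press Enter to stop.\n")
retico_core.network.stop(microphone)
```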
rt: how does this help with turn-taking?
ck: there is a model
… it uses two microphone channels
… a duplex model by Koji Inoue
ec: incremental recognizer results
… some sort of engine for trajectory?
ck: Google ASR is incremental
ec: methodology?
… what are you doing methodology-wise?
ck: we're not doing our own STT processing
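[Ed. note: "incremental" here means the recognizer emits interim hypotheses before the final result. A sketch of consuming such interim results with the Google Cloud Speech-to-Text streaming API; `audio_chunks` is a hypothetical source of raw audio buffers you would supply yourself.]

```python
# Sketch of consuming incremental (interim) results from a streaming
# recognizer, here Google Cloud Speech-to-Text (pip install google-cloud-speech).
# `audio_chunks` is a hypothetical iterable of raw 16 kHz LINEAR16 buffers,
# e.g. read from a microphone.
from google.cloud import speech

def print_incremental(audio_chunks):
    client = speech.SpeechClient()
    streaming_config = speech.StreamingRecognitionConfig(
        config=speech.RecognitionConfig(
            encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
            sample_rate_hertz=16000,
            language_code="en-US",
        ),
        interim_results=True,  # emit partial hypotheses word by word
    )
    requests = (
        speech.StreamingRecognizeRequest(audio_content=chunk)
        for chunk in audio_chunks
    )
    for response in client.streaming_recognize(streaming_config, requests):
        for result in response.results:
            # Interim results (is_final == False) may still be revised,
            # but a dialogue manager can already act on them.
            tag = "final" if result.is_final else "interim"
            print(f"[{tag}] {result.alternatives[0].transcript}")
```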
kaz: interested in the proposed timing-handling model
dirk: anything beyond actions?
… e.g., the user completing the input
ck: verbal feedback for English/Japanese
… there is a model for that purpose
… you can use retico for that
… but need to be careful
… sometimes people stop speaking
… anyway there is a model proposed by Koji Inoue
gc: training model for dialog?
ck: it's modular
… if you're interested in complex systems, you still can use retico
… time alignment for multimodal systems
Voice Agents for In-Vehicle Interaction - Frankie James
fj: (describes her background in the automotive industry)
… (modern vehicle infotainment with touch screens)
… (but how usable?)
… (example of a Chevrolet)
… (how to lock the door using the GUI)
… (6 screens to go through)
… (navigation control is not allowed)
… (touchscreens/buttons can't be the final word in vehicle HMI)
… (that's why voice agents!)
… (can gain information without distraction)
… (open research issues)
… (difficulty with recognition in vehicle)
… (focus on driving task)
… (limited attention for secondary tasks)
sw: Why did we not just go from the older control setup with physical buttons directly to voice interaction and skip the touchscreen-only stage? Were the voice channel limitations the primary reason?
<Casey> When should systems not speak when driving? https://
fj: (describes the history)
sw: safety trade-off
fj: it's an ongoing question
<Casey> Another one: https://
ms: are there processing limitations too?
fj: getting more and more onboard computing
ms: interesting
… are there ways to use phones to operate the vehicle?
… so we don't really need a third-party device?
fj: good question
… actually being looked at
… smartphone vendors would like to take over more and more capabilities
… questions around onboard vs. offboard processing
rt: can still be paired
… also you can handle multimodal cases
gc: regarding autonomous vehicles, there are various cases
fj: think they're doing a good job
gc: we can have microphone arrays in the vehicle
… it's much improved these days
fj: yeah
… but it costs a lot
ck: I was working on research on this
… if the driver is driving on a straight road...
… if someone is sitting next to you, they can stop talking
… I put several resources on IRC
… think the answer is incremental processing
… the system may stop talking depending on the situation
fj: glad to know
kaz: what about multimodality?
fj: right way to go
… tactile feedback, like vibration, could be used
ec: when I use Google Maps, there is a button using a different recognizer
fj: good point
Trust & Empathy with Multimodal Assistants - Raj Tumuluri
rt: (engineering empathy in multimodal AI)
… ("cold" capability gap)
… (e-TRICE: human-centric reliability model)
… ("warm" handshaking)
… ("sentient" agent)
… (shows examples)
… (creating digital twins for humans)
dd: a couple minutes for questions?
… 5 mins for demo
rt: (shows a live demo)
<plh> sw: How does this work with people who move around a lot at baseline, like kids, for example? Some kids in classrooms have a hard time standing still.
Beyond Screen Readers: Standardizing Embeddable Voice Agents for Universal Web Accessibility - Bryan Vuong
bv: (gives short self intro)
… (accessibility gap)
… (introducing CoBrowse AI)
… (describes how it works)
… (intelligent navigation)
… (contextual Q&A and search)
rt: how to detect which product is being referred to?
dd: questions to be handled later
bv: (action & automation)
… (shows a demo)
… (CoBrowse AI Chat with text and voice)
<plh> sw: How did you engage the blind community in the product development?
bv: we started with the problem of understanding
… what the pain points are
bk: local model?
bv: cloud service
… local component is quite light
bk: 2 more questions
… what information is used?
… how do you do things in the proper places?
… and do you support other mechanisms, like Android touch?
bv: the agent can provide information to the user
… we focus on summarization
… on the 2nd question, a chat interface is used
… users can type into it
… the voice interface is useful for blind people
sw: Are there links I could read more about the user research results with the blind community?
bv: we don't really document it, but can share other pointers
kp: In the demo, what would have happened if the user hadn’t thought to ask if there were any errors?
bv: if there is an error, the agent can detect it
… and then get back to the user
… so users don't have to ask about that every time
gc: your speech output is very quick
bv: for blind people, very fast speech output is common
… user can interact with the agent
gc: any tips on the speed-up?
… about intelligibility
bv: users can change the speed
… if it's too fast, the user can't understand
kaz: what about the internal data model and standardization?
bv: using DOM structure
… with some optimization
kaz: I asked because there are several relevant W3C standards
… we can talk about the details later
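[Ed. note: a hedged sketch of what "using the DOM structure with some optimization" could look like; this is not CoBrowse AI's actual implementation. It reduces a page to the headings, landmarks, and links a voice agent could announce, using BeautifulSoup.]

```python
# Hypothetical sketch of DOM-based page summarization for a voice agent;
# not CoBrowse AI's actual implementation (pip install beautifulsoup4).
from bs4 import BeautifulSoup

def summarize_page(html: str) -> list[str]:
    """Reduce a page's DOM to elements a voice agent might announce."""
    soup = BeautifulSoup(html, "html.parser")
    summary = []
    # Sectioning elements give the coarse landmark structure.
    for landmark in soup.find_all(["header", "nav", "main", "footer"]):
        summary.append(f"landmark: {landmark.name}")
    # Headings give the page outline.
    for heading in soup.find_all(["h1", "h2", "h3"]):
        summary.append(f"{heading.name}: {heading.get_text(strip=True)}")
    # Links are the primary actionable elements.
    for link in soup.find_all("a", href=True):
        summary.append(f"link: {link.get_text(strip=True)}")
    return summary

html = "<main><h1>Checkout</h1><a href='/cart'>Back to cart</a></main>"
for line in summarize_page(html):
    print(line)
```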
dd: 10 min break and then breakout sessions
[break till 45 mins past the hour]
Breakouts
dd: (gives instructions)
plh: we have 4 rooms for breakouts
dd: Bryan Vuong has left
dirk: so 3 breakouts
plh: when to come back?
dirk: half past the hour
dd: Philippe, you'll join all the rooms?
plh: yes
dd: ok. see you in 45 mins
[breakouts]
Breakout Results
dd: we have 40 mins for breakout results
Breakout 1
ck: Kaz reminded me of W3C standards
… we talked about when the model makes mistakes
… how a language model can handle that incrementally
… general applications using robots
… also talked about emotions
Breakout 2
fj: talked about various topics
… the concept of teaching people
… do we put in the concept of "car"?
… also the concept of a teachable moment
… then, distraction
… if I get a speech interface in the vehicle
… it may misunderstand what I want
… looking at when the speech agent fails, and teaching the agent
… the speech command may not be recognized in a noisy environment
… then, what should the car be responsible for?
… the car taking over functionality from the phone
… when the phone actually knows the content
… and we had a really interesting discussion on how to collaborate with voice agents
… a more collaborative approach
… working in parallel
sw: a general question is privacy
… worries about voice fakes
ec: interested in your own voice?
dd: wondering about car voice recognition
fj: there are good reasons for onboard processing
… due to latency
ec: is that onboard recognition?
fj: a small model on board
dd: we can probably move to a broader discussion
… what have we learned from the whole 3 sessions?
ec: we talked about multiple agents
… the incremental approach
… how to use them in vehicles
… various aspects there
… we all understand it better now
fj: one thing we discussed during our breakout:
… questions from when voice recognition was new are still relevant
dd: my undergraduate major was psychology
… how people figure out how things work
… a lot of study was done there
ec: basically, many complex models come from simple ones
fj: appreciate you saying so, Debbie
dd: also observed Kaz's points about what W3C should do for standardization
… then the majority of the presentations were about what can be done with LLMs, etc.
… practical use cases
… there is still much to be done
… we should be thinking about what to standardize at W3C
kp: using gaze systems
… and speech
… that's also a cool thing to handle
… users are doing a lot with that
dd: another point about playing with LLMs
… maybe we should have a standard API for LLMs
ec: all the recognizers had different interfaces years ago
… but pretty good now
… much improved
dd: tx to W3C :)
… that's my impression
ec: remember old browsers, e.g., Mosaic, IE, ...
bk: UA compatibility
… any of the browsers
… they're converging
dd: also a lot of discussion about timing
… very interesting discussion
ec: timing about events?
dd: no, speech timing
… using incremental recognition
ck: big question about multimodal fusion
ec: we have that problem with humans as well
… some of the signals are significantly delayed
kaz: W3C was working on a multimodal fusion standard
… also the state chart model (SCXML) as a concrete handler
… would be nice to revisit those mechanisms based on advanced use cases
… like the ones Casey mentioned
ec: @@@
dd: multimodal fusion
… EMMA was a data model for that purpose
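[Ed. note: EMMA (Extensible MultiModal Annotation) is the W3C data model mentioned here; it annotates interpretations of user input with modality, confidence, and timing. The real format is XML; below is a simplified, illustrative Python rendering of the kind of information one EMMA interpretation carries.]

```python
# Simplified, illustrative rendering of one EMMA interpretation as a
# Python dict; real EMMA is XML, and these keys only paraphrase its
# attributes (emma:mode, emma:medium, emma:confidence, emma:start/end).
interpretation = {
    "id": "int1",
    "mode": "voice",         # input modality (voice, ink, gesture, ...)
    "medium": "acoustic",
    "confidence": 0.85,      # recognizer confidence in [0, 1]
    "start": 1701000000000,  # start/end timestamps (ms) are what a fusion
    "end": 1701000000800,    # component aligns across modalities
    "interpretation": {"intent": "lock_doors", "object": "car"},
}

# A fusion component can merge interpretations from different modes by
# comparing their time intervals and confidences.
print(interpretation["mode"], interpretation["interpretation"])
```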
ec: would make sense to have a slot?
dd: yeah
… we don't handle innovation itself, though
… what should the standard be for technologies people are playing around with?
… some of the research areas might be ready to be standardized
… a couple of things before closing
… what should be done as the next step?
Next Steps
dd: we got various key takeaways
… you can also send feedback to the mailing list of the workshop PC
… the one you used for paper submission
… it's on the workshop page too
… recordings will also be available
… then, what's next?
… there are at least 4 CGs relevant to the topics discussed during the workshop
… voice interaction, autonomous agents on the Web, AI agent protocol, and semantic 3D content accessibility
… we can also start a new CG if needed
… the process is very lightweight
… also
… Philippe mentioned the W3C Breakouts Day in March
… deadline for proposals is March 10
<plh> [[ 25 March, 13:00-15:00 UTC (two 1-hour slots), 26 March, 21:00-23:00 UTC (two 1-hour slots) ]]
dd: then
… W3C TPAC 2026 in October
… hybrid meeting (F2F in Dublin and remote by Zoom)
… then
… a possible special issue of the Journal on Multimodal User Interfaces
… the last slide is for thanking all the PC members
… speakers and attendees!
… the archived recordings will be available at some point on YouTube
plh: yes
dirk: thanks from me too
plh: thanks, Debbie and Dirk, for chairing
dd: btw, if you have a template for the workshop report, that would be nice
plh: can refer to Brian's brief report :)
Report: we had a workshop.
It was good.
There are recordings.
:)
[workshop adjourned]