Meeting minutes
ck: (shows a research paper)
… (about speech recognizer)
aclanthology.org/E09-1081.pdf
ck: there are 3 operations
… example of hypothesis revision
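The three operations mentioned above can be sketched as a diff between two successive recognizer hypotheses. This is a minimal illustration, assuming the operations are add, revoke, and commit as in the incremental-processing literature; the function and token names below are ours, not the paper's API, and "commit" (freezing the stable prefix) is omitted for brevity.

```python
def edits(prev, curr):
    """Diff two successive ASR hypotheses (token lists) into
    revoke/add operations relative to their common prefix."""
    i = 0
    while i < min(len(prev), len(curr)) and prev[i] == curr[i]:
        i += 1
    ops = [("revoke", tok) for tok in reversed(prev[i:])]  # retract newest first
    ops += [("add", tok) for tok in curr[i:]]
    return ops

# Example: the recognizer first hears "forty", then revises to "four tea".
print(edits(["forty"], ["four", "tea"]))
# [('revoke', 'forty'), ('add', 'four'), ('add', 'tea')]
```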
rt: (leaves)
ec: does backtracking involve some sort of semantics?
… vector of meaning
ck: recognizer doesn't detect meaning itself
… there is no need for revoking or updating
… the next step is understanding
… some map for specific task
… models for tasks
… then semantics model
… hypothesis applied
… the problem with recurrent networks
… with larger context
… the other problem is that training takes a lot of time
… memory errors, etc.
ec: in some way biased
ck: bias is there
… but understanding is not
… generalization is a bias
… none of the language models here is incremental
… output is incremental
… and input can be
… kind of R&D stage now
ec: an agent listening to me
ck: fairly well-researched question. Frankie was talking about a car environment.
ec: we did studies about a similar situation (a car environment)
ck: people are chatting with AI agents
… worth speaking with humans
ec: (people) have zero recollection of what was done while driving
ck: drivers are paying attention to driving
… that's kind of another direction
… the model should interrupt you depending on the situation
… the work by Koji Inoue
… doing turn-taking
… 4 duplex microphones
… feedback model has signal
… good starting point for interruption model
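The interruption model above can be sketched as a simple situation-aware policy: the agent takes the turn only when the feedback signals allow it. All signals and thresholds below are invented for illustration; they are not from Inoue's work or any system discussed here.

```python
def may_interrupt(silence_ms, driver_load):
    """Toy policy: allow the agent to take the turn only after enough
    silence (a crude turn-taking cue) and only when the driving
    situation is calm (load below 0.5, an invented threshold)."""
    return silence_ms >= 600 and driver_load < 0.5

print(may_interrupt(800, 0.2))   # long pause, calm driving -> True
print(may_interrupt(800, 0.9))   # lane change in progress -> False
```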
ec: who is speaking is key
ck: we have two-speaker diarization
… (describes a model)
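The two-speaker diarization just described was not captured in detail, so here is only a toy illustration of the idea, assuming per-segment speaker embeddings are already available; the embeddings, threshold, and greedy assignment are invented for the sketch, not the model discussed.

```python
import math

def cos(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def diarize_two(embeddings, threshold=0.5):
    """Greedy two-speaker assignment: the first segment founds speaker 0;
    later segments join speaker 0 if similar enough, else speaker 1."""
    labels, ref = [], None
    for e in embeddings:
        if ref is None:
            ref = e          # embedding representing speaker 0
            labels.append(0)
        else:
            labels.append(0 if cos(e, ref) >= threshold else 1)
    return labels

segments = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.95, 0.05)]
print(diarize_two(segments))  # [0, 0, 1, 0]
```

A real system would instead cluster neural speaker embeddings over time; this only shows the segment-to-speaker assignment step.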
ec: Amazon, etc., handle different speakers at home, for example
ck: diarization is a question at the moment
ec: talking to Agents is interesting
… learning conversation style might be a bit different
ck: think that's a feature or a bug
… it's not having state information
… maybe language model should handle this
… (problem with in-car system)
… what's on my calendar today?
… need a safe place to listen to the response
… want some sort of introduction
… the hard thing is the difficulty with lane changes, etc.
ec: agree that's important
kaz: which part to be standardized within W3C?
ck: not at the moment
… but streaming?
… for agentic architecture
… streaming language model
kaz: there are many related standards already
… as possible pieces
… should dive into actual use cases and see what is still missing
ck: robots would be a promising use case
… humans' expectations are much higher
… maybe LLM can do something
… interactive quality of understanding feedback
… I utter something, then robot responds
… but that's frustrating
… if robots could not just hook up APIs but also support natural communication, that would be great
kaz: thanks
ec: @@@
ck: people treat robot as having some emotion/human quality
… using a robot like Wally
… should use appropriate vocabulary for users' age
ec: what is your future?
ck: very different topic
… emotion
ec: read Minsky's book?
ck: not that but interesting research papers
… some funding for emotion for robots
ec: note that actors act emotionally
ck: there are a lot of culture-specific emotions
ec: think "disgust" is one of the universal emotions
kaz: (mentions generic format to handle emotion information, EmotionML)
https://
ec: @@@
ck: there are researchers on that
… check ACL
… Microsoft research, etc.
… multi-user attention/engagement