W3C

– DRAFT –
W3C Workshop on Smart Voice Agents

26 February 2026

Attendees

Present
BenjaminWeiss, ChiaraDeMartin, DeborahDahl, DirkSchnelle-Walka, FaresAbawi, GerardChollet, GinaSmith, IK, Ngoo, PaolaDiMaio, ShadiAbou-Zahra, UlrikeStiefelhagen, ZoharGan
Regrets
-
Chair
-
Scribe
kaz

Meeting minutes

Smart Voice Agents - Session 2

Scene setting

dd: (gives explanations on the goals, logistics and expectations)

Accessibility of 3D and Immersive Content via Voice Interaction - Zohar Gan

zg: (describes difficulties and challenges around accessibility for 3D/Immersive Content)

kaz: great input for existing W3C groups around media handling and web-based digital twins
… I suggest you work with those groups
… I can send you details about those groups later

zg: great

ec: this is great and beneficial

zg: tx!

pdm: quick question
… tx for your presentation
… a consideration about natural voice
… whether the developers are thinking of making the system's voice more natural
… the naturalness of the voice

zg: so, you are asking whether we would make the voice more natural?

pdm: yes

zg: we are doing a PoC
… it's a good idea to explore
… our previous solution used recordings of people's own voices
… another approach is using a Web API for a more natural voice and customization
… maybe we can use that
… from the practical viewpoint, it can be fast enough

pdm: tx
… there are advances in natural speech
… so I just want to suggest
… the UX would be better with more natural speech
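The Web API alluded to here is presumably the Web Speech API's speech synthesis interface, which exposes the available voices in browsers via speechSynthesis.getVoices(). A minimal sketch of picking a more natural-sounding voice from such a list; the preference heuristic is an illustrative assumption, not part of the spec:

```javascript
// Sketch: choose a voice from the Web Speech API's voice list.
// In browsers, voice objects carry { name, lang, localService, default };
// the "prefer non-local" heuristic below is only an example.
function pickVoice(voices, lang = "en-US") {
  const candidates = voices.filter((v) => v.lang === lang);
  if (candidates.length === 0) return null;
  // Non-local voices are often higher-quality, cloud-backed ones.
  return candidates.find((v) => !v.localService) || candidates[0];
}

// In a browser one would then do (not runnable outside a browser):
//   const u = new SpeechSynthesisUtterance("Hello");
//   u.voice = pickVoice(speechSynthesis.getVoices());
//   speechSynthesis.speak(u);
```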

Transition of Use Cases for Voice to LLM-based RAG or Agent setups in difficult scenarios - Ulrike Stiefelhagen

us: (presents about hallucinations in voice agents)
… (describes several use cases)
… (existing difficulties with voice interfaces)
… (mentions the patients' voice assistant "Juki")
… (pros and cons of a voice interface)
… (the opportunity of GenAI and voice)
… (another use case: the workers' voice assistant "Helping Harry")
… (within a noisy environment)
… (the opportunity for GenAI & voice again)
… (dealing with different pronunciations)

ec: years ago I worked on a similar system for a warehouse
… the question was the demanding timing
… can a GenAI model handle that?

us: we didn't use picking
… we chose workstations where workers build things up
… regarding the speed/timing, we also see difficulty
… we don't generate timing
… it's basically done ahead
… we don't need variation at the moment
… we estimate the timing for our work
… a traffic light shows the time needed
… each work step depends on a fraction of time
… 150 steps around the clock
… there is some flexibility

kaz: I worked on real-time-OS-based speech timing for my Ph.D. thesis 10 years ago
… it depends on the use cases, but strict timing management would be useful for more natural dialogue-based communication with GenAI
… I would suggest we work on that kind of advanced use case as well

Towards Smarter Voice Interfaces: Using Grounding and Knowledge - Kristiina Jokinen

dd: I can play your video for you

kj: sorry about my bad connection today

dd: (starts Kristiina's recorded video)

kj: (gives her talk about smarter voice interfaces)
… (challenges for voice interaction)
… (Errors in voice interaction)
… (Grounding: Collaborative mechanism)
… (Knowledge graphs for grounding)
… (Agentic Architecture)

dd: tx!
… do you think you can take questions?

kj: let's try!

ec: 2 questions
… formalism for grounding structure?
… knowledge graphs and DB?
… passed from one AI agent to another?

kj: very important point
… so far we have basically worked on different projects
… the expertise from those projects could be applied
… we also explore what we can do
… it's very important to try to work on some formalization
… so that others can easily build their systems
… we should dive into a bit more detail
… about how to build the systems
… so far we've had a kind of dialog modelling aspect
… but making it computable needs more discussion

ec: another simple question
… do you address interaction, e.g., about dates expressed in different styles?

kj: good question
… the same entity can have different expressions within the knowledge graph
… if you can actually refer to an entity which is already grounded, the description could be shortened

ec: tx, interesting

Towards Web Standards for Configurable Naturally Responsive Voice Interaction for AI Agents - Paola Di Maio

dd: Paola has an issue with her connection
… maybe Gerard could show her recorded video

gc: let me try
… (shows Paola's recorded video)
… wondering about the sound

dd: try again

pdm: (traditional pipeline for speech-to-speech: STT -> LLM -> TTS)
… (great tech but missing UX)
… (7 critical usability failures)
… (proposed UX requirements)
… (then shows a demo)
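The traditional pipeline named in the talk can be sketched as one asynchronous turn chaining the three stages. The stage functions here are hypothetical placeholders (any real STT/LLM/TTS backend could be injected); this is only a sketch of the composition, not the presenter's implementation:

```javascript
// Sketch of the traditional speech-to-speech pipeline: STT -> LLM -> TTS.
// The three stage functions are injected placeholders, so the turn
// logic stays independent of any particular backend.
async function speechTurn(audioIn, { stt, llm, tts }) {
  const text = await stt(audioIn);   // speech to text
  const reply = await llm(text);     // text to response text
  return await tts(reply);           // response text to audio
}
```

Each stage adds latency, which is one source of the usability failures the talk enumerates; streaming partial results between stages is a common mitigation.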

ec: is there any work around speech recognition
… to detect whether the speech is finished or not?

pdm: very technical
… there is some work done

kaz: when I was an ASR researcher 10 years ago, I worked on a kind of big speech corpus to detect the readiness of utterances for interactive dialog
… maybe we might think about an even bigger dialog corpus for today's use cases from a research viewpoint

pdm: yeah, many things to be done for research

kj: did you just use the information for end-to-end interaction?

pdm: there is a draft paper as the basis of the demo
… there are 5 use cases there
… each use case uses a specific script
… note that models are changing every moment
… variables related to the users also change
… I have an outline

dd: some more discussion would be great during the breakout sessions

Gaze-Aware Dialog Systems - Fares Abawi

gc: (shows recorded video for Fares)
… (underutilized modality)
… (where gaze matters)
… (multimodal pipeline)
… (multimodal fusion)
… (neural integration)

<PaolaDM> Thanks for setting up the IRC channel

gc: (standardization needs)

pdm: wondering what kind/level of role "gaze" would play

dd: a use case is trained users

pdm: how much to train the users for the gaze modality is a question

kaz: this reminded us of the W3C MMI WG's work
… maybe we can talk with him about that too

dd: yeah

us: in the factory context, we can add context using the gaze modality

kj: very interesting work
… there is a lot of research around gaze and turn taking
… it's a kind of pointing device
… wondering if there is some way to distinguish things from each other
… when we want to take turns
… implying our expectations

dirk: useful for the automotive environment

dd: time for wrapping up the presentation part
… and then moving to the breakout part

[10-min break]

Breakouts

dirk: (gives instructions for breakouts)
… the suggestion from Debbie was to combine some of the talks

Room 1 - Kristiina
Room 2 - Zohar
Room 4 - Ulrike
Room 5 - Paola/Fares

dirk: slides on guidance and notes

Group A (Kristiina Jokinen, Zoom Room 1): https://docs.google.com/presentation/d/1nKTrSX0VmyC1dNyF5e9JegX5Ggwwst4UnlQVR8eL3E8/edit?usp=sharing
Group B (Zohar Gan, Zoom Room 2): https://docs.google.com/presentation/d/1wYd6U3OCj2fFhdKobsRJkWzRPY5knLajxi8o0Csg-8M/edit?usp=sharing
Group C (Fares Abawi, Paola Di Maio, Zoom Room 3): https://docs.google.com/presentation/d/1paQE60SVoe5xmFAHA1cvVmJuVXmdGuMYd0iOGeHwWAI/edit?usp=sharing
Group D (Ulrike Stiefelhagen, Zoom Room 4): https://docs.google.com/presentation/d/1X7Nb5s5fWzd21SX5x5m8-erFFu9FAWNTzGXdDFUCCU4/edit?usp=sharing

dd: dirk has sent an email about how to join the breakout zoom

dirk: did that

plh: btw, this main room will become Room 1

<zohar> I see a message in room 2 saying the host has another meeting in progress and I can't join

<ddahl> @plh zoom says the host has another meeting in progress when I try to join room 2

<plh> hu...

<plh> I'm in room 2

<plh> only room 1 and 2 are now running. not enough participants in room 3 and 4.

[People coming back to the main room]

(discussion about gaze, and also turn taking)

bw: not only gaze but various signals could be used
… that's kind of like playing a board game

fa: detecting attention directed to tasks is needed
… the model should detect that
… e.g., my gaze sometimes indicating something and sometimes not
… we can have multiple different stages
… and the model can ignore some of them depending on the situation
… you can't expect the model to see everything
… the model has no idea about our intention
… there could be extensions based on our intention, like smart glasses
… there are many open questions
… we need formalization
… I'm more concerned about how users operate it
… even if it's very functional, there could still be problems with requiring a specific Web application

kj: we need to define scenarios
… and need to know if the gaze means something for the user's interaction
… if the gaze helps the system identify the intention, it can be a kind of monitor
… but there is another aspect
… what your gaze tells people
… providing some intention to people and the system
… so that they can manage the situation better
… I'm reminded of some work on standardization
… when we build in some technology or system
… some kind of standardization can be achieved in a sense of technology
… it's useful and helpful, but how do we measure the usefulness?

dd: time to wrap up. the other group was talking about accessibility of 3D and Immersive content

zg: media semantic metadata
… it's an important part of the data
… a hybrid approach is helpful to improve privacy and latency
… producing semantic metadata for accessibility
… collaboration within W3C would be useful
… WebVTT, WebVMT, etc.
… using someone's gaze to identify the intention

dd: tx
… we can continue the plenary discussion here at the main channel

Closing

dd: one comment on gaze
… gaze in the real world is difficult to handle
… much bigger than the screen

ec: what you look at is a question

fa: it works as long as you know about the screen
… but it's completely different in a real-world situation
… still, it's quite a good way to segment objects
… useful when you talk about entities
… whether someone is gazing upon something is already a question to be detected
… in a 3D world, segmenting objects is useful
… maybe someone else can talk about this more

kj: segmentation is related to time information
… we probably want to know what kind of gaze pattern is being used
… and how difficult it is to detect
… time-based segmentation is important

fa: definitely
… just wanted to mention the importance within the real world

kaz: probably we need to think about possible use cases for synchronization of multiple data streams

ec: yes, there are synchronization problems
… how do we synchronize multiple data streams?
… it's not just humans vs AI
… we need to think about different time scales

kj: we need to think about some storage
… a helpful memory
… maybe sometimes we need to go back to some point

ec: maybe we need to force the AI to wait
… it would be helpful for more natural conversations

kj: if there are many people, the conversation tends to split into several sub-groups

dd: we have to be careful, since conventions differ from one culture to another
… people get used to different cultures
… it's dangerous to assume one specific culture

ec: yeah

dd: people have different immigration histories
… even in the US

kj: coaching for elderly adults
… we did a PoC experiment to evaluate usefulness
… what topics people wanted to talk about
… and to see how the system should react
… it's community-oriented
… in the Japanese context, the system can know the usage
… in the sort of Western context, systems are kind of tools for information
… emotional understanding of intelligence
… it was interesting when you want to design the system
… it can be one for all
… but personalization and adaptation would be useful
… working in different cultures would require different viewpoints

pdm: not only culture, but there are individual differences as well
… there are a lot of circumstances there

dd: cultural difference based on cultures

fa: in collective cultures
… one would usually look at a small cluster, depending on the specific culture
… or take a wide glance at the whole group

ec: sometimes we use "gaze" to focus on something
… it carries several meanings
… whether we can reliably detect the standard usage of "gaze" is a question

us: the difference between "gaze" and "gesture"
… how we look
… how to track gaze?

fa: we use a webcam
… something called WebGazer for browsers
… the first step is calibration
… you don't force users to purchase special devices

ec: WebGazer?

plh: https://webgazer.cs.brown.edu/

fa: this is more like a PoC

us: we could use gaze tracking
… and the target of the gaze is a square

fa: exactly
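WebGazer reports browser-viewport gaze estimates as {x, y} points through its gaze-listener callback. A sketch of the square-target hit test described above, written so the geometry is testable outside the browser; the target rectangle and the highlight() helper in the comment are hypothetical examples:

```javascript
// True when a gaze estimate {x, y} falls inside an on-screen rectangle
// given as { left, top, width, height } (viewport coordinates).
function gazeHits(gaze, rect) {
  return (
    gaze.x >= rect.left && gaze.x <= rect.left + rect.width &&
    gaze.y >= rect.top && gaze.y <= rect.top + rect.height
  );
}

// In a browser, hooked up to WebGazer (data may be null between estimates):
//   webgazer.setGazeListener((data) => {
//     if (data && gazeHits(data, target)) highlight(target);
//   }).begin();
```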

pdm: can it interact with voice?
… if people have difficulty speaking
… how can gaze be used for voice?

fa: first we focus on 3 things
… turn-taking
… divergence
… and pointing
… that's my proposal essentially

Wrap-up

dd: a lot of interest in "gaze"
… for example, for turn-taking and divergence
… combination with voice for interaction
… sounds like a recommendation from this group
… anything else?

kaz: integration of multiple modalities/resources including "gaze" is important
… also, as Emmett mentioned, time synchronization among those resources would be important for advanced use cases

dd: right
… regarding tomorrow's session

Session 3

dd: 4 topics
… real-time processing
… in-vehicle interaction
… trust/empathy
… beyond screen readers

dirk: tx a lot for your presentations!

kj: thank you very much from me too

dd: if you won't be able to join tomorrow's session, all the presentations will be published later

kj: appreciated

[Session 2 adjourned]

Minutes manually created (not a transcript), formatted by scribe.perl version 248 (Mon Oct 27 20:04:16 2025 UTC).

Diagnostics

Succeeded: s/=/-/

Succeeded: s/GenAI/GenAI and Voice/

Succeeded: s/rrsgent, draft minutes//

Succeeded: s/topic: People coming back to the main room/[People coming back to the main room]/

Succeeded: s/of .../of multiple data streams/

Succeeded: s/thre/there/

Succeeded: s/web gaze/web gazer/

Succeeded: s/rrsagent, draft mintues//

No scribenick or scribe found. Guessed: kaz

Maybe present: bw, dd, dirk, ec, fa, gc, kaz, kj, pdm, plh, ur, us, zg

All speakers: bw, dd, dirk, ec, fa, gc, kaz, kj, pdm, plh, ur, us, zg

Active on IRC: chiaradm, ddahl, kaz, PaolaDM, plh, zohar