17:00:31 RRSAgent has joined #smartagents-main
17:00:35 logging to https://www.w3.org/2026/02/27-smartagents-main-irc
17:00:51 meeting: W3C Workshop on Smart Voice Agents - Session 3
17:01:56 present: KazuyukiAshimura, plh, DeborahDahl, Dirk_Schnelle-Walka, EmmettCoin, PatriciaLee, RajTumuluri, CaseyKennington, FrankieJames, GerardChollet
17:02:20 kim has joined #smartagents-main
17:02:36 present+ SarahWood
17:03:00 present+ JimSaiya
17:03:12 topic: Scene setting
17:03:16 present+ LisaMichaud
17:03:37 present+ GinaSmith
17:03:37 dd: (gives a summary of the previous sessions)
17:03:49 ... (and also gives instructions about the logistics)
17:04:20 ... (asks people to put their full name on the Zoom participants list)
17:05:36 present+ FaresAbawi
17:05:44 present+ KimPatch
17:06:10 topic: Do we need real-time processing capabilities on voice agents? - Casey Kennington
17:06:52 ck: (starts with a demo of a voice agent)
17:08:00 ... (what about speech?)
17:08:32 ... (challenges)
17:09:00 present+ BrianKardell
17:09:08 present+ YashGhelani
17:09:28 present+ UlrikeStiefelhagen
17:09:55 ... (spoken interaction, turn-taking, clarification requests, humans process language level...)
17:10:21 ... (fast, word-level speech setting)
17:12:43 ... (importance of incremental, word-by-word speech processing)
17:14:25 ... (where can I start - incremental dialogue processing)
17:14:36 ... (retico-team)
17:15:33 --> https://github.com/retico-team retico-team
17:16:12 present+ MattShomphe
17:16:27 present+ SmanthaEstoesta
17:16:28 rt: how is this helping turn-taking?
17:16:39 ck: there is a model
17:16:49 ... two microphone channels
17:17:21 ... duplex model by Koji Inoue
17:17:39 present+
17:17:44 ec: incremental recognizer results
17:17:54 fabawi has joined #smartagents-main
17:17:56 ... some sort of engine for trajectory?
17:18:08 ck: Google ASR is incremental
17:18:25 ec: methodology?
17:18:46 ... what are you doing methodology-wise?
17:18:56 ck: not doing our own STS processing
17:20:21 kaz: interested in the proposed timing-handling model
17:20:57 dirk: anything beyond actions?
17:21:06 ... the user completing the input
17:21:22 ck: verbal feedback for English/Japanese
17:21:38 ... there is a model for that purpose
17:21:52 ... you can use retico for that
17:22:00 ... but need to be careful
17:22:20 ... sometimes people stop speaking
17:22:33 ... anyway, there is a model proposed by Koji Inoue
17:22:41 gc: training model for dialog?
17:22:52 ck: it's modular
17:23:25 ... if you're interested in complex systems, you can still use retico
17:24:33 Ulrike has joined #smartagents-main
17:24:41 ... time alignment for multimodal systems
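[[ Editor's sketch of the incremental, word-by-word processing Casey describes: word hypotheses are added as they arrive, revoked when the recognizer revises them, and committed at the end of the turn. A minimal, self-contained Python illustration; the class and method names are hypothetical and are not the retico API.

from dataclasses import dataclass
from typing import List

@dataclass
class WordIU:
    """One word-level incremental unit (IU)."""
    text: str
    committed: bool = False

class IncrementalNLU:
    """Consumes word-level updates and keeps a running hypothesis."""
    def __init__(self):
        self.words: List[WordIU] = []

    def add(self, word):
        # A new partial word hypothesis arrived from the recognizer.
        self.words.append(WordIU(word))
        self.on_update()

    def revoke(self):
        # The recognizer revised its hypothesis; drop the last uncommitted word.
        if self.words and not self.words[-1].committed:
            self.words.pop()
            self.on_update()

    def commit(self):
        # End of the turn: the hypothesis is now final.
        for w in self.words:
            w.committed = True

    def on_update(self):
        # A real system would update intent and turn-taking decisions here.
        print("partial hypothesis:", " ".join(w.text for w in self.words))

nlu = IncrementalNLU()
for update, word in [("add", "turn"), ("add", "of"), ("revoke", None),
                     ("add", "on"), ("add", "the"), ("add", "lights")]:
    nlu.add(word) if update == "add" else nlu.revoke()
nlu.commit()  # final hypothesis: "turn on the lights"

]]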
17:25:36 topic: Voice Agents for In-Vehicle Interaction - Frankie James
17:25:59 fj: (describes her background in the automotive industry)
17:26:12 ... (modern vehicle infotainment with touch screens)
17:26:40 ... (but how usable?)
17:27:48 ... (example of a Chevrolet)
17:28:22 ... (how to lock the door using the GUI)
17:28:53 ... (6 screens to be used)
17:29:08 ... (navigation control is not allowed)
17:29:45 ... (touchscreens/buttons can't be the final word in vehicle HMI)
17:29:56 jsaiya has joined #smartagents-main
17:30:12 ... (that's why voice agents!)
17:30:31 present+ BryanVuong
17:31:42 ... (can gain information without distraction)
17:31:50 ... (open research issues)
17:33:03 ... (difficulty with recognition in the vehicle)
17:33:16 ... (focus on the driving task)
17:33:34 ... (limited attention for secondary tasks)
17:34:50 sw: Why did we not just go from the older control setup with physical buttons directly to voice interactions and skip the touch-screen-only stage? Were the voice channel limitations the primary reason?
17:34:53 When should systems not speak when driving? https://dl.acm.org/doi/10.1145/2667317.2667332
17:35:25 fj: (describes the history)
17:35:41 sw: safety trade-off
17:36:12 fj: it's an ongoing question
17:36:18 Another one: https://dl.acm.org/doi/10.1145/2663204.2663244
17:36:43 ms: are there processing limitations too?
17:37:04 fj: getting more and more onboard computing
17:37:10 ms: interesting
17:37:31 ... are there ways to use phones to operate the vehicle?
17:37:41 ... don't really need a third-party device?
17:37:48 fj: good question
17:37:55 ... it's actually being looked at
17:38:32 ... smartphone vendors would like to take over more and more capabilities
17:38:47 ... questions around onboard vs. offboard
17:39:09 rt: can still be paired
17:39:20 ... also you can handle multimodal cases
17:40:31 gc: regarding autonomous vehicles, there are various cases
17:40:55 fj: think they're doing a good job
17:41:17 gc: we can have microphone arrays in the vehicle
17:41:25 ... it's much improved these days
17:41:30 fj: yeah
17:41:37 ... but it costs a lot
17:41:56 ck: I was working on research on this
17:42:09 ... if the driver is driving on a straight road...
17:42:37 ... if someone is sitting next to you, they can stop talking
17:42:45 ... I put several resources on IRC
17:42:54 ... think the answer is incremental processing
17:43:04 ... the system may stop talking depending on the situation
17:43:05 sensingturtle has joined #smartagents-main
17:43:09 fj: glad to know
17:44:47 kaz: what about multimodality?
17:44:50 fj: the right way to go
17:45:26 ... tactile feedback to be used, like vibration
17:46:21 ec: when I use Google Maps, there is a button using a different recognizer
17:46:32 fj: good point
17:47:18 topic: Trust & Empathy with Multimodal Assistants - Raj Tumuluri
17:48:21 rt: (engineering empathy in multimodal AI)
17:49:25 ... ("cold" capability gap)
17:51:42 ... (e-TRICE: human-centric reliability model)
17:56:46 Frankie has joined #smartagents-main
17:58:00 ... ("warm" handshaking)
17:58:12 ... ("sentient" agent)
17:59:50 ... (shows examples)
18:01:19 ... (creating digital twins for humans)
18:01:47 dd: a couple of minutes for questions?
18:02:05 ... 5 mins for demo
18:02:12 rt: (shows a live demo)
18:06:23 sw: How does this work with people who are moving around a lot at baseline, like kids, for example? Some kids in classrooms have a hard time standing still.
18:08:09 topic: Beyond Screen Readers: Standardizing Embeddable Voice Agents for Universal Web Accessibility - Bryan Vuong
18:08:35 bv: (gives a short self-intro)
18:08:42 ... (accessibility gap)
18:10:14 ... (introducing CoBrowse AI)
18:10:37 ... (describes how it works)
18:13:01 ... (intelligent navigation)
18:14:40 ... (contextual Q&A and search)
18:16:05 rt: how to detect which product is being referred to?
18:16:43 dd: questions to be handled later
18:17:02 bv: (action & automation)
18:17:58 ... (shows a demo)
18:18:33 ... (CoBrowse AI Chat with text and voice)
18:21:48 sw: How did you engage the blind community in the product development?
18:22:09 bv: we ended up with the problem of understanding
18:22:21 ... what the pain point is
18:22:44 bk: local model?
18:22:55 bv: cloud service
18:23:01 ... the local component is quite light
18:23:24 bk: 2 more questions
18:24:08 ... what information is used?
18:24:32 ... doing things in the proper places
18:25:01 ... do you support other mechanisms like Android touch?
18:25:16 bv: the agent can provide information to the user
18:25:49 ... we focus on summarization
18:26:01 ... on the 2nd question, a chat interface is used
18:26:05 ... users can type in
18:26:25 ... the voice interface is useful for blind people
18:27:04 sw: Are there links where I could read more about the user research results with the blind community?
18:27:30 bv: we don't really document it, but I can share other pointers
18:27:47 kp: In the demo, what would have happened if the user hadn’t thought to ask if there were any errors?
18:28:03 bv: if there is an error, it can detect it
18:28:24 ... then get back to the user
18:28:31 ... users don't have to ask about that every time
18:29:05 gc: you have very quick speech
18:29:37 bv: for blind people, very fast speech is used
18:29:44 ... the user can interact with the agent
18:29:55 gc: tips on speeding up?
18:30:14 ... about intelligibility
18:30:28 bv: users can change the speed
18:30:57 ... if it's too fast the user can't understand
18:31:42 kaz: what data model is used inside? any standardization?
18:31:50 bv: using the DOM structure
18:32:00 ... with some optimization
18:32:46 kaz: I asked about that because there are several standards from W3C
18:32:53 ... we can talk about the details later
18:33:20 dd: 10 min break and then breakout sessions
18:33:48 [break till 45 mins past the hour]
18:33:54 rrsagent, make log public
18:34:01 rrsagent, draft minutes
18:34:02 I have made the request to generate https://www.w3.org/2026/02/27-smartagents-main-minutes.html kaz
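[[ Editor's sketch related to Bryan's answer above about using the DOM structure with some optimization: one way a voice agent could reduce a page to just its headings and interactive elements before summarizing or acting on it. A hypothetical Python illustration; the HTML, element selection, and output format are not CoBrowse AI's actual implementation.

from html.parser import HTMLParser

class PageModel(HTMLParser):
    """Collects a compact list of headings and interactive elements."""
    INTERACTIVE = {"a", "button", "input", "select", "textarea"}

    def __init__(self):
        super().__init__()
        self.items = []        # [tag, accessible name]
        self._current = None   # index of the item currently collecting text

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.INTERACTIVE or tag in {"h1", "h2", "h3"}:
            # Prefer an explicit accessible name when one is present.
            name = attrs.get("aria-label") or attrs.get("value")
            self.items.append([tag, name or ""])
            # Fall back to collecting inner text only if there is no explicit name.
            self._current = None if name else len(self.items) - 1

    def handle_data(self, data):
        if self._current is not None and data.strip():
            self.items[self._current][1] += data.strip()

    def handle_endtag(self, tag):
        self._current = None

page = PageModel()
page.feed("""
<h1>Checkout</h1>
<button aria-label="Apply coupon">Apply</button>
<a href="/cart">Back to cart</a>
""")
for tag, name in page.items:
    print(f"{tag}: {name}")  # h1: Checkout / button: Apply coupon / a: Back to cart

]]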
18:47:44 topic: Breakouts
18:47:48 dd: (gives instructions)
18:48:37 plh: we have 4 rooms for breakouts
18:49:24 dd: Bryan Vuong has left
18:49:33 dirk: so 3 breakouts
18:50:06 plh: when should we come back?
18:50:21 dirk: half past the hour
18:50:52 dd: Philippe, you'll join all the rooms?
18:50:55 plh: yes
18:51:05 dd: ok. see you in 45 mins
18:51:09 [breakouts]
18:51:14 rrsagent, draft minutes
18:51:15 I have made the request to generate https://www.w3.org/2026/02/27-smartagents-main-minutes.html kaz
18:55:27 jsaiya has joined #smartagents-main
19:11:38 jsaiya has joined #smartagents-main
19:32:02 Zakim has left #smartagents-main
19:36:18 kaz has joined #smartagents-main
19:36:52 topic: Breakout Results
19:37:01 rrsagent, draft minutes
19:37:03 I have made the request to generate https://www.w3.org/2026/02/27-smartagents-main-minutes.html kaz
19:37:23 dd: we have 40 mins for breakout results
19:38:04 subtopic: Breakout 1
19:38:17 ck: Kaz reminded me of W3C standards
19:38:31 ... we talked about when the model makes mistakes
19:38:46 ... how a language model can handle that incrementally
19:38:57 ... general applications using robots
19:39:05 ... also talked about emotions
19:39:53 subtopic: Breakout 2
19:40:05 fj: talked about various topics
19:40:29 ... the concept of teaching people
19:40:38 ... do we put in the concept of "car"?
19:41:17 ... also the concept of a teachable moment
19:41:38 ... then
19:41:41 ... distraction
19:41:59 ... if I get a speech interface in the vehicle
19:42:13 ... it may misunderstand what I want
19:42:30 ... looking at when the speech agent misunderstands, and teaching the agent
19:43:03 ... the speech command may not be recognized in a noisy environment
19:43:11 ... then
19:43:22 ... what should the car be responsible for?
19:43:33 ... the car taking over functionality from the phone
19:43:50 ... but the phone actually knows the content
19:44:06 ... and we had a really interesting discussion on how to collaborate with the voice agents
19:44:15 ... a more collaborative approach
19:44:21 ... working in parallel
19:45:13 sw: the general question is privacy
19:45:29 ... worry about voice fakes
19:45:46 ec: interested in your own voice?
19:46:12 dd: wondering about car voice recognition
19:46:39 fj: good reasons for onboard processing
19:46:43 ... due to latency
19:47:14 ec: is that onboard recognition?
19:47:21 fj: a small model on board
19:47:37 dd: we can probably move to a broader discussion
19:47:50 ... what have we learned from the whole 3 sessions?
19:48:15 ec: talking about things like multiple agents
19:48:21 ... the incremental approach
19:48:28 ... how to use them in vehicles
19:48:33 ... various aspects there
19:48:53 ... we all understand it better
19:49:08 fj: one thing we discussed during our breakout
19:49:28 ... questions from when voice recognition was new are still relevant
19:49:49 dd: my undergraduate major was psychology
19:49:54 ... how people figure out how things work
19:50:00 ... a lot of study was done there
19:50:41 ec: basically, many of the complex models come from simple ones
19:51:04 fj: appreciate you saying so, Debbie
19:51:45 dd: also observed Kaz's points about what W3C should do for standardization
19:52:23 ... then the majority of the presentations were about what can be done with LLMs, etc.
19:52:35 ... practical use cases
19:52:58 ... there is still much to be done
19:53:18 ... we should be thinking about what should be standardized at W3C
19:53:33 kp: using gaze systems
19:53:38 ... and speech
19:53:49 ... that's also a cool thing to be handled
19:54:00 ... users are doing a lot with that
19:54:28 dd: another point about playing with LLMs
19:54:42 ... maybe we should have a standard API for LLMs
19:55:33 ec: all the recognizers had different interfaces years ago
19:55:42 ... but pretty good now
19:55:45 ... much improved
19:55:50 dd: thanks to W3C :)
19:56:12 ... that's my impression
19:56:41 ec: remember old browsers, e.g., Mosaic, IE, ...
19:57:05 bk: UA compatibility
19:57:16 ... any of the browsers
19:57:43 ... they're getting in touch
19:57:56 dd: also a lot of discussion about timing
19:58:05 ... a very interesting discussion
19:58:13 ec: timing of events?
19:58:20 dd: no, speech timing
19:58:40 ... using incremental recognition
19:58:52 ck: a big question about multimodal fusion
19:59:21 ec: we have that problem with humans as well
19:59:36 ... some of the signals are significantly delayed
20:00:54 kaz: W3C was working on a multimodal fusion standard
20:01:05 ... also a state chart model as a concrete handler
20:01:21 ... it would be nice to revisit those mechanisms based on advanced use cases
20:01:27 ... like Casey mentioned
20:01:34 ec: @@@
20:01:46 dd: multimodal fusion
20:01:56 ... EMMA was a data model for that purpose
20:02:24 ec: would it make sense to have a slot?
20:02:29 dd: yeah
20:02:48 ... we don't handle innovation itself, though
20:03:12 ... what should the standard be for the technology people are playing around with?
20:03:26 ... some of the research areas might be candidates to be standardized
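[[ Editor's sketch relating to the multimodal fusion and EMMA discussion above: a toy example of timestamp-based late fusion that resolves a deictic word in the speech stream against the pointing event closest in time. The events, threshold, and data structures are illustrative only; W3C EMMA is an XML data model for annotating such interpretations (mode, confidence, timestamps) rather than a fusion algorithm.

from dataclasses import dataclass

@dataclass
class Event:
    mode: str          # "voice" or "pointing"
    value: str         # recognized word, or id of the referenced object
    t_ms: int          # timestamp in milliseconds
    confidence: float

speech = [Event("voice", "delete", 1000, 0.93),
          Event("voice", "that", 1350, 0.88),
          Event("voice", "file", 1600, 0.90)]
pointing = [Event("pointing", "file-42", 1420, 0.97),
            Event("pointing", "file-07", 2900, 0.95)]

MAX_SKEW_MS = 500  # how far apart the two signals may be and still be fused

def fuse(speech, pointing):
    """Resolve each deictic word ("that", "this") to the nearest pointing event."""
    resolved = []
    for w in speech:
        if w.value in {"that", "this"}:
            nearest = min(pointing, key=lambda p: abs(p.t_ms - w.t_ms))
            if abs(nearest.t_ms - w.t_ms) <= MAX_SKEW_MS:
                resolved.append((w.value, nearest.value,
                                 min(w.confidence, nearest.confidence)))
    return resolved

print(fuse(speech, pointing))  # [('that', 'file-42', 0.88)]

]]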
20:03:57 dd: a couple of things before closing
20:04:06 ... what should be done as the next step?
20:04:19 topic: Next Steps
20:04:28 dd: got various key takeaways
20:04:58 ... also you can send feedback to the mailing list of the workshop PC
20:05:06 ... which you used for paper submission
20:05:30 ... it's on the workshop page too
20:05:51 ... recordings will also be available
20:05:57 ... then, what's next?
20:06:14 ... there are at least 4 CGs relevant to the topics discussed during the workshop
20:06:46 ... voice interaction, autonomous agents on the Web, AI agent protocol, and semantic 3D content accessibility
20:06:59 ... we can also start a new CG if needed
20:07:11 ... the process is very lightweight
20:07:19 ... also
20:07:31 ... Philippe mentioned the W3C Breakouts Day in March
20:07:42 ... the deadline for proposals is March 10
20:08:25 [[ 25 March, 13:00-15:00 UTC (two 1-hour slots), 26 March, 21:00-23:00 UTC (two 1-hour slots) ]]
20:08:30 https://github.com/w3c/breakouts-day-2026
20:08:45 dd: then
20:09:05 ... W3C TPAC 2026 in October
20:09:27 ... hybrid meeting (F2F in Dublin and remote by Zoom)
20:09:37 ... then
20:10:04 ... a possible special issue of the Journal on Multimodal Interfaces
20:10:25 ... the last slide is for thanking all the PC members
20:10:43 ... speakers and attendees!
20:11:08 ... the archived recordings will be available at some point on YouTube
20:11:09 plh: yes
20:11:19 dirk: thanks from me too
20:11:43 plh: thanks, Debbie and Dirk, for chairing
20:11:59 rrsagent, draft minutes
20:12:01 I have made the request to generate https://www.w3.org/2026/02/27-smartagents-main-minutes.html kaz
20:12:33 dd: btw, if you have a template for the workshop report, that would be nice
20:13:01 plh: can refer to Brian's brief report :)
20:13:10 Report: we had a workshop.
20:13:10 It was good.
20:13:10 There are recordings.
20:13:12 :)
20:13:31 [workshop adjourned]
20:13:43 rrsagent, draft minutes
20:13:44 I have made the request to generate https://www.w3.org/2026/02/27-smartagents-main-minutes.html kaz