17:00:31 RRSAgent has joined #smartagents-main
17:00:35 logging to https://www.w3.org/2026/02/27-smartagents-main-irc
17:00:51 meeting: W3C Workshop on Smart Voice Agents - Session 3
17:01:56 present: KazuyukiAshimura, plh, DeborahDahl, Dirk_Schnelle-Walka, EmmettCoin, PatriciaLee, RajTumuluri, CaseyKennington, FrankieJames, GerardChollet
17:02:20 kim has joined #smartagents-main
17:02:36 present+ SarahWood
17:03:00 present+ JimSaiya
17:03:12 topic: Scene setting
17:03:16 present+ LisaMichaud
17:03:37 present+ GinaSmith
17:03:37 dd: (gives a summary of the previous sessions)
17:03:49 ... (and also gives instructions about the logistics)
17:04:20 ... (asks people to put their full name on the Zoom participants list)
17:05:36 present+ FaresAbawi
17:05:44 present+ KimPatch
17:06:10 topic: Do we need real-time processing capabilities on voice agents? - Casey Kennington
17:06:52 ck: (starts with a demo of a voice agent)
17:08:00 ... (what about speech?)
17:08:32 ... (challenges)
17:09:00 present+ BrianKardell
17:09:08 present+ YashGhelani
17:09:28 present+ UlrikeStiefelhagen
17:09:55 ... (spoken interaction, turn-taking, clarification requests, humans process language level...)
17:10:21 ... (fast, word-level speech setting)
17:12:43 ... (importance of incremental, word-by-word speech processing)
17:14:25 ... (where can I start - incremental dialogue processing)
17:14:36 ... (retico-team)
17:15:33 --> https://github.com/retico-team retico-team
17:16:12 present+ MattShomphe
17:16:27 present+ SmanthaEstoesta
17:16:28 rt: how is this helping turn-taking?
17:16:39 ck: there is a model
17:16:49 ... two microphone channels
17:17:21 ... duplex model by Koji Inoue
17:17:39 present+
17:17:44 ec: incremental recognizer results
17:17:54 fabawi has joined #smartagents-main
17:17:56 ... some sort of engine for trajectory?
17:18:08 ck: Google ASR is incremental
17:18:25 ec: methodology?
17:18:46 ... what are you doing methodology-wise?
17:18:56 ck: not doing our own STS processing
17:20:21 kaz: interested in the proposed timing-handling model
17:20:57 dirk: anything beyond actions?
17:21:06 ... the user completing the input
17:21:22 ck: verbal feedback for English/Japanese
17:21:38 ... there is a model for that purpose
17:21:52 ... you can use retico for that
17:22:00 ... but need to be careful
17:22:20 ... sometimes people stop speaking
17:22:33 ... anyway, there is a model proposed by Koji Inoue
17:22:41 gc: training model for dialog?
17:22:52 ck: it's modular
17:23:25 ... if you're interested in complex systems, you can still use retico
17:24:33 Ulrike has joined #smartagents-main
17:24:41 ... time alignment for multimodal systems
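[[ Editor's sketch of the incremental, word-by-word processing Casey describes: word hypotheses are added as they arrive, revoked when the recognizer revises them, and committed at the end of the turn. A minimal, self-contained Python illustration; the class and method names are hypothetical and are not the retico API.

from dataclasses import dataclass
from typing import List

@dataclass
class WordIU:
    """One word-level incremental unit (IU)."""
    text: str
    committed: bool = False

class IncrementalNLU:
    """Consumes word-level updates and keeps a running hypothesis."""
    def __init__(self):
        self.words: List[WordIU] = []

    def add(self, word):
        # A new partial word hypothesis arrived from the recognizer.
        self.words.append(WordIU(word))
        self.on_update()

    def revoke(self):
        # The recognizer revised its hypothesis; drop the last uncommitted word.
        if self.words and not self.words[-1].committed:
            self.words.pop()
            self.on_update()

    def commit(self):
        # End of the turn: the hypothesis is now final.
        for w in self.words:
            w.committed = True

    def on_update(self):
        # A real system would update intent and turn-taking decisions here.
        print("partial hypothesis:", " ".join(w.text for w in self.words))

nlu = IncrementalNLU()
for update, word in [("add", "turn"), ("add", "of"), ("revoke", None),
                     ("add", "on"), ("add", "the"), ("add", "lights")]:
    nlu.add(word) if update == "add" else nlu.revoke()
nlu.commit()  # final hypothesis: "turn on the lights"

]]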
17:25:36 topic: Voice Agents for In-Vehicle Interaction - Frankie James
17:25:59 fj: (describes her background in the automotive industry)
17:26:12 ... (modern vehicle infotainment with touch screens)
17:26:40 ... (but how usable?)
17:27:48 ... (example of a Chevrolet)
17:28:22 ... (how to lock the door using the GUI)
17:28:53 ... (6 screens to be used)
17:29:08 ... (navigation control is not allowed)
17:29:45 ... (touchscreens/buttons can't be the final word in vehicle HMI)
17:29:56 jsaiya has joined #smartagents-main
17:30:12 ... (that's why voice agents!)
17:30:31 present+ BryanVuong
17:31:42 ... (can gain information without distraction)
17:31:50 ... (open research issues)
17:33:03 ... (difficulty with recognition in the vehicle)
17:33:16 ... (focus on the driving task)
17:33:34 ... (limited attention for secondary tasks)
17:34:50 sw: Why did we not just go from the older control setup with physical buttons directly to voice interactions and skip the touch-screen-only stage? Were the voice channel limitations the primary reason?
17:34:53 When should systems not speak when driving? https://dl.acm.org/doi/10.1145/2667317.2667332
17:35:25 fj: (describes the history)
17:35:41 sw: safety trade-off
17:36:12 fj: it's an ongoing question
17:36:18 Another one: https://dl.acm.org/doi/10.1145/2663204.2663244
17:36:43 ms: are there processing limitations too?
17:37:04 fj: getting more and more onboard computing
17:37:10 ms: interesting
17:37:31 ... are there ways to use phones to operate the vehicle?
17:37:41 ... don't really need a third-party device?
17:37:48 fj: good question
17:37:55 ... it's actually being looked at
17:38:32 ... smartphone vendors would like to take over more and more capabilities
17:38:47 ... questions around onboard vs. offboard
17:39:09 rt: can still be paired
17:39:20 ... also you can handle multimodal cases
17:40:31 gc: regarding autonomous vehicles, there are various cases
17:40:55 fj: think they're doing a good job
17:41:17 gc: we can have microphone arrays in the vehicle
17:41:25 ... it's much improved these days
17:41:30 fj: yeah
17:41:37 ... but it costs a lot
17:41:56 ck: I was working on research on this
17:42:09 ... if the driver is driving on a straight road...
17:42:37 ... if someone is sitting next to you, they can stop talking
17:42:45 ... I put several resources on IRC
17:42:54 ... think the answer is incremental processing
17:43:04 ... the system may stop talking depending on the situation
17:43:05 sensingturtle has joined #smartagents-main
17:43:09 fj: glad to know
17:44:47 kaz: what about multimodality?
17:44:50 fj: the right way to go
17:45:26 ... tactile feedback to be used, like vibration
17:46:21 ec: when I use Google Maps, there is a button using a different recognizer
17:46:32 fj: good point
17:47:18 topic: Trust & Empathy with Multimodal Assistants - Raj Tumuluri
17:48:21 rt: (engineering empathy in multimodal AI)
17:49:25 ... ("cold" capability gap)
17:51:42 ... (e-TRICE: human-centric reliability model)
17:56:46 Frankie has joined #smartagents-main
17:58:00 ... ("warm" handshaking)
17:58:12 ... ("sentient" agent)
17:59:50 ... (shows examples)
18:01:19 ... (creating digital twins for humans)
18:01:47 dd: a couple of minutes for questions?
18:02:05 ... 5 mins for demo
18:02:12 rt: (shows a live demo)
18:06:23 sw: How does this work with people who are moving around a lot at baseline, like kids, for example? Some kids in classrooms have a hard time standing still.
18:08:09 topic: Beyond Screen Readers: Standardizing Embeddable Voice Agents for Universal Web Accessibility - Bryan Vuong
18:08:35 bv: (gives a short self-intro)
18:08:42 ... (accessibility gap)
18:10:14 ... (introducing CoBrowse AI)
18:10:37 ... (describes how it works)
18:13:01 ... (intelligent navigation)
18:14:40 ... (contextual Q&A and search)
18:16:05 rt: how to detect which product is being referred to?
18:16:43 dd: questions to be handled later
18:17:02 bv: (action & automation)
18:17:58 ... (shows a demo)
18:18:33 ... (CoBrowse AI Chat with text and voice)
18:21:48 sw: How did you engage the blind community in the product development?
18:22:09 bv: we ended up with the problem of understanding
18:22:21 ... what the pain point is
18:22:44 bk: local model?
18:22:55 bv: cloud service
18:23:01 ... the local component is quite light
18:23:24 bk: 2 more questions
18:24:08 ... what information is used?
18:24:32 ... doing things in the proper places
18:25:01 ... do you support other mechanisms like Android touch?
18:25:16 bv: the agent can provide information to the user
18:25:49 ... we focus on summarization
18:26:01 ... on the 2nd question, a chat interface is used
18:26:05 ... users can type in
18:26:25 ... the voice interface is useful for blind people
18:27:04 sw: Are there links where I could read more about the user research results with the blind community?
18:27:30 bv: we don't really document it, but I can share other pointers
18:27:47 kp: In the demo, what would have happened if the user hadn’t thought to ask if there were any errors?
18:28:03 bv: if there is an error, it can detect it
18:28:24 ... then get back to the user
18:28:31 ... users don't have to ask about that every time
18:29:05 gc: you have very quick speech
18:29:37 bv: for blind people, very fast speech is used
18:29:44 ... the user can interact with the agent
18:29:55 gc: tips on speeding up?
18:30:14 ... about intelligibility
18:30:28 bv: users can change the speed
18:30:57 ... if it's too fast the user can't understand
18:31:42 kaz: what data model is used inside? any standardization?
18:31:50 bv: using the DOM structure
18:32:00 ... with some optimization
18:32:46 kaz: I asked about that because there are several standards from W3C
18:32:53 ... we can talk about the details later
18:33:20 dd: 10 min break and then breakout sessions
18:33:48 [break till 45 mins past the hour]
18:33:54 rrsagent, make log public
18:34:01 rrsagent, draft minutes
18:34:02 I have made the request to generate https://www.w3.org/2026/02/27-smartagents-main-minutes.html kaz
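[[ Editor's sketch related to Bryan's answer above about using the DOM structure with some optimization: one way a voice agent could reduce a page to just its headings and interactive elements before summarizing or acting on it. A hypothetical Python illustration; the HTML, element selection, and output format are not CoBrowse AI's actual implementation.

from html.parser import HTMLParser

class PageModel(HTMLParser):
    """Collects a compact list of headings and interactive elements."""
    INTERACTIVE = {"a", "button", "input", "select", "textarea"}

    def __init__(self):
        super().__init__()
        self.items = []        # [tag, accessible name]
        self._current = None   # index of the item currently collecting text

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.INTERACTIVE or tag in {"h1", "h2", "h3"}:
            # Prefer an explicit accessible name when one is present.
            name = attrs.get("aria-label") or attrs.get("value")
            self.items.append([tag, name or ""])
            # Fall back to collecting inner text only if there is no explicit name.
            self._current = None if name else len(self.items) - 1

    def handle_data(self, data):
        if self._current is not None and data.strip():
            self.items[self._current][1] += data.strip()

    def handle_endtag(self, tag):
        self._current = None

page = PageModel()
page.feed("""
<h1>Checkout</h1>
<button aria-label="Apply coupon">Apply</button>
<a href="/cart">Back to cart</a>
""")
for tag, name in page.items:
    print(f"{tag}: {name}")  # h1: Checkout / button: Apply coupon / a: Back to cart

]]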
18:47:44 topic: Breakouts
18:47:48 dd: (gives instructions)
18:48:37 plh: we have 4 rooms for breakouts
18:49:24 dd: Bryan Vuong has left
18:49:33 dirk: so 3 breakouts
18:50:06 plh: when should we come back?
18:50:21 dirk: half past the hour
18:50:52 dd: Philippe, you'll join all the rooms?
18:50:55 plh: yes
18:51:05 dd: ok. see you in 45 mins
18:51:09 [breakouts]
18:51:14 rrsagent, draft minutes
18:51:15 I have made the request to generate https://www.w3.org/2026/02/27-smartagents-main-minutes.html kaz
18:55:27 jsaiya has joined #smartagents-main
19:11:38 jsaiya has joined #smartagents-main
19:32:02 Zakim has left #smartagents-main
19:36:18 kaz has joined #smartagents-main
19:36:52 topic: Breakout Results
19:37:01 rrsagent, draft minutes
19:37:03 I have made the request to generate https://www.w3.org/2026/02/27-smartagents-main-minutes.html kaz
19:37:23 dd: we have 40 mins for breakout results
19:38:04 subtopic: Breakout 1
19:38:17 ck: Kaz reminded me of W3C standards
19:38:31 ... we talked about when the model makes mistakes
19:38:46 ... how a language model can handle that incrementally
19:38:57 ... general applications using robots
19:39:05 ... also talked about emotions
19:39:53 subtopic: Breakout 2
19:40:05 fj: talked about various topics
19:40:29 ... the concept of teaching people
19:40:38 ... do we put in the concept of "car"?
19:41:17 ... also the concept of a teachable moment
19:41:38 ... then
19:41:41 ... distraction
19:41:59 ... if I get a speech interface in the vehicle
19:42:13 ... it may misunderstand what I want
19:42:30 ... looking at when the speech agent misunderstands, and teaching the agent
19:43:03 ... the speech command may not be recognized in a noisy environment
19:43:11 ... then
19:43:22 ... what should the car be responsible for?
19:43:33 ... the car taking over functionality from the phone
19:43:50 ... but the phone actually knows the content
19:44:06 ... and we had a really interesting discussion on how to collaborate with the voice agents
19:44:15 ... a more collaborative approach
19:44:21 ... working in parallel
19:45:13 sw: the general question is privacy
19:45:29 ... worry about voice fakes
19:45:46 ec: interested in your own voice?
19:46:12 dd: wondering about car voice recognition
19:46:39 fj: good reasons for onboard processing
19:46:43 ... due to latency
19:47:14 ec: is that onboard recognition?
19:47:21 fj: a small model on board
19:47:37 dd: we can probably move to a broader discussion
19:47:50 ... what have we learned from the whole 3 sessions?
19:48:15 ec: talking about things like multiple agents
19:48:21 ... the incremental approach
19:48:28 ... how to use them in vehicles
19:48:33 ... various aspects there
19:48:53 ... we all understand it better
19:49:08 fj: one thing we discussed during our breakout
19:49:28 ... questions from when voice recognition was new are still relevant
19:49:49 dd: my undergraduate major was psychology
19:49:54 ... how people figure out how things work
19:50:00 ... a lot of study was done there
19:50:41 ec: basically, many of the complex models come from simple ones
19:51:04 fj: appreciate you saying so, Debbie
19:51:45 dd: also observed Kaz's points about what W3C should do for standardization
19:52:23 ... then the majority of the presentations were about what can be done with LLMs, etc.
19:52:35 ... practical use cases
19:52:58 ... there is still much to be done
19:53:18 ... we should be thinking about what should be standardized at W3C
19:53:33 kp: using gaze systems
19:53:38 ... and speech
19:53:49 ... that's also a cool thing to be handled
19:54:00 ... users are doing a lot with that
19:54:28 dd: another point about playing with LLMs
19:54:42 ... maybe we should have a standard API for LLMs
19:55:33 ec: all the recognizers had different interfaces years ago
19:55:42 ... but pretty good now
19:55:45 ... much improved
19:55:50 dd: thanks to W3C :)
19:56:12 ... that's my impression
19:56:41 ec: remember old browsers, e.g., Mosaic, IE, ...
19:57:05 bk: UA compatibility
19:57:16 ... any of the browsers
19:57:43 ... they're getting in touch
19:57:56 dd: also a lot of discussion about timing
19:58:05 ... a very interesting discussion
19:58:13 ec: timing of events?
19:58:20 dd: no, speech timing
19:58:40 ... using incremental recognition
19:58:52 ck: a big question about multimodal fusion
19:59:21 ec: we have that problem with humans as well
19:59:36 ... some of the signals are significantly delayed
20:00:54 kaz: W3C was working on a multimodal fusion standard
20:01:05 ... also a state chart model as a concrete handler
20:01:21 ... it would be nice to revisit those mechanisms based on advanced use cases
20:01:27 ... like Casey mentioned
20:01:34 ec: @@@
20:01:46 dd: multimodal fusion
20:01:56 ... EMMA was a data model for that purpose
20:02:24 ec: would it make sense to have a slot?
20:02:29 dd: yeah
20:02:48 ... we don't handle innovation itself, though
20:03:12 ... what should the standard be for the technology people are playing around with?
20:03:26 ... some of the research areas might be candidates to be standardized
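[[ Editor's sketch relating to the multimodal fusion and EMMA discussion above: a toy example of timestamp-based late fusion that resolves a deictic word in the speech stream against the pointing event closest in time. The events, threshold, and data structures are illustrative only; W3C EMMA is an XML data model for annotating such interpretations (mode, confidence, timestamps) rather than a fusion algorithm.

from dataclasses import dataclass

@dataclass
class Event:
    mode: str          # "voice" or "pointing"
    value: str         # recognized word, or id of the referenced object
    t_ms: int          # timestamp in milliseconds
    confidence: float

speech = [Event("voice", "delete", 1000, 0.93),
          Event("voice", "that", 1350, 0.88),
          Event("voice", "file", 1600, 0.90)]
pointing = [Event("pointing", "file-42", 1420, 0.97),
            Event("pointing", "file-07", 2900, 0.95)]

MAX_SKEW_MS = 500  # how far apart the two signals may be and still be fused

def fuse(speech, pointing):
    """Resolve each deictic word ("that", "this") to the nearest pointing event."""
    resolved = []
    for w in speech:
        if w.value in {"that", "this"}:
            nearest = min(pointing, key=lambda p: abs(p.t_ms - w.t_ms))
            if abs(nearest.t_ms - w.t_ms) <= MAX_SKEW_MS:
                resolved.append((w.value, nearest.value,
                                 min(w.confidence, nearest.confidence)))
    return resolved

print(fuse(speech, pointing))  # [('that', 'file-42', 0.88)]

]]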
20:03:57 dd: a couple of things before closing
20:04:06 ... what should be done as the next step?
20:04:19 topic: Next Steps
20:04:28 dd: got various key takeaways
20:04:58 ... also you can send feedback to the mailing list of the workshop PC
20:05:06 ... which you used for paper submission
20:05:30 ... it's on the workshop page too
20:05:51 ... recordings will also be available
20:05:57 ... then, what's next?
20:06:14 ... there are at least 4 CGs relevant to the topics discussed during the workshop
20:06:46 ... voice interaction, autonomous agents on the Web, AI agent protocol, and semantic 3D content accessibility
20:06:59 ... we can also start a new CG if needed
20:07:11 ... the process is very lightweight
20:07:19 ... also
20:07:31 ... Philippe mentioned the W3C Breakouts Day in March
20:07:42 ... the deadline for proposals is March 10
20:08:25 [[ 25 March, 13:00-15:00 UTC (two 1-hour slots), 26 March, 21:00-23:00 UTC (two 1-hour slots) ]]
20:08:30 https://github.com/w3c/breakouts-day-2026
20:08:45 dd: then
20:09:05 ... W3C TPAC 2026 in October
20:09:27 ... hybrid meeting (F2F in Dublin and remote by Zoom)
20:09:37 ... then
20:10:04 ... a possible special issue of the Journal on Multimodal Interfaces
20:10:25 ... the last slide is for thanking all the PC members
20:10:43 ... speakers and attendees!
20:11:08 ... the archived recordings will be available at some point on YouTube
20:11:09 plh: yes
20:11:19 dirk: thanks from me too
20:11:43 plh: thanks, Debbie and Dirk, for chairing
20:11:59 rrsagent, draft minutes
20:12:01 I have made the request to generate https://www.w3.org/2026/02/27-smartagents-main-minutes.html kaz
20:12:33 dd: btw, if you have a template for the workshop report, that would be nice
20:13:01 plh: can refer to Brian's brief report :)
20:13:10 Report: we had a workshop.
20:13:10 It was good.
20:13:10 There are recordings.
20:13:12 :)
20:13:31 [workshop adjourned]
20:13:43 rrsagent, draft minutes
20:13:44 I have made the request to generate https://www.w3.org/2026/02/27-smartagents-main-minutes.html kaz