14:48:52 RRSAgent has joined #web-speech
14:48:56 logging to https://www.w3.org/2024/09/25-web-speech-irc
14:48:56 RRSAgent, do not leave
14:48:57 RRSAgent, make logs public
14:48:58 Meeting: Web Speech API Improvements
14:48:58 Chair: evanbliu
14:48:58 Agenda: https://github.com/w3c/tpac2024-breakouts/issues/21
14:48:58 Zakim has joined #web-speech
14:48:59 Zakim, clear agenda
14:48:59 agenda cleared
14:48:59 Zakim, agenda+ Pick a scribe
14:49:00 agendum 1 added
14:49:00 Zakim, agenda+ Reminders: code of conduct, health policies, recorded session policy
14:49:00 agendum 2 added
14:49:00 Zakim, agenda+ Goal of this session
14:49:01 agendum 3 added
14:49:01 Zakim, agenda+ Discussion
14:49:01 agendum 4 added
14:49:01 Zakim, agenda+ Next steps / where discussion continues
14:49:02 agendum 5 added
14:49:02 tpac-breakout-bot has left #web-speech
21:46:16 dom has joined #web-speech
21:49:08 tidoust has joined #web-speech
21:54:42 hgo has joined #web-speech
21:56:34 nigel_ has joined #web-speech
21:56:49 Present+ Nigel_Megitt
21:58:19 scribe+ nigel
21:58:32 Early introductions (no scribe)
21:58:51 Chair: Evan_Liu
21:59:03 mjwilson has joined #web-speech
21:59:06 dom: Is there a link to LLMs?
21:59:26 Evan_Liu: We might use LLMs in Google internally, but the API won't change
21:59:49 dom: Signal to abort processing?
21:59:59 Evan_Liu: Probably out of scope.
22:00:11 .. We're not talking about changing the core capabilities beyond introducing this new functionality
22:00:38 Topic: Offline Speech Recognition
22:00:52 jcraig has joined #web-speech
22:01:07 https://github.com/WICG/speech-api/issues/108
22:01:23 [slide]
22:01:33 Evan_Liu: These proposals have been around for a while,
22:01:44 .. for on-device speech recognition
22:02:18 .. Proposing to support it by introducing two new attributes on the SpeechRecognition interface
22:02:25 .. localService attribute
22:02:31 .. allowCloudFallback attribute
22:02:38 .. both Booleans. Names may change.
22:02:54 .. A cloud speech-to-text service can support more options than on-device
22:03:04 .. Might combine the bools into a 3-value enum
22:03:14 .. Idea is to control where the speech recognition is allowed to happen.
22:03:20 .. New methods:
22:03:30 .. allow triggering download of a language pack
22:03:41 .. query if on-device speech recognition is available
22:03:48 .. Privacy concerns:
22:04:06 .. Chrome planning to only allow websites to install a language pack if it matches the user's
22:04:27 .. primary language, or if not, ask permission
22:04:39 Jer: What's the use case for disabling local speech recognition?
22:05:00 Evan_Liu: More about allowing server-based STT, but also users may
22:05:14 .. not want to use the CPU on their machine
22:05:25 Jer: When would you ever set local service to false?
22:05:42 .. Sounds like only if the website knows it may be doing other CPU-intensive tasks
22:06:39 ningxin: discussion of on/off-device use cases, including device capabilities
22:07:19 smaug has joined #web-speech
22:07:36 (smaug == Olli Pettay)
22:07:40 TylerWilcock has joined #web-speech
22:07:49 solis has joined #web-speech
22:07:50 smaug: [missed]
22:07:57 ningxin has joined #web-speech
22:08:12 michael_wilson: Why not an enum?
22:08:12 eric_carlson has joined #web-speech
22:08:22 Evan_Liu: No reason, seems like it would be simpler based on this discussion
22:08:39 Jer: An enum with 3 values would mean each option is clearer to understand
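[Illustrative sketch: a TypeScript rendering of the two proposed booleans versus the single three-value enum raised in the discussion; all names and enum values here are placeholders, not agreed API.]

  // Hypothetical shapes only; nothing below is shipped or specified.
  // Shape A: the two booleans as proposed (names may change).
  interface OnDeviceFlags {
    localService: boolean;       // require on-device recognition
    allowCloudFallback: boolean; // permit falling back to a cloud recognizer
  }
  // Shape B: the single three-value enum suggested in the discussion.
  type RecognitionMode =
    | "ondevice-only"       // never send audio off the device
    | "ondevice-preferred"  // use on-device if available, else cloud
    | "cloud-only";         // always use the cloud recognizer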
22:09:03 smaug: the "on" prefix...
22:09:11 mjwilson has joined #web-speech
22:09:15 Evan_Liu: Good point, maybe it should be "is"
22:09:25 Jer: Seems like a property, so a getter would make sense
22:09:54 solis: "On Device" might make more sense than "Local"
22:10:07 Evan_Liu: It's matching existing spec text; I prefer "On Device" too. Any preferences?
22:10:19 dom: The boundary of a device might evolve over time.
22:10:31 .. Trying to communicate the privacy aspect of the choice here.
22:10:46 .. Instead of focusing on this abstract boundary, focus on what is driving the developer's choice.
22:11:00 .. Not sure it matters, as much as the privacy implications based on which environment is shown.
22:11:14 .. Not sure how to formulate this.
22:11:17 .. Hybrid models might exist, so this distinction might not be what we need.
22:11:27 tidoust: The notion of "parties" comes to mind, but that's also not right.
22:11:40 .. Could say it's first party for the UA, or third party for someone else.
22:11:53 .. Not sure that's the right way either.
22:11:59 .. Google has plenty of hats here.
22:12:41 Evan_Liu: installOnDeviceSpeechRecognition() - returns a boolean,
22:12:54 .. could take minutes to download, so we just return whether the fetch has been initiated.
22:13:06 .. Could have event listeners to reveal when the installation is complete
22:13:16 .. Or return true when the installation is complete
22:13:24 eric_carlson: Some of the language packs are quite large.
22:13:38 .. Do we need to be concerned about allowing a page to use a lot of user data in this way?
22:13:47 .. e.g. people on limited data plans
22:14:06 Evan_Liu: Another Chrome criterion is whether the user is on a cellular network or on WiFi/ethernet.
22:14:23 .. Could be in the spec, or could be in the browser, depending on what people want.
22:14:37 smaug: This install feels very scary; needs to be async, and behind permissions.
22:14:45 .. May never get past the installation.
22:14:51 .. Maybe return a promise
22:15:13 Evan_Liu: This would be async, and would return as soon as the user signals their preference
22:15:20 smaug: The user may not say anything
22:15:36 Evan_Liu: Any concerns about that API?
22:15:46 eric_carlson: Should be a promise
22:15:54 .. May not ever return, or the download may time out
22:16:08 .. To prevent polling, it should resolve once it's been downloaded and is available for use
22:16:14 .. Polling is an anti-pattern
22:16:43 scribe+
22:16:44 nigel: Do you have a scheme in mind for caching if two pages want to fetch the same language?
22:17:11 Evan_Liu: For Chrome, there will always be only one package per language
22:17:28 smaug: There may be a privacy issue here: where a page polls for installed languages. Fingerprinting problem.
22:17:59 Evan_Liu: True. If it requires a pop-up that scares people away, we feel the fingerprinting issue goes away
22:18:14 eric_carlson: Same issue with fonts. Recurring issue.
22:19:55 Jer: Not sure if this is something that could be partitioned
22:20:05 .. Always ask the user for every page, even if the language pack is already available
22:20:25 eric_carlson: Interesting idea: there was a way to get microphone permissions which was used
22:20:33 .. for fingerprinting.
22:20:47 .. We returned a fixed list until the user started to capture, and only then returned the correct list
22:21:02 .. Do something similar here: signal only the user's language as available,
22:21:10 .. until the API actually asks to download something.
22:21:19 .. If the pack is already there, return more quickly.
22:21:32 .. The user would get prompted every time, which would be weird.
22:21:42 .. If you prompt on download request and then reveal what's actually available,
22:21:47 .. that could be one way to handle it.
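[Illustrative sketch: a TypeScript rendering of the promise-based install/availability flow discussed above; installOnDeviceSpeechRecognition matches the slide, but the availability query name, parameters, and return types are assumptions, not settled API.]

  // Assumed shape, per the discussion: install is async, may prompt the user,
  // and resolves only once the language pack is downloaded and usable.
  interface OnDeviceRecognitionHost {
    onDeviceSpeechRecognitionAvailable(lang: string): Promise<boolean>; // placeholder name
    installOnDeviceSpeechRecognition(lang: string): Promise<boolean>;   // proposed method
  }

  async function ensureOnDevicePack(host: OnDeviceRecognitionHost, lang: string): Promise<boolean> {
    if (await host.onDeviceSpeechRecognitionAvailable(lang)) return true;
    // Resolving on completion (not on "download started") avoids polling,
    // which the group called out as an anti-pattern.
    return host.installOnDeviceSpeechRecognition(lang);
  }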
22:22:03 dom: If you ignore MediaStreamTrack for the minute, right now the Speech API does go through
22:22:31 .. a prompt. Agree that default language makes sense, but there may be an opportunity
22:22:38 .. for bringing something into the prompt.
22:22:47 .. It's super privacy-invasive to do recognition in the first place
22:23:03 eric_carlson: Could be that microphone permission is enough of a barrier
22:23:22 mjwilson: Is it assumed to only work on mic, or other audio resources?
22:23:34 eric_carlson: Wouldn't work for other audio streams
22:23:50 Evan_Liu: Could have auto-download as soon as recognition begins
22:23:56 .. Should this be in the spec?
22:24:08 eric_carlson: The spec must have at least recommendations.
22:24:15 .. Leave room for UAs to figure out other ways
22:24:28 dom: If you leave privacy mitigations [to UAs] and let them have interop impact, then it's a race to the bottom,
22:24:38 .. so agree, put something in the spec
22:24:55 jer: Best way would be to only allow STT for the user's language
22:25:06 .. Could imagine foreign-language STT and then translation though
22:25:15 .. Or could have a list of preferred languages
22:25:27 dom: Except for Duolingo!
22:25:36 jer: That already knows the text though.
22:25:59 dom: Hard to separate this from whether it only applies to live capture or not;
22:26:15 .. as soon as you cut that link it's not an effective mitigation, so need other mitigations for non-live capture
22:26:32 jer: Misunderstood the Duolingo example - the user is speaking and the page wants to know if
22:26:37 .. they spoke correctly
22:28:15 nigel: I used the text-to-speech features of browsers and experienced divergence in performance across browsers, especially in start-up times.
22:28:33 ... Any plan to signal "readiness"?
22:28:39 q+
22:28:44 ... To avoid missing bits?
22:29:14 Evan_Liu: No real requirement in the spec that things must happen within a specified amount of time.
22:29:39 jcraig: Presumably that's a bug; it's not supposed to drop bits.
22:30:37 Nigel: What about extending to other types of sounds? Sometimes it's not only speech you're interested in.
22:30:52 ... There are machine-based products that recognize sound.
22:31:43 jcraig: My understanding of the proposal is that the main interest is converting speech to text.
22:32:06 Topic: MediaStreamTrack support
22:32:15 [slide] github issue #66
22:32:46 Evan_Liu: start() method requests permission to use the microphone, and starts using it if allowed
22:33:15 jer: If you send noise to a speech recognition API, you can figure out what platform is being used,
22:33:29 .. so it's a big entropy change for fingerprinting
22:33:40 .. needs to be listed in the spec: exposing arbitrary speech recognition
22:34:11 nigel: I haven't understood the use case properly. Why does the page need the API? Why not have the user trigger the microphone and provide a text stream? Why does the page need to know?
22:34:39 eric_carlson has joined #web-speech
22:34:43 Evan_Liu: For video conferencing tools, streams may be coming from elsewhere.
22:35:15 nigel: It does not seem very sustainable to do the work on all clients.
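[Illustrative sketch: how a page might hand an existing MediaStreamTrack to recognition for the video-conferencing use case; the start(track) argument is the proposal in issue #66, not shipped behaviour, and the minimal local typings exist only to keep the sketch self-contained.]

  // Minimal local typing so the sketch stands alone; real pages would use the
  // browser's (currently webkit-prefixed) SpeechRecognition object.
  interface RecognitionLike {
    continuous: boolean;
    onresult: ((event: { results: ArrayLike<ArrayLike<{ transcript: string }>> }) => void) | null;
    start(track?: MediaStreamTrack): void; // track argument is the proposed extension
  }

  function captionTrack(recognition: RecognitionLike, remoteTrack: MediaStreamTrack): void {
    recognition.continuous = true;
    recognition.onresult = (event) => {
      const latest = event.results[event.results.length - 1];
      console.log("caption:", latest[0].transcript);
    };
    // Recognize audio from an existing track (e.g. a remote conference participant)
    // instead of requesting the microphone again.
    recognition.start(remoteTrack);
  }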
22:36:38 [discussion of use cases] - live translate and dub
22:36:56 jcraig: Sender side could have better sound quality from the local mic
22:37:13 dom: Completely muted conference, only sending speech text
22:37:33 jer: Mitigation could be a mandated delay so a website can't immediately get an answer.
22:37:55 .. Requiring a minimum of 30s before producing text mitigates by forcing the website to wait to get fingerprinting data
22:38:11 .. One possible mitigation to privacy impacts. Not a solution, just increases cost for the page.
22:38:22 Evan_Liu: Concern is profiling performance capabilities?
22:38:48 q+
22:38:49 jer: Not perf; each implementation might give a different answer, so you can tell devices apart
22:39:18 eric_carlson: Might get fine-grained information about the machine
22:39:35 jer: Could be less risky for Chrome, having a large number of users with the same characteristics
22:40:00 mjwilson: This isn't theoretical, it actually happened with Web Audio; allows fine-grained identification
22:40:11 jer: which is why we bring this up, it's been an active area of attack
22:40:12 q?
22:40:18 ack me
22:40:19 ack mjwilson
22:40:31 Topic: Spoken Punctuation Parameter
22:40:59 Evan_Liu: boolean attribute - if true, uses the punctuation; if false, spells it out, e.g. "comma"
22:41:07 q+
22:41:12 .. If you're using it for captioning you might want to spell it out
22:41:27 dom: i18n questions about what constitutes punctuation
22:41:39 .. likely to regret a boolean, there will be other choices
22:41:51 Evan_Liu: Could start with an enum to make it more extensible
22:42:02 jcraig: Agree, verbosity of screen readers is use-case dependent
22:42:23 Nigel: Does this control unspoken punctuation?
22:42:36 Evan_Liu: No; Google supports it but we haven't had a request for that yet
22:42:40 tidoust: Could be an array
22:42:58 ack ningxin
22:43:16 ningxin: For mass education, want to speak a formula and have that appear.
22:43:36 .. Send it to the cloud. Previously used the Web Speech API with a different backend, but got a different answer
22:43:52 q+
22:43:54 .. Would be helpful to do mathematical representation output
22:43:59 ack mjwilson
22:44:15 mjwilson: Last year the MathML WG had a presentation on spoken math, a very interesting and deep topic
22:44:24 Topic: Remove SpeechGrammar
22:44:34 Evan_Liu: Not implemented, or well defined; there are requests to remove it.
22:44:43 that would have been Neil Soiffer discussing spoken MathML and the MathJax project
22:44:51 .. Seems to be consensus to remove; not controversial.
22:44:58 .. It was intended to do biasing support
22:45:01 Topic: Biasing support
22:45:16 Evan_Liu: Add bias to certain phrases; depends on recognition support.
22:45:30 .. Chrome's recognition supports this. It's pretty generic, not tied to specific use cases.
22:45:53 dom: Guess this would be super useful; would want to include language info and substream tagging
22:46:08 .. If the phrase is used in several languages, you need to tag that, so need a different structure
22:46:19 Evan_Liu: Each recogniser only works for a single language
22:46:39 dom: Does the API need to support multiple-language audio? Need to surface that in the API.
22:46:51 Evan_Liu: Multi-language recognition in the same phrase is not supported
22:47:10 dom: If I put my French name into an English phrase, the pronunciation would change
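[Illustrative sketch: one possible structure for phrase biasing with an explicit language tag, reflecting dom's point that phrases need language information; the property names (phrases, boost) are placeholders, not agreed API.]

  // Hypothetical structure only; names and semantics are not settled.
  interface BiasPhrase {
    phrase: string;  // text to bias toward, e.g. a name or domain term
    boost?: number;  // relative weight, meaning left to the recognizer
  }

  interface BiasingOptions {
    lang: string;           // language of the audio; one language per recognizer, per the discussion
    phrases: BiasPhrase[];  // biasing hints
  }

  const example: BiasingOptions = {
    lang: "en-US",
    phrases: [{ phrase: "Ke$ha", boost: 2 }, { phrase: "LaTeX" }],
  };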
22:47:10 q+ to mention musical artists like Ke$ha... technical or domain terms that don't align well (not a perfect example, but LaTeX is "la-tek" not "latex")
22:47:14 q+ to mention pages
22:47:24 dom: If I put my Chinese name, that wouldn't work either.
22:47:33 .. Fair, needs to be clear what language the stream is in
22:47:44 .. Don't know how you take it back
22:48:10 jcraig: A blind colleague laughed because Ke$ha was pronounced "Key dollar har"
22:48:24 .. Also other words get pronounced differently by domain experts.
22:48:28 .. Need phonetic hinting, perhaps IPA "International Phonetic Alphabet"
22:48:31 ack jcraig
22:48:31 jcraig, you wanted to mention musical artists like Ke$ha... technical or domain terms that don't align well (not a perfect example, but LaTeX is "la-tek" not "latex")
22:50:39 ack n
22:50:39 nigel, you wanted to mention pages
22:50:42 mjwilson has left #web-speech
22:50:58 Nigel: There was a case recently in the news where "pagers" got misrecognised as "pages".
22:51:36 .. So, as jcraig said: need to know the phonetic details
22:51:56 .. Other point: need to know how you would layer in speaker recognition, or changes of speaker,
22:51:59 .. again for the captioning use case
22:52:10 mjwilson: Also language recognition
22:52:19 Topic: Meeting close
22:52:27 Evan_Liu: We're 2 minutes over, let's close
22:52:31 rrsagent, make minutes
22:52:33 I have made the request to generate https://www.w3.org/2024/09/25-web-speech-minutes.html nigel
22:54:43 Present+ Evan_Liu, James_Craig, Jer_Noble, Michael_Wilson, Francois_Daoust, Dom, Ningxin, smaug, solis, Eric_Carlson
22:54:47 Chair: Evan_Liu
22:54:48 rrsagent, make minutes
22:54:50 I have made the request to generate https://www.w3.org/2024/09/25-web-speech-minutes.html nigel
23:08:56 dom has left #web-speech