W3C

– DRAFT –
WebRTC NV Use Cases (TPAC 2021 breakout)

20 October 2021

Attendees

Present
Anssi_Kostiainen, Barbara_Hochgesang, Carine_Bournez, Chris_Needham, Chun_Gao, Cullen_Jennings, Dominique_Hazael_Massieux, Dwalka, Eero_Häkkinen, Elad_Alon, Eric_Carlson, Eric_Mwobobia, Florent_Castelli, Fu_Xiao, Harald_Alvestrand, Hunter_Loftis, Jan-Ivar_Bruaroey, Jeffrey_Jaffe, juliana, Julien_Rossi, Jungkee_Song, Justin_Pascalides, Kazuhiro_Hoya, Lance_Deng, Larry_Zhao, Lea_Richard, Louay_Bassbouss, mamatha, Marcelo_Xella, martin_wonsiewicz, Peipei_Guo, Philippe_Le_Hegaret, Priya_B, Randell_Jesup, Rijubrata_Bhaumik, Ruinan_Sun, Shengy, Shengyuan_Feng, Shihua_Du, Takio_Yamaoka, Tove, Tuukka_Toivonen, Xiaoqian_Wu, Xuan_Li, Xueyuan_Jia, Yajun_Chen, youenn, zhenjie, Zhibo_Wang, Zoltan_Kis
Regrets
-
Chair
-
Scribe
dom

Meeting minutes

Slideset: https://lists.w3.org/Archives/Public/www-archive/2021Oct/att-0003/TPAC_2021_breakout_WebRTC_-_NV_Use_cases.pdf

[ Slide 2 ]

Riju: there's been a massive surge in popularity of a number of features in native apps
… can we make these features available to the Web Platform by leveraging the underlying platform support, often accelerated by dedicated hardware?
… the top features we need are listed in the slide

[ Slide 3 ]

Riju: [shows video face detection API]

[ Slide 4 ]

Riju: all video platform stacks do some kind of face detection
… in fact many cameras optimize their processing (e.g. exposure) based on face detection
… most of that is done via neural network models
… today, on the Web this would be done with e.g. tensorflow.js with a WASM or GPU backend [see sketch below]
… there are cloud-based solutions for this
… these aren't restricted by model size - they can deal with blur, emotion, ...
… some of them are coming to client side as well
… it's always better when the data can be processed locally, and ideally without extra compute
… cameras already tend to do that for optimizing their own behavior
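
[Illustrative sketch, not from the slides: the tensorflow.js approach mentioned above, using the BlazeFace model; assumes the tfjs and @tensorflow-models/blazeface scripts are loaded:]

    // Run face detection on a <video> element with tensorflow.js + BlazeFace.
    async function detectFaces(video) {
      const model = await blazeface.load();           // download and initialize the model
      const predictions = await model.estimateFaces(video);
      for (const p of predictions) {
        // Each prediction has a bounding box, six landmarks and a confidence score.
        console.log('face:', p.topLeft, p.bottomRight, p.landmarks, p.probability);
      }
    }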

[ Slide 5 ]

Riju: the shape detection API includes a face detector
… we could achieve similar results using proposals like Harald's breakout box
… making the API work directly on MediaStreamTrack would help in terms of developer ergonomics
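
[Illustrative sketch: the Shape Detection API face detector mentioned above, which today operates on still images (experimental, behind a flag in Chromium) rather than directly on a MediaStreamTrack:]

    // Detect faces in a still image, e.g. an ImageBitmap grabbed from a video.
    async function detectOnImage(imageBitmap) {
      const detector = new FaceDetector({ fastMode: true, maxDetectedFaces: 5 });
      const faces = await detector.detect(imageBitmap);
      for (const face of faces) {
        console.log('box:', face.boundingBox, 'landmarks:', face.landmarks);
      }
    }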

[ Slide 6 ]

Riju: which knobs to expose on the Web platform?
… this shows a snapshot of the API our implementation is using
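
[Illustrative sketch only - the constraint names below are hypothetical, not the API on the slide; they show how such knobs could surface as MediaStreamTrack constraints:]

    // Hypothetical: none of these constraint names are standardized.
    async function enableFaceDetection() {
      const stream = await navigator.mediaDevices.getUserMedia({ video: true });
      const [track] = stream.getVideoTracks();
      await track.applyConstraints({
        advanced: [{ faceDetectionMode: 'bounding-box' }]  // hypothetical knob
      });
      console.log(track.getSettings());  // would include the hypothetical setting
    }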

[ Slide 7 ]

Riju: all new Windows 11 compatible devices have a feature for gaze correction
… FaceTime on iOS has something similar
… they point to a genuine user need
… Harald mentioned the risk of uncanny valley for these features
… giving somewhat unsettling results - almost but not quite natural
… this may need more data on whether this is worth proceeding

[ Slide 8 ]

Riju: [video demonstrating background blur with the standard Windows platform feature]
… background blur has been one of the most used features in videoconferencing apps

[ Slide 9 ]

Riju: on the Web this can be done with TF with WASM/GPU (e.g. in jitso); Google Meet has that feature as well
… this could be combined with background replacement
… the platform APIs don't expose the blur level at the moment

[ Slide 10 ]

riju: [showing what the API could look like with blur/replacement separated]
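
[Illustrative sketch only - hypothetical constraint names, showing blur as a separate, feature-detectable knob as the slide suggests:]

    // Hypothetical: backgroundBlur is not a standardized constraint name.
    async function enableBackgroundBlur(track) {
      const capabilities = track.getCapabilities();
      if ('backgroundBlur' in capabilities) {
        await track.applyConstraints({ advanced: [{ backgroundBlur: true }] });
      }
    }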

[ Slide 11 ]


[ Slide 12 ]

riju: this is an on-device speech to text demo
… there were a few errors, I'm not a native English speaker
… there is a Web Speech API already available, but we wanted to use local compute capabilities
… there are live-caption capabilities built into Chromium
… we wonder whether this will be integrated into Web Platform APIs
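
[For comparison, a sketch of the existing Web Speech API; whether recognition runs on-device or in the cloud is currently up to the user agent and is not exposed to the page:]

    const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
    const recognition = new SpeechRecognition();
    recognition.continuous = true;      // keep recognizing, for live captions
    recognition.interimResults = true;  // surface partial results as they arrive
    recognition.onresult = (event) => {
      const latest = event.results[event.results.length - 1];
      console.log('caption:', latest[0].transcript);
    };
    recognition.start();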

[ Slide 13 ]

riju: we've also looked at providing noise suppression to the Web platform
… our Proof of Concept isn't ready yet
… the noiseSuppression track setting is currently a boolean
… maybe it could be extended to an enum? or is it an implementation detail transparent to the developer?
… is it something the UA decides? is it something the end-user decides to keep their audio local?
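
[Illustrative sketch: noiseSuppression is a standardized boolean constraint today; the enum value below is hypothetical, showing what such an extension might look like:]

    async function captureAudio() {
      // Standardized today: noiseSuppression as a boolean constraint.
      const stream = await navigator.mediaDevices.getUserMedia({
        audio: { noiseSuppression: true }
      });
      // Hypothetical extension: an enum selecting the suppression strategy.
      const [track] = stream.getAudioTracks();
      await track.applyConstraints({ noiseSuppression: 'platform-optimized' });  // hypothetical value
    }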

Harald: interesting presentation
… one of the central things that we need a conversation about is
… what do we do when the platform we're running on integrates multiple things in one place
… e.g. face detection
… when it happens on the camera, the camera has more info than what it passes on in the video stream
… which can't be obtained otherwise later
… but conversely, depending on apps, what you want out of face detection varies a lot
… e.g. if you want to position a hat vs position objects linked to the direction of the gaze
… I find it difficult to design an API that creates interoperability for apps
… we may want to go down a level and talk about allowing video annotations on a frame by frame level
… so that anyone can adapt to their own needs
… while still having standardized annotations to allow any kind of processing
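
[Illustrative sketch of the "breakout box" direction: MediaStreamTrackProcessor (available in Chromium) exposes frames to JS; the annotations field below is hypothetical, standing in for standardized per-frame metadata:]

    async function consumeFrames(track) {
      const processor = new MediaStreamTrackProcessor({ track });
      const reader = processor.readable.getReader();
      while (true) {
        const { value: frame, done } = await reader.read();
        if (done) break;
        // frame.annotations?.faces -- hypothetical standardized face metadata
        frame.close();  // release the VideoFrame promptly
      }
    }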

riju: great points
… I was wondering how useful exposing just a rectangle would be
… platform APIs don't expose masks which would be needed for replacement
… the APIs are distinct for face detection (rectangle) vs background blur
… we were trying to get as many attributes as possible from lower down the stack, from the camera
… via a getAttributes call, e.g. to get facial landmarks
… we need to add more

Cullen: native performance exceeds what can be achieved via WASM today for this type of use case
… ideally we would be able to push a neural network to cameras directly
… computer vision and audio processing are way ahead in native clients compared to Web apps today
… it's not possible to achieve the same on the Web

riju: the Web exposes only CPU & GPU, through WASM and WebGL
… but there is more and more specialized hardware coming for that kind of workload and available in native apps
… we're trying to mirror what native platform apps do instead of having everyone bring their own framework

ChrisNeedham: thank you for organizing this
… it seems to me that it opens up a number of privacy and ethical questions
… e.g. face detection - how well will it work across different skin tones?
… even more of a concern with additional details (age, gender, emotion)
… if we're providing lower-level APIs where developers build these capabilities, we're not baking possibly biased primitives into the platform
… that also applies to speech-to-text - will it work well for all users?
… re privacy, on-device vs in-cloud will be key - people have high awareness of this
… allowing on-device, signaling very clearly when in-cloud would be important

Riju: what I showed was a proof of concept, whose quality depends on the quality of the model
… in this case, it wasn't state of the art
… a better one like SODA from Google would give better results
… biases can be dealt with through better models
… exposing to users whether the compute is local or cloud
… would help being transparent with users
… noise suppression algorithms today are running mostly on the cloud
… which may make people uncomfortable; we should give them the choice
… In platform APIs, only "blink" and "smile" are exposed consistently
… for age/gender, it's mostly based on a "bring your own model" approach
… the objective of our discussions today is to leverage the underlying platform capabilities instead of bringing up frameworks
… when the Web platform cannot provide equivalent capabilities, people use other platforms
… we may need to add additional permission prompts to help with privacy
… there may be attributes we wouldn't expose (e.g. age, gender)

Jan-ivar: I have some concerns
… one is redundancy - we're already working on exposing data to JS on MediaStreamTrack
… also concerns about standardizing across platforms
… not sure browsers should own these features - how far would we go?
… e.g. sepia tone, filters, ...
… maybe for face detection browsers would be better positioned
… browsers may do a better job at face tracking, e.g. from a diversity aspect

riju: we need this if we want to bridge the gap with native in terms of capabilities
… in terms of WG, I'm happy to look at which groups would be best to incubate this

jan-ivar: Would you expect browsers to polyfill the results in software if they're not available from hardware?
… e.g. face detection

riju: face detection is available across Windows, ChromeOS, Mac, ...

jan-ivar: then background blur? how to deal with it if not hardware accelerated?

riju: it would depend on the performance

Tuukka: if the platform exposes the API, the browser would expose it, and if not, it would not
… in some cases, it's not clear where in the camera stack it runs
… but we should assume that if the platform exposes it, it brings better performance

Youenn: good and exciting topic
… it's fine if we can expose OS-level data like face detection
… there will be more and more on-device processing with end-to-end-encryption
… so cloud processing will go down
… e.g. speech recognition API should have a local mode
… WASM can always provide a fallback approach for local processing
… we need to look at what producers do in terms of generating the data, if it's consistent across providers
… what consumers expect
… and if there is a sweet spot among all of these
… even if face detection is not exactly what the app wants, it may be a good starting point for apps
… I would start with face detection
… if background blur is possible, maybe - but it can already be done with existing APIs
… going deeper with e.g. binary depth map - maybe, but that seems more difficult, not sure we're there yet
… face detection would be a good way to explore this space

riju: thanks - starting with face detection is indeed what I would suggest
… even if the app doesn't want exactly the rectangle, it still reduces the size of what you have to search for [see sketch below]
… re echo cancellation, it was only supported on Mac via system settings
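
[Illustrative sketch of the point above: even a coarse platform-provided rectangle lets an app crop the frame before running its own model, shrinking the search space:]

    // Crop a video frame to a detected face box using standard canvas APIs.
    function cropToFace(video, box) {
      const canvas = document.createElement('canvas');
      canvas.width = box.width;
      canvas.height = box.height;
      const ctx = canvas.getContext('2d');
      ctx.drawImage(video, box.x, box.y, box.width, box.height,
                    0, 0, box.width, box.height);
      return canvas;  // feed this smaller image to the app-specific model
    }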

youenn: I think the WebRTC WG is the right place to do this work given the tie to media capture work
… raising an issue in mediacapture-extensions

Jeff: thanks, very interesting and exciting to improve the collaboration capabilities of the platform
… I got the impression that some of these ideas have different levels of maturity
… I was wondering if you had some thoughts about what's the right location for the work to be done?
… some of it maybe in the WebRTC WG, some in other WGs, some in Community Groups for incubation?

riju: Harald advised filing face detection and background blur in mediacapture-extensions
… I initially filed them in image capture
… they're a mixture of Media/WebRTC WG
… I'll take advice from their chairs
… mediacapture-extensions repo seems to be one of the right places

randell: I'd echo Jan-Ivar's comments
… one thing that concerned me is the possibility of locking-in an API that is tied to specific capabilities on specific hardware
… removing flexibility to future alternatives
… e.g. today face recognition gives a rectangle, and the future could provide an oval
… if you're stuck with a rectangle API, you won't be able to take advantage of that
… there are many things provided by hardware, drivers, OS
… we can't add an API for each of those - there needs to be thought about exposing these hardware and OS features in a more general manner
… so that we can more easily integrate future capabilities into this framework
… that's the approach I would suggest

riju: thank you randell
… this is based on platform APIs available across all platforms today
… these features are used by lots of people, with a high demand
… if we don't provide this on the Web, everyone will have to do it on their own with lower performance
… to keep up, everybody will bring their own framework and infrastructure
… sometimes it might make sense not to bring everything to the Web
… but with the level of popularity...

randell: I'm not saying that we shouldn't do this, but that we need to develop it in a way that allows evolution
… it's too focused on catching up with the current state

Cullen: I would discourage face-detection
… lots of variability of needs, big risks of biases
… it doesn't run on all platforms where browsers ship
… I think background blur would be much better
… it's really compute intensive and hard to do well
… it has much less variation in how it would be used

Harald: in order to do background removal well, you actually have to do foreground/person detection
… an exploration that might be useful anyway would be annotating video frames with areas of interest
… that would be a good assist to both features while staying independent of specific applications

Minutes manually created (not a transcript), formatted by scribe.perl version 149 (Tue Oct 12 21:11:27 2021 UTC).