W3C

Efficient media processing - TPAC breakout

18 Sep 2019

Attendees

Present
Anssi Kostiainen, Dominique Hazaël-Massieux, Gary Katsevman, François Daoust, Chris Needham, Josh O'Connor, Stephan Steglich, Wonsuk Lee, Harald Alvestrand, Paul Adenot, Rijubrata Bhaumik, Jan-Ivar Bruaroey, Peter Thatcher, Youenn Fablet, Emad Omara, and a few others.
Chair
François Daoust
Scribe
anssik

Introduction

See intro slides on Efficient audio/video processing

tidoust: looking at media activities across W3C, and interest in ensuring video and audio can be processed efficiently
... we have a number of groups in this space, WebRTC, Machine Learning, etc. getting folks together to find shared interests

tidoust: use cases we've considered include e.g. media stream analysis, barcode reading, emotion analysis, speech recognition etc.
... custom codecs in JS or Wasm one emerging area of interest
... common need is to be able to process media somehow, on CPU, GPU, workers, again JS or Wasm, perhaps dedicated chipset in the device, or ML algos, or dedicated API such as face recognition as a higher-level abstraction
... what do I mean by efficiency? avoid duplication of bytes in memory to use that resource efficiently
... how to run multiple operations per stream or frame(?)
... media stream track and stream are the primitives we deal with from WebRTC
... do we need to define the mapping between the two?
... how to hook into media from Wasm context?
... also, what do we want to see standardized?
... some of the use cases we'll see today can already be done, perhaps not in the most efficient manner
... that's for the introduction
... riju from Intel will show some demos of use cases that can be implemented today
... paul from Mozilla and harald from Google will also chime in with their perspectives

Media processing & media capture

See Media Capture for the Web slides, try out the WebCamera demos and check the riju/WebCamera repository

[riju showing slides]

riju: 70% of internet traffic is media, so processing it efficiently is important
... HDR in browsers is one such experiment, want to capture, process and display photo in browser
... had to add exposure time feature to getUserMedia, use OpenCV.js to create the HDR photo
... to show power of media streams, everything you can do with native cameras, you can do on the web
... using SIMD and workers
... Facebook, Instagram use native apps for media-related features, but the web can also do that today
... looking at frame rate we see ~15 FPS; real-time use cases require somewhat higher performance for a reasonable experience
... in these demos, no ML on the client, all processing on the CPU
... Facebook messenger funny hats ported to the web was one of the experiments, 15-20 FPS on a typical laptop
... another use case was scanning, barcode reading, also possible without hw-accelerated implementation per our experiments
... yet another use case, camera pan and tilt
... specialized hardware for depth sensing has been enabled for the web via getUserMedia extension, use cases e.g. background segmentation
... less CPU compute needed when processing offloaded to specialized hardware
... experiments show Wasm with SIMD gave us a 5-10x speedup in multiple use cases
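The hot loops riju describes are flat passes over RGBA pixel data, which is exactly what Wasm SIMD vectorizes well. As an illustration of the kind of per-frame work involved (a scalar JavaScript sketch, not the demos' actual code), here is a grayscale pass over an ImageData-style buffer:

```javascript
// Convert RGBA pixels (the layout of ImageData.data) to grayscale in
// place, using the Rec. 601 luma weights. A Wasm SIMD build would
// process several pixels per instruction in this inner loop.
function toGrayscale(rgba) {
  for (let i = 0; i < rgba.length; i += 4) {
    const y = Math.round(
      0.299 * rgba[i] + 0.587 * rgba[i + 1] + 0.114 * rgba[i + 2]
    );
    rgba[i] = rgba[i + 1] = rgba[i + 2] = y; // leave alpha untouched
  }
  return rgba;
}

// One pure red pixel followed by one pure white pixel.
const px = Uint8ClampedArray.from([255, 0, 0, 255, 255, 255, 255, 255]);
toGrayscale(px);
console.log(Array.from(px)); // [ 76, 76, 76, 255, 255, 255, 255, 255 ]
```

In a browser this loop would run on frames pulled from a canvas via `getImageData`; the 5-10x figure riju quotes refers to vectorizing loops of this shape, not to this specific function.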

???: which resolution is used in these demos?

????: impressive, what is the laptop specifically the demos were run on?

riju: 10 FPS on modern phones
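Returning to the depth-sensing point above: with a depth track from the getUserMedia extension, background segmentation can be as simple as a per-pixel threshold. A minimal sketch (the cutoff value and function name are illustrative, not from the demos):

```javascript
// Build a foreground mask from 16-bit depth values (millimeters):
// 1 = keep the pixel, 0 = background. Depth 0 typically means
// "no reading" on depth cameras, so treat it as background too.
function segmentByDepth(depthMm, cutoffMm) {
  const mask = new Uint8Array(depthMm.length);
  for (let i = 0; i < depthMm.length; i++) {
    mask[i] = depthMm[i] > 0 && depthMm[i] <= cutoffMm ? 1 : 0;
  }
  return mask;
}

// Subject at ~0.8 m, wall at ~3 m: keep everything closer than 1.5 m.
const depth = Uint16Array.from([800, 820, 3000, 0, 1400]);
console.log(Array.from(segmentByDepth(depth, 1500))); // [ 1, 1, 0, 0, 1 ]
```

This is the sense in which specialized depth hardware reduces CPU load: the expensive inference step of RGB-only segmentation is replaced by a trivial comparison per pixel.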

Experience from Web Audio and Mozilla's FoxEye project

paul: so I'm from Mozilla; back in the day, the Firefox OS project did not have access to native APIs

paul: how to do competitive real-time image processing on the web was looked into at Mozilla around 2012-14
... findings were that CPU on mobile worked, but not well enough, so we investigated the GPU path and found gaps that we could only fix with proprietary APIs
... running CPU clocked on screen refresh(?)
... processing model of stacked filters is the most important part
... not compositing, but stacking within the same group
... we stopped Firefox OS and now we have access to better technology
... more recently, we started a WebCodecs effort with Google, separate breakout 16:30 today
... would like to coordinate across these two efforts

Peter: Mapping between Streams and MediaStreamTrack is in the WebCodecs proposal, but it doesn't have to be there. Could be done elsewhere for use cases that need it but don't need access to codecs.

dom: any other feedback from Audio work that could apply to video processing?

paul: your API must be programmable
... Web Audio started as a node-composition data-flow paradigm
... we had lots of nodes, that became an issue
... let's also adopt a model that understands images are large and we cannot afford to make redundant copies
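One existing web mechanism that addresses paul's no-redundant-copies point is transferable objects: an ArrayBuffer can be handed to another context (e.g. a worker via `postMessage`) without copying its bytes, detaching it from the sender. A small sketch using `structuredClone` with a transfer list, which uses the same machinery:

```javascript
// A 1920x1080 RGBA frame is ~8 MB; copying it at every processing
// step adds up quickly at 30-60 FPS.
const frame = new ArrayBuffer(1920 * 1080 * 4);

// Transferring moves ownership of the memory instead of copying it.
// With a worker this would be: worker.postMessage(frame, [frame]).
const moved = structuredClone(frame, { transfer: [frame] });

console.log(moved.byteLength); // 8294400: the receiver owns the bytes
console.log(frame.byteLength); // 0: the original is detached, nothing was copied
```

The design question paul raises is how to get the same zero-copy behavior for video frames flowing through a processing graph, where the frames may live on the GPU rather than in an ArrayBuffer.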

Insertable Media Processing proposal

See Insertable Media Processing slides

[harald showing slides]

harald: trying to make life easier for people who use RTCPeerConnection
... if I want to solve one problem only, then this is important

[harald showing requirements, scribe assumes slides will be made available, so won't scribe the slide content]

harald: we have a solution that works, as was demonstrated: use media and canvas elements
... we should be able to do better: with those primitives, browsers cannot optimize
... existing apps that want to add image processing step, you should not need to redesign everything
... we believe composable objects are the future on the web
... when we see pieces going to the right direction, we should harmonize

[harald showing drawing of WebRTC Streams flow]

harald: let's give web apps ability to insert processing between getUserMedia and PeerConnection blocks
... WebCodecs proposal says, encoder is in Stream
... WHATWG Streams are a concept that can be piped
... this is exactly what we need, how can I put this onto the Web with PeerConnection
... let's do that at insert-time with a Factory pattern

[harald explaining a code example with encoderFactory]

harald: JS has some properties done right
... processing step can be offloaded to a worker that can read/write stream
... the goal of the project is to say: given a Stream add this translation into it -- seems like magic?
... the end
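The insert-at-construction idea harald sketches maps directly onto WHATWG Streams: a processing step is a TransformStream piped between a source and a sink. A self-contained sketch of that shape (the frame objects and the "funny hats" overlay step are illustrative stand-ins for real encoded media, not the proposal's API):

```javascript
// A source of "frames", a processing TransformStream inserted in the
// middle, and a sink that collects the result: the pipe shape of the
// insertable-processing proposal, minus real media.
function runPipeline() {
  const source = new ReadableStream({
    start(controller) {
      for (const id of [1, 2, 3]) controller.enqueue({ id, overlay: null });
      controller.close();
    },
  });

  // The inserted step: a "funny hats" style per-frame transform.
  const addOverlay = new TransformStream({
    transform(frame, controller) {
      controller.enqueue({ ...frame, overlay: "hat" });
    },
  });

  const frames = [];
  const sink = new WritableStream({
    write(frame) { frames.push(frame); },
  });
  return source.pipeThrough(addOverlay).pipeTo(sink).then(() => frames);
}

runPipeline().then((frames) =>
  console.log(frames.map((f) => `${f.id}:${f.overlay}`).join(" ")) // "1:hat 2:hat 3:hat"
);
```

Because a TransformStream's readable/writable ends are themselves transferable, the `addOverlay` step could run in a worker, which is the off-main-thread property discussed below.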

chris: processing in the media element, is there a similar solution for that, e.g. with MSE?

harald: this is only for media that arrives via PeerConnection
... similar type of thinking could be applied there, not exactly this solution

Francois: To reformulate, the proposal is to create a hook into PeerConnection to plug WebCodecs. A similar hook is needed in MSE and other APIs.

Jan-Ivar: instead of putting this in sinks, common denominator MediaStreamTrack(?)
... audio worklets need their own context running off the main thread

harald: this should be part of the worker, to solve that architectural issue
... what does PeerConnection actually do with Workers? A bit complex
... APIs might allow codec switching on the fly

Youenn_Fablet_Apple: is this specific to WebRTC? Is MediaRecorder in scope of the proposal?

harald: when working with PeerConnection, porting over to MediaRecorder et al. doable

Youenn_Fablet_Apple: funny hats would probably happen pre-encoding
... WebRTC questions, between PeerConnection and codec, what happens on the network level, two params depending on network connection, how to deal with that?

harald: PeerConnection framework would send the messages, if it needs to modify things it can

Emad_Omara_Google: changing format after encoding?

harald: we haven't defined the format yet, so can do anything

Jan-Ivar: a different API like VideoWorklet, would it be a better abstraction?

Youenn_Fablet_Apple: Stream piped was a great model, similar to audio graph, any learning from that?

paul: cloning an area of optimization, separable filters
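The separable filters paul refers to are a classic image-processing optimization: a k x k 2D convolution costs k*k multiplies per pixel, but a separable kernel factors into two 1D passes costing 2k. A sketch with a 3-tap box kernel, applied along rows and then (via transpose) along columns, with edges clamped:

```javascript
// One 1D 3-tap box pass along each row (edges clamped).
function blurRows(img) {
  const w = img[0].length;
  return img.map((row) =>
    row.map((v, c) => {
      const left = row[Math.max(c - 1, 0)];
      const right = row[Math.min(c + 1, w - 1)];
      return (left + v + right) / 3;
    })
  );
}

// Transpose so the same row pass can run down the columns.
const transpose = (m) => m[0].map((_, c) => m.map((row) => row[c]));

// Separable 3x3 box blur: two cheap 1D passes instead of one 2D pass.
function boxBlur(img) {
  return transpose(blurRows(transpose(blurRows(img))));
}

// A single bright pixel spreads evenly over its 3x3 neighborhood.
const img = [
  [0, 0, 0],
  [0, 9, 0],
  [0, 0, 0],
];
console.log(boxBlur(img)); // [ [ 1, 1, 1 ], [ 1, 1, 1 ], [ 1, 1, 1 ] ]
```

The same factoring applies to Gaussian blur and many other kernels, which is why it shows up as an optimization target in both the audio-graph and video-filter contexts discussed here.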

Next steps

tidoust: 2 minutes left! Where can we continue this discussion?
... WebCodecs seem part of the solution, a generic solution may have to encompass other building blocks. We haven't discussed exposing to GPU for instance.
... No real group now, WebRTC, Audio, Media groups available but split attention, do we want a dedicated CG for incubating the architecture?

harald: we have a point problem and point solution, we have a field of problems that need a field of solutions

tidoust: any suggestions on a venue?
... Media WG could be it, but not in scope, maybe M&E IG?

chris: we could consider reformulating the IG to be the place

tidoust: most stakeholders in this room are in Media WG
... thank you all!
... thanks Anssi for scribing!!!

[End of minutes]

Minutes manually created (not a transcript), formatted by David Booth's scribe.perl version 1.154 (CVS log)
$Date: 2019/09/26 10:17:47 $