See intro slides on Efficient audio/video processing
tidoust: we are looking at media activities across W3C, and there is interest in ensuring video and audio can be processed efficiently
... we have a number of groups in this space: WebRTC, Machine Learning, etc.; this session is about getting folks together to find shared interests
tidoust: use cases we've considered include media stream analysis, barcode reading, emotion analysis, speech recognition, etc.
... custom codecs in JS or Wasm are one emerging area of interest
... a common need is to be able to process media somehow: on the CPU or GPU, in workers, again in JS or Wasm, perhaps on a dedicated chipset in the device, with ML algorithms, or via a dedicated API such as face recognition as a higher-level abstraction
... what do I mean by efficiency? avoiding duplication of bytes in memory, to use that resource efficiently
... how to run multiple operations per stream or frame(?)
... MediaStreamTrack and Stream are the primitives we deal with from WebRTC
... do we need to define the mapping between the two?
... how to hook into media from a Wasm context?
... also, what do we want to see standardized?
... some of the use cases we'll see can already be done today, perhaps not in the most efficient manner
... that's for the introduction
... riju from Intel will show some demos of use cases that can be implemented today
... paul from Mozilla and harald from Google will also chime in with their perspectives
See Media Capture for the Web slides, try out the WebCamera demos and check the riju/WebCamera repository
[riju showing slides]
riju: 70% of internet traffic is media, so processing it efficiently is important
... HDR in browsers is one such experiment: we want to capture, process, and display the photo in the browser
... we had to add an exposure time feature to getUserMedia, and we use OpenCV.js to create the HDR photo
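[Scribe note: a minimal sketch of the exposure control riju describes, assuming the exposureMode/exposureTime constraints from the Media Capture image extensions are supported; the bracketing loop and OpenCV.js merge step are not shown.]

    async function setMaxExposure() {
      // Grab the camera, then ask for manual exposure so bracketed shots
      // at different exposure times can be merged into an HDR photo.
      const stream = await navigator.mediaDevices.getUserMedia({ video: true });
      const [track] = stream.getVideoTracks();
      const caps = track.getCapabilities();
      if ('exposureTime' in caps) {
        await track.applyConstraints({
          advanced: [{ exposureMode: 'manual', exposureTime: caps.exposureTime.max }]
        });
      }
      return stream;
    }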
... to show the power of media streams: everything you can do with native cameras, you can do on the web
... using SIMD and workers
... Facebook and Instagram use native apps for media-related features, but the web can also do that today
... looking at performance we see ~15 frames per second (FPS); real-time use cases require somewhat higher performance for a reasonable experience
... in these demos there is no ML on the client; all processing is on the CPU
... porting Facebook Messenger's funny hats to the web was one of the experiments, running at 15-20 FPS on a typical laptop
... another use case was scanning: barcode reading is also possible without a hardware-accelerated implementation, per our experiments
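[Scribe note: a sketch of in-browser barcode scanning, assuming the experimental Shape Detection API (BarcodeDetector) is available; where it is not, a JS/Wasm decoder can be substituted, per riju's point that no hardware-accelerated implementation is required.]

    const detector = new BarcodeDetector({ formats: ['qr_code', 'ean_13'] });
    const video = document.querySelector('video');  // playing a getUserMedia stream

    async function scanFrame() {
      // detect() accepts a CanvasImageSource, including a playing <video>
      const barcodes = await detector.detect(video);
      for (const code of barcodes) console.log(code.rawValue);
      requestAnimationFrame(scanFrame);
    }
    scanFrame();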
... yet another use case: camera pan and tilt (see the sketch at the end of this turn)
... specialized hardware for depth sensing has been enabled for the web via a getUserMedia extension; use cases include e.g. background segmentation
... less CPU compute is needed when processing is offloaded to specialized hardware
... experiments show Wasm with SIMD gave us a 5-10x speedup in multiple use cases
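[Scribe note: the pan/tilt sketch referenced above, assuming the pan/tilt/zoom additions to Media Capture; constraint names follow that proposal, and value ranges vary per device.]

    async function panToCenter() {
      // Request PTZ access up front, then drive the camera via constraints.
      const stream = await navigator.mediaDevices.getUserMedia({
        video: { pan: true, tilt: true }
      });
      const [track] = stream.getVideoTracks();
      const caps = track.getCapabilities();
      if ('pan' in caps) {
        // Pan to the middle of the supported range.
        await track.applyConstraints({
          advanced: [{ pan: (caps.pan.min + caps.pan.max) / 2 }]
        });
      }
    }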
???: which resolution is used in these demos?
???: impressive; what laptop specifically were the demos run on?
riju: 10 FPS on modern phones
paul: I'm from Mozilla; back in the day, the Firefox OS project did not have access to native APIs
... how to do competitive real-time image processing on the web was investigated at Mozilla around 2012-14
... findings were that the CPU path on mobile worked, but not well enough, so we investigated the GPU path and found gaps that we were only able to fix with proprietary APIs
... running the CPU clocked to the screen refresh(?)
... the processing model of stacked filters is the most important part
... not compositing, but stacking within the same group
... we stopped Firefox OS and now we have access to better technology
... more recently, we started the WebCodecs effort with Google; there is a separate breakout at 16:30 today
... we would like to coordinate across these two efforts
Peter: Mapping between Streams and MediaStreamTrack is in the WebCodecs proposal, but it doesn't have to be there. It could be done elsewhere for use cases that need it but don't need access to codecs.
dom: any other feedback from the Audio work that could apply to video processing?
paul: your API must be programmable
... Web Audio started as a node-composition data-flow model (see the sketch below)
... we had lots of nodes, and that became an issue
... let's also adopt a model that acknowledges images are large, so we cannot afford to make redundant copies
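[Scribe note: an illustration of the node-composition model paul refers to, using the standard Web Audio API; micStream stands in for an audio stream obtained via getUserMedia.]

    const ctx = new AudioContext();
    const source = ctx.createMediaStreamSource(micStream);  // micStream from getUserMedia
    const filter = ctx.createBiquadFilter();                // one node in the graph...
    const gain = ctx.createGain();                          // ...stacked onto another
    source.connect(filter).connect(gain).connect(ctx.destination);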
See Insertable Media Processing slides
[harald showing slides]
harald: trying to make life easier for people who use RTCPeerConnection
... if I want to solve one problem only, then this is the important one
[harald showing requirements, scribe assumes slides will be made available, so won't scribe the slide content]
harald: we have a solution that works, as was demonstrated: use media and canvas elements (see the sketch below)
... we should be able to do better; with those primitives, browsers cannot optimize
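[Scribe note: the media-and-canvas workaround harald refers to, sketched with standard APIs; the per-frame getImageData/putImageData copies are precisely the overhead browsers cannot optimize away.]

    const video = document.querySelector('video');   // playing a getUserMedia stream
    const canvas = document.createElement('canvas');
    canvas.width = 640; canvas.height = 480;
    const ctx2d = canvas.getContext('2d');

    function process() {
      ctx2d.drawImage(video, 0, 0, canvas.width, canvas.height);
      const frame = ctx2d.getImageData(0, 0, canvas.width, canvas.height);
      // ... mutate frame.data (RGBA bytes) here ...
      ctx2d.putImageData(frame, 0, 0);
      requestAnimationFrame(process);
    }
    process();
    const processedStream = canvas.captureStream(30);  // can feed a PeerConnection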
... existing apps that want to add an image processing step should not need to redesign everything
... we believe composable objects are the future on the web
... when we see pieces going in the right direction, we should harmonize
[harald showing drawing of WebRTC Streams flow]
harald: let's give web apps the ability to insert processing between the getUserMedia and PeerConnection blocks
... the WebCodecs proposal says the encoder is a Stream
... WHATWG Streams are a concept that can be piped
... this is exactly what we need; how can I put this onto the web with PeerConnection?
... let's do that at insert time with a factory pattern
[harald explaining a code example with encoderFactory]
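[Scribe note: the slide code was not captured; below is a hypothetical reconstruction of the factory-pattern hook. Only the encoderFactory name comes from the slides; the option shape, the encoder object, and the frame type are illustrative.]

    const pc = new RTCPeerConnection({
      // Hypothetical hook: wrap the platform encoder's output stream
      // in an app-supplied TransformStream before it reaches the network.
      encoderFactory(platformEncoder) {
        const inspect = new TransformStream({
          transform(encodedFrame, controller) {
            // e.g. read or rewrite encoded frame payloads here
            controller.enqueue(encodedFrame);
          }
        });
        return {
          ...platformEncoder,
          readable: platformEncoder.readable.pipeThrough(inspect)
        };
      }
    });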
harald: JS has some properties done right
... the processing step can be offloaded to a worker that can read/write the stream (see the sketch after this turn)
... the goal of the project is to say: given a Stream, add this translation into it -- seems like magic?
... the end
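[Scribe note: a sketch of the worker offload harald describes, assuming transferable WHATWG streams; getFrameStreams is a hypothetical stand-in for whatever pair of stream endpoints the pipeline exposes.]

    // main thread: transfer both ends of the pipe to a worker
    const { readable, writable } = getFrameStreams();  // hypothetical source/sink pair
    const worker = new Worker('process.js');
    worker.postMessage({ readable, writable }, [readable, writable]);

    // process.js: run the transform entirely off the main thread
    onmessage = ({ data }) => {
      data.readable
        .pipeThrough(new TransformStream({
          transform(frame, controller) {
            // ... process frame here ...
            controller.enqueue(frame);
          }
        }))
        .pipeTo(data.writable);
    };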
chris: for processing in the media element, is there a similar solution to consider, e.g. with MSE?
harald: this is only for media that arrives via PeerConnection
... a similar type of thinking could be applied there, though not exactly this solution
Francois: To reformulate, the proposal is to create a hook into PeerConnection to plug in WebCodecs. A similar hook is needed in MSE and other APIs.
Jan-Ivar: instead of putting this in sinks, the common denominator is MediaStreamTrack (?)
... audio worklets need their own context running off the main thread
harald: this should be part of the worker, to solve that architectural issue
... what does PeerConnection actually do with Workers? a bit complex
... APIs might allow codec switching on the fly
Youenn_Fablet_Apple: is this specific to WebRTC? Is MediaRecorder in scope of the proposal?
harald: when working with PeerConnection, porting over to MediaRecorder et al. is doable
Youenn_Fablet_Apple: funny hats would probably happen pre-encoding
... WebRTC questions: between PeerConnection and the codec, what happens at the network level? two parameters depend on the network connection; how to deal with that?
harald: the PeerConnection framework would send the messages; if it needs to modify things, it can
Emad_Omara_Google: changing format after encoding?
harald: we haven't defined the format yet, so can do anything
Jan-Ivar: would a different API like VideoWorklet be a better abstraction?
Youenn_Fablet_Apple: the piped Stream model was great, similar to the audio graph; any learnings from that?
paul: cloning is an area of optimization, as are separable filters
tidoust: 2 minutes left! Where can we continue this discussion?
... WebCodecs seems to be part of the solution, but a generic solution may have to encompass other building blocks; we haven't discussed exposing to the GPU, for instance
... there is no obvious group right now; the WebRTC, Audio, and Media groups are available but their attention is split. Do we want a dedicated CG for incubating the architecture?
harald: we have a point problem and a point solution, but there is a field of problems that needs a field of solutions
tidoust: any suggestions on a venue?
... the Media WG could be it, but this is not in its scope; maybe the M&E IG?
chris: we could consider reformulating the IG to be the place
tidoust: most stakeholders in this room are in the Media WG
... thank you all!
... thanks Anssi for scribing!!!