This talk provides an overview of existing, planned or possible hooks for processing muxed and demuxed media in real time or faster than real time in Web applications, and rendering the results. It also presents high-level requirements for efficient media processing.

W3C workshop on
Web and Machine Learning

Media processing hooks for the Web

François Daoust – @tidoust
Summer 2020

Main media scenarios

① Progressive media playback
Playback of media files
<audio>, <video>, HTMLMediaElement in HTML
② Adaptive media playback
Professional/Commercial playback
Media Source Extensions (MSE)
③ Real-time conversations
Audio/Video conferencing
WebRTC
④ Synthesized audio playback
Music generation, Sound effects
Web Audio API

Media content

① Progressive media playback
Media container file (e.g. MP4, OGG, WebM)
Multiplex of encoded audio/video tracks (e.g. H.264, AV1, AAC)
② Adaptive media playback
Media stream (e.g. ISOBMFF, MPEG2-TS, WebM)
Data segments assembled on the fly
③ Real-time conversations
Individual encoded/decoded Media stream tracks
Coupling between encoding/decoding and transport
④ Synthesized audio playback
Short encoded/decoded audio samples

Media encoding/decoding pipeline

In a typical media pipeline, media content first gets recorded from a camera or microphone, producing a stream of raw audio/video frames. These raw frames are then encoded to save memory and bandwidth, and multiplexed (also known as muxed) to mix related audio/video/text tracks, before the result is sent over the network, either directly to a receiving peer or to a server. The decoding side is roughly symmetric: media content needs to be fetched, demuxed, decoded to produce raw audio and video frames, and rendered to the final display or speakers.
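
For the encoding side, here is a minimal sketch with standard Web APIs: getUserMedia covers the record operation, MediaRecorder covers encode and mux (sendToServer is a hypothetical application function):

  declare function sendToServer(data: Blob): void; // hypothetical upload function

  // Record: capture raw audio/video frames from the microphone and camera
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true, video: true });

  // Encode + mux: MediaRecorder produces chunks of a muxed WebM stream
  const recorder = new MediaRecorder(stream, { mimeType: 'video/webm' });
  recorder.ondataavailable = (event) => {
    // Send: each chunk contains encoded, muxed data ready for the network
    sendToServer(event.data);
  };
  recorder.start(1000); // emit a chunk roughly every second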

Media processing scenarios

Media stream analysis
  • Barcode reading
  • Face recognition
  • Gesture/Presence tracking
  • Emotion analysis
  • Speech recognition
  • Depth data streams processing
Media stream manipulation
  • Funny hats
  • Voice effects
  • Background removal or blurring
  • In-browser composition
  • Augmented reality
  • Non-linear video editing

Processing hooks needed!

Most media processing scenarios need processing hooks either on the encoder side between the record and encode operations, or on the decoder side between the decode and render operations.

Existing hooks for…
① progressive media playback

HTMLMediaElement takes the URL of a media file and renders audio and video frames. The browser does all the work (fetch, demux, decode, render). In essence, the media element does not expose any hook to process encoded or decoded frames.
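
In code, playback boils down to pointing the element at a URL (movie.mp4 is a placeholder):

  // The browser handles fetch, demux, decode and render internally
  const video = document.createElement('video');
  video.src = 'movie.mp4'; // placeholder media file URL
  document.body.appendChild(video);
  video.play();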

No hooks for progressive media playback…

Existing hooks for…
① progressive media playback

One can still process rendered video frames by repeatedly drawing them onto a canvas to access the actual pixels, processing the contents of the canvas, and rendering the result onto a final canvas.

No hooks for progressive media playback…
… but you can create one for video frames with <canvas>.
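
A minimal sketch of that workaround, assuming a hypothetical processPixels function that manipulates RGBA pixels in place (note that requestAnimationFrame drives the loop but is not synchronized with the video's frame rate):

  declare function processPixels(pixels: Uint8ClampedArray): void; // hypothetical

  const video = document.querySelector('video')!;
  const display = document.querySelector('canvas')!; // final canvas shown to the user
  const buffer = document.createElement('canvas');   // intermediate canvas to read pixels
  buffer.width = video.videoWidth;
  buffer.height = video.videoHeight;
  const bufferCtx = buffer.getContext('2d')!;
  const displayCtx = display.getContext('2d')!;

  function processFrame() {
    bufferCtx.drawImage(video, 0, 0, buffer.width, buffer.height);
    const frame = bufferCtx.getImageData(0, 0, buffer.width, buffer.height);
    processPixels(frame.data); // in-place processing of the pixels
    displayCtx.putImageData(frame, 0, 0);
    requestAnimationFrame(processFrame);
  }
  requestAnimationFrame(processFrame);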

Existing hooks for…
② adaptive media playback

MSE allows applications to take control of the fetch and demux operations. The rest of the pipeline remains handled by HTMLMediaElement.

No hooks for adaptive media playback…
… but you can also use the <canvas> workaround.
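
A minimal MSE sketch; the segment URL and codec string are assumptions, and a real player would select segments based on measured bandwidth:

  const video = document.querySelector('video')!;
  const mediaSource = new MediaSource();
  video.src = URL.createObjectURL(mediaSource);

  mediaSource.addEventListener('sourceopen', async () => {
    // The application controls fetch and assembles the media stream itself...
    const sourceBuffer = mediaSource.addSourceBuffer('video/mp4; codecs="avc1.42E01E"');
    const segment = await fetch('/segments/video-001.m4s'); // hypothetical segment URL
    sourceBuffer.appendBuffer(await segment.arrayBuffer());
    // ...but appended data goes straight back to the browser for decode and render
  });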

Existing hooks for…
③ real-time conversations

WebRTC takes care of the fetch and decode operations (there is no demux step in WebRTC scenarios). However, decoded frames remain opaque from an application perspective and, in practice, can only be attached to an HTMLMediaElement.

No hooks for WebRTC either…
… same <canvas> workaround possible.
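
For instance, receiving a remote track in practice means attaching its MediaStream to a media element:

  const pc = new RTCPeerConnection();
  pc.ontrack = (event) => {
    // Decoded frames remain opaque; the stream can only feed a media element
    const video = document.querySelector('video')!;
    video.srcObject = event.streams[0];
    // Pixels are then only reachable through the <canvas> workaround shown earlier
  };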

Existing hooks for…
④ synthesized audio playback

The Web Audio API is the notable exception: it decodes audio files into raw audio samples (e.g. with decodeAudioData) and gives applications direct access to those samples for processing within the audio graph, including custom processing through AudioWorklet.

Full processing hooks for synthesized audio playback!
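
A short sketch of this direct access to raw samples (effect.mp3 is a placeholder):

  const audioCtx = new AudioContext();

  // Decode an encoded audio file into raw audio samples
  const response = await fetch('effect.mp3'); // placeholder URL of a short audio sample
  const audioBuffer = await audioCtx.decodeAudioData(await response.arrayBuffer());

  // Raw samples are directly accessible for processing
  const samples = audioBuffer.getChannelData(0);
  for (let i = 0; i < samples.length; i++) {
    samples[i] *= 0.5; // e.g. halve the amplitude by hand
  }

  // Play the processed samples through the audio graph
  const sourceNode = audioCtx.createBufferSource();
  sourceNode.buffer = audioBuffer;
  sourceNode.connect(audioCtx.destination);
  sourceNode.start();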

Existing hooks…
summary

① Progressive media playback: no hooks (<canvas> workaround for video frames)
② Adaptive media playback: hooks on fetch and demux only (<canvas> workaround for the rest)
③ Real-time conversations: no hooks (<canvas> workaround)
④ Synthesized audio playback: full processing hooks through the Web Audio API

Media pipeline in JavaScript / WebAssembly

The entire media pipeline may also be implemented in JavaScript / WebAssembly, rendering the result to a canvas.
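
A rough sketch under heavy assumptions: instantiateDecoder and decodeFrame stand for a hypothetical demuxer/decoder compiled to WebAssembly (e.g. from an existing C library):

  // Hypothetical Wasm module bundling the demux and decode operations
  declare function instantiateDecoder(): Promise<{
    decodeFrame(data: ArrayBuffer): ImageData | null;
  }>;

  const { decodeFrame } = await instantiateDecoder();
  const ctx = document.querySelector('canvas')!.getContext('2d')!;

  // The application controls the fetch operation as well
  const media = await (await fetch('movie.mp4')).arrayBuffer(); // placeholder URL

  // Decode frames one by one in Wasm, then render them to the canvas
  let frame: ImageData | null;
  while ((frame = decodeFrame(media))) {
    ctx.putImageData(frame, 0, 0);
    await new Promise(requestAnimationFrame); // wait for the next paint opportunity
  }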

Main requirements for efficient media processing

Raw data processing needs:
1 video frame of HD video ≈ 1920×1080 pixels × 3 components × 4 bytes ≈ 24 MB
1 second of HD video at 25 frames per second ≈ 25 × 24 MB ≈ 600 MB
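
Spelled out, with the slide's rounding and assuming 25 frames per second:

  const bytesPerFrame = 1920 * 1080 * 3 * 4;  // 24,883,200 bytes ≈ 24 MB
  const bytesPerSecond = 25 * bytesPerFrame;  // 622,080,000 bytes ≈ 600 MB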

WebCodecs

WebCodecs allows applications to take control of the decode operation in the decoding pipeline and to feed the result into a canvas. As opposed to WebRTC, applications would be able to access decoded frames for processing.
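
A sketch of a WebCodecs decoding loop; the API was still being designed at the time of this talk, so exact shapes may differ, and the codec string and the app-level demuxer producing encodedChunks are assumptions:

  declare const encodedChunks: EncodedVideoChunk[]; // from a hypothetical app-level demuxer

  const ctx = document.querySelector('canvas')!.getContext('2d')!;

  const decoder = new VideoDecoder({
    output: (frame) => {
      // Decoded frames are exposed to the application: process and/or draw them
      ctx.drawImage(frame, 0, 0);
      frame.close(); // release the frame's memory as soon as possible
    },
    error: (e) => console.error(e)
  });

  decoder.configure({ codec: 'vp8' }); // assumed codec
  for (const chunk of encodedChunks) {
    decoder.decode(chunk);
  }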

WebCodecs — status

At the time of this talk, WebCodecs is an early-stage proposal, incubated in the W3C's Web Platform Incubator Community Group (WICG), and not yet shipped in browsers.

Other media features and specs that may impact processing

See the Overview of Media Technologies for the Web document for details

Conclusion

Thank you!