This talk provides an overview of existing, planned or possible hooks for processing muxed and demuxed media in real time or faster than real time in Web applications, and rendering the results. It also presents high-level requirements for efficient media processing.

W3C workshop on
Web and Machine Learning

Media processing hooks for the Web

François Daoust – @tidoust
Summer 2020

Main media scenarios

① Progressive media playback
Playback of media files
<audio>, <video>, HTMLMediaElement in HTML
② Adaptive media playback
Professional/Commercial playback
Media Source Extensions (MSE)
③ Real-time conversations
Audio/Video conferencing
WebRTC
④ Synthesized audio playback
Music generation, Sound effects
Web Audio API

Media content

① Progressive media playback
Media container file (e.g. MP4, OGG, WebM)
Multiplex of encoded audio/video tracks (e.g. H.264, AV1, AAC)
② Adaptive media playback
Media stream (e.g. ISOBMFF, MPEG2-TS, WebM)
Data segments assembled on the fly
③ Real-time conversations
Individual encoded/decoded Media stream tracks
Coupling between encoding/decoding and transport
④ Synthesized audio playback
Short encoded/decoded audio samples

Media encoding/decoding pipeline

In a typical media pipeline, media content first gets recorded from a camera or microphone, producing a stream of raw audio/video frames. These raw frames are then encoded to save memory and bandwidth, and multiplexed (also known as muxed) to mix related audio/video/text tracks, before the result is sent over the network, either directly to a receiving peer or to a server. The decoding side is roughly symmetric: media content needs to be fetched, demuxed, decoded to produce raw audio and video frames, and rendered to the final display or speakers.
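
For the encoding side, here is a minimal sketch with standard Web APIs: getUserMedia covers the record operation, MediaRecorder covers encode and mux (sendToServer is a hypothetical application function):

  declare function sendToServer(data: Blob): void; // hypothetical upload function

  // Record: capture raw audio/video frames from the microphone and camera
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true, video: true });

  // Encode + mux: MediaRecorder produces chunks of a muxed WebM stream
  const recorder = new MediaRecorder(stream, { mimeType: 'video/webm' });
  recorder.ondataavailable = (event) => {
    // Send: each chunk contains encoded, muxed data ready for the network
    sendToServer(event.data);
  };
  recorder.start(1000); // emit a chunk roughly every second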

Media processing scenarios

Media stream analysis
  • Barcode reading
  • Face recognition
  • Gesture/Presence tracking
  • Emotion analysis
  • Speech recognition
  • Depth data streams processing
Media stream manipulation
  • Funny hats
  • Voice effects
  • Background removal or blurring
  • In-browser composition
  • Augmented reality
  • Non-linear video editing

Processing hooks needed!

Most media processing scenarios need processing hooks either on the encoder side between the record and encode operations, or on the decoder side between the decode and render operations.

Existing hooks for…
① progressive media playback

HTMLMediaElement takes the URL of a media file and renders audio and video frames. The browser does all the work (fetch, demux, decode, render). In essence, the media element does not expose any hook to process encoded or decoded frames.
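
In code, playback boils down to pointing the element at a URL (movie.mp4 is a placeholder):

  // The browser handles fetch, demux, decode and render internally
  const video = document.createElement('video');
  video.src = 'movie.mp4'; // placeholder media file URL
  document.body.appendChild(video);
  video.play();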

No hooks for progressive media playback…

Existing hooks for…
① progressive media playback

One can still process rendered video frames by repeatedly drawing them onto a canvas to access the actual pixels, processing the contents of the canvas, and rendering the result onto a final canvas.

No hooks for progressive media playback…
… but you can create one for video frames with <canvas>.
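
A minimal sketch of that workaround, assuming a hypothetical processPixels function that manipulates RGBA pixels in place (note that requestAnimationFrame drives the loop but is not synchronized with the video's frame rate):

  declare function processPixels(pixels: Uint8ClampedArray): void; // hypothetical

  const video = document.querySelector('video')!;
  const display = document.querySelector('canvas')!; // final canvas shown to the user
  const buffer = document.createElement('canvas');   // intermediate canvas to read pixels
  buffer.width = video.videoWidth;
  buffer.height = video.videoHeight;
  const bufferCtx = buffer.getContext('2d')!;
  const displayCtx = display.getContext('2d')!;

  function processFrame() {
    bufferCtx.drawImage(video, 0, 0, buffer.width, buffer.height);
    const frame = bufferCtx.getImageData(0, 0, buffer.width, buffer.height);
    processPixels(frame.data); // in-place processing of the pixels
    displayCtx.putImageData(frame, 0, 0);
    requestAnimationFrame(processFrame);
  }
  requestAnimationFrame(processFrame);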

Existing hooks for…
② adaptive media playback

MSE allows applications to take control of the fetch and demux operations. The rest of the pipeline remains handled by HTMLMediaElement.

No hooks for adaptive media playback…
… but you can also use the <canvas> workaround.
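
A minimal MSE sketch; the segment URL and codec string are assumptions, and a real player would select segments based on measured bandwidth:

  const video = document.querySelector('video')!;
  const mediaSource = new MediaSource();
  video.src = URL.createObjectURL(mediaSource);

  mediaSource.addEventListener('sourceopen', async () => {
    // The application controls fetch and assembles the media stream itself...
    const sourceBuffer = mediaSource.addSourceBuffer('video/mp4; codecs="avc1.42E01E"');
    const segment = await fetch('/segments/video-001.m4s'); // hypothetical segment URL
    sourceBuffer.appendBuffer(await segment.arrayBuffer());
    // ...but appended data goes straight back to the browser for decode and render
  });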

Existing hooks for…
③ real-time conversations

WebRTC takes care of the fetch and decode operations (there is no demux step in WebRTC scenarios). However, decoded frames remain opaque from an application perspective and, in practice, can only be attached to an HTMLMediaElement.

No hooks for WebRTC either…
… same <canvas> workaround possible.
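
For instance, receiving a remote track in practice means attaching its MediaStream to a media element:

  const pc = new RTCPeerConnection();
  pc.ontrack = (event) => {
    // Decoded frames remain opaque; the stream can only feed a media element
    const video = document.querySelector('video')!;
    video.srcObject = event.streams[0];
    // Pixels are then only reachable through the <canvas> workaround shown earlier
  };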

Existing hooks for…
④ synthesized audio playback

The Web Audio API is the notable exception: it decodes audio files into raw audio samples (e.g. with decodeAudioData) and gives applications direct access to those samples for processing within the audio graph, including custom processing through AudioWorklet.

Full processing hooks for synthesized audio playback!
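
A short sketch of this direct access to raw samples (effect.mp3 is a placeholder):

  const audioCtx = new AudioContext();

  // Decode an encoded audio file into raw audio samples
  const response = await fetch('effect.mp3'); // placeholder URL of a short audio sample
  const audioBuffer = await audioCtx.decodeAudioData(await response.arrayBuffer());

  // Raw samples are directly accessible for processing
  const samples = audioBuffer.getChannelData(0);
  for (let i = 0; i < samples.length; i++) {
    samples[i] *= 0.5; // e.g. halve the amplitude by hand
  }

  // Play the processed samples through the audio graph
  const sourceNode = audioCtx.createBufferSource();
  sourceNode.buffer = audioBuffer;
  sourceNode.connect(audioCtx.destination);
  sourceNode.start();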

Existing hooks…
summary

① Progressive media playback: no hooks (<canvas> workaround for video frames)
② Adaptive media playback: hooks on fetch and demux only (<canvas> workaround for the rest)
③ Real-time conversations: no hooks (<canvas> workaround)
④ Synthesized audio playback: full processing hooks through the Web Audio API

Media pipeline in JavaScript / WebAssembly

The entire media pipeline may also be implemented in JavaScript / WebAssembly, rendering the result to a canvas.
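
A rough sketch under heavy assumptions: instantiateDecoder and decodeFrame stand for a hypothetical demuxer/decoder compiled to WebAssembly (e.g. from an existing C library):

  // Hypothetical Wasm module bundling the demux and decode operations
  declare function instantiateDecoder(): Promise<{
    decodeFrame(data: ArrayBuffer): ImageData | null;
  }>;

  const { decodeFrame } = await instantiateDecoder();
  const ctx = document.querySelector('canvas')!.getContext('2d')!;

  // The application controls the fetch operation as well
  const media = await (await fetch('movie.mp4')).arrayBuffer(); // placeholder URL

  // Decode frames one by one in Wasm, then render them to the canvas
  let frame: ImageData | null;
  while ((frame = decodeFrame(media))) {
    ctx.putImageData(frame, 0, 0);
    await new Promise(requestAnimationFrame); // wait for the next paint opportunity
  }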

Main requirements for efficient media processing

Raw data processing needs:
1 video frame of HD video ≈ 1920×1080 pixels × 3 components × 4 bytes ≈ 24 MB
1 second of HD video at 25 frames per second ≈ 25 × 24 MB ≈ 600 MB
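
Spelled out, with the slide's rounding and assuming 25 frames per second:

  const bytesPerFrame = 1920 * 1080 * 3 * 4;  // 24,883,200 bytes ≈ 24 MB
  const bytesPerSecond = 25 * bytesPerFrame;  // 622,080,000 bytes ≈ 600 MB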

WebCodecs

WebCodecs allows applications to take control of the decode operation in the decoding pipeline and to feed the result into a canvas. As opposed to WebRTC, applications would be able to access decoded frames for processing.
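
A sketch of a WebCodecs decoding loop; the API was still being designed at the time of this talk, so exact shapes may differ, and the codec string and the app-level demuxer producing encodedChunks are assumptions:

  declare const encodedChunks: EncodedVideoChunk[]; // from a hypothetical app-level demuxer

  const ctx = document.querySelector('canvas')!.getContext('2d')!;

  const decoder = new VideoDecoder({
    output: (frame) => {
      // Decoded frames are exposed to the application: process and/or draw them
      ctx.drawImage(frame, 0, 0);
      frame.close(); // release the frame's memory as soon as possible
    },
    error: (e) => console.error(e)
  });

  decoder.configure({ codec: 'vp8' }); // assumed codec
  for (const chunk of encodedChunks) {
    decoder.decode(chunk);
  }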

WebCodecs — status

At the time of this talk, WebCodecs is an early-stage proposal, incubated in the W3C's Web Platform Incubator Community Group (WICG), and not yet shipped in browsers.

Other media features and specs that may impact processing

See the Overview of Media Technologies for the Web document for details

Conclusion

Thank you!