A report from Dom and François on their explorations of video processing during Geek Week 2022.


Processing video streams

Geek Week 2022 explorations

Dominique Hazaël-Massieux
François Daoust

Initial goals

  1. Play with real-time processing of video frames
  2. Get hands-on with recent Web technologies
  3. Better understand discussions that mention esoteric concepts such as jitter buffers or per-frame QP
  4. Investigate synchronization of audio and video streams

Actual achievements

  1. Play with real-time processing of video frames
    ➜ Building blocks to create a video processing pipeline
  2. Get hands-on with recent Web technologies
    ➜ WebCodecs, MediaStreamTrack Insertable Media Processing using Streams, WebGPU, Streams, workers, WebTransport
  3. Better understand discussions that mention esoteric concepts such as jitter buffers or per-frame QP
    ➜ Mechanism to measure processing times per frame
  4. Investigate synchronization of audio and video streams

Why process media streams?

  • Media stream analysis
  • Media stream manipulation

APIs hide pixels by default

  • Too much memory:
    1 raw HD video frame ≈ 1920 × 1080 pixels × 4 bytes ≈ 8 MB
    1 second of raw HD video ≈ 25 × 8 MB ≈ 200 MB
  • Not a single pixel format:
    memory layout (YUV, ARGB, RGBA), color depth (HDR/SDR), etc.
  • Not always readily available for exposure to JavaScript:
    browser-bound CPU memory / GPU memory
  • Not always in direct control of the browser:
    hardware-based decoding

WebCodecs

WebCodecs defines raw media interfaces, in particular:

  • VideoFrame

The connection with WebRTC happens through the MediaStreamTrack Insertable Media Processing using Streams specification, which defines VideoTrackGenerator and MediaStreamTrackProcessor.
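
For instance, a minimal sketch of what VideoFrame gives access to: an application can copy the raw pixels of a frame into an ArrayBuffer (the frame is assumed to come from elsewhere, e.g. a camera or a decoder):

  // Copy the raw pixels of a VideoFrame into an ArrayBuffer.
  async function readPixels(frame) {
    const format = frame.format;                  // pixel format, e.g. "NV12" or "RGBA"
    const buffer = new ArrayBuffer(frame.allocationSize());
    const layout = await frame.copyTo(buffer);    // offset and stride of each plane
    frame.close();                                // release the frame's memory early
    return { format, buffer, layout };
  }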

General concept

Input

Generate a stream of VideoFrame objects.

Process each VideoFrame

Manipulate the bytes exposed by the VideoFrame object, using JavaScript, WebAssembly, WebGPU, WebNN...

Output

A stream of processed VideoFrame objects.
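
Put together, and assuming browser support for MediaStreamTrackProcessor and VideoTrackGenerator, the concept translates into a few lines of Streams plumbing (the "transformer" TransformStream is assumed here and defined on later slides):

  // Input: expose camera frames as a ReadableStream of VideoFrame.
  const media = await navigator.mediaDevices.getUserMedia({ video: true });
  const processor = new MediaStreamTrackProcessor({
    track: media.getVideoTracks()[0]
  });

  // Output: a generator whose writable side accepts the processed frames.
  const generator = new VideoTrackGenerator();

  // Process: pipe the frames through a TransformStream.
  processor.readable.pipeThrough(transformer).pipeTo(generator.writable);

  // Render the resulting track in a <video> element.
  document.querySelector('video').srcObject = new MediaStream([generator.track]);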

Some abbreviations

VTG = VideoTrackGenerator

MSTP = MediaStreamTrackProcessor

TS = TransformStream

WT = WebTransport

JS = JavaScript

Available dominoes

  • VideoFrame stream
  • getUserMedia()
  • VTG
  • MSTP
  • VideoEncoder TS
  • VideoDecoder TS
  • WTSendStream
  • WTReceiveStream
  • <video>
  • JS + <canvas>

Stream connectors

Construct                                      WHATWG Stream?
VideoFrame (in WebCodecs)                      None
Stream of VideoFrame (used by VTG, MSTP)       Yes
Stream of encoded chunks (VideoEncoder + TS)   Yes
MediaStreamTrack (in WebRTC)                   No

Why the difference?

  • Requirements: Streams propagate backpressure signals through the chain. Some media processing scenarios require additional control signals (e.g. configure, flush, reset).
  • Complexity: WHATWG Streams are sometimes seen as too complicated.
  • History: WHATWG Streams did not exist when WebRTC started.

Creating a stream of video frames

From scratch:

VideoFrame stream

From camera:

getUserMedia() ➜ MSTP

From a received stream:

WTReceiveStream ➜ VideoDecoder TS

Notes:
- RTCDataChannel could be used as well.
- Creating frames from containerized media is also possible, but there is no API to demux the media into a stream of encoded chunks, so demuxing is left to the application.
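
For illustration, hedged sketches of the first two options (the canvas element and the pull-based pacing are illustrative choices):

  // From scratch: wrap canvas snapshots in a ReadableStream of VideoFrame.
  const canvas = document.querySelector('canvas');
  const fromScratch = new ReadableStream({
    pull(controller) {
      // pull() is only called again when the consumer is ready (backpressure)
      controller.enqueue(new VideoFrame(canvas, {
        timestamp: performance.now() * 1000       // timestamps are in microseconds
      }));
    }
  });

  // From camera: getUserMedia() then MediaStreamTrackProcessor.
  const media = await navigator.mediaDevices.getUserMedia({ video: true });
  const fromCamera = new MediaStreamTrackProcessor({
    track: media.getVideoTracks()[0]
  }).readable;                                    // ReadableStream of VideoFrame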

Processing video frames

The idea is to use a TransformStream that takes a stream of video frames as input and produces another stream of video frames as output.

TransformStream

Processing can be chained if needed:

TransformStream ➜ TransformStream ➜ TransformStream

Actual transformation logic can use pure JavaScript, WebGPU, WebNN, WebAssembly, etc.
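
A sketch of one such transform, using a 2D canvas to burn an overlay into each frame (the overlay itself is illustrative; the same structure applies to WebGPU or WebAssembly processing):

  // TransformStream that paints each frame to an OffscreenCanvas,
  // adds an overlay, and emits a new VideoFrame built from the canvas.
  function createOverlayTransform(width, height) {
    const canvas = new OffscreenCanvas(width, height);
    const ctx = canvas.getContext('2d');
    return new TransformStream({
      transform(frame, controller) {
        ctx.drawImage(frame, 0, 0, width, height);
        ctx.fillStyle = 'red';
        ctx.fillRect(10, 10, 120, 30);         // illustrative overlay
        const timestamp = frame.timestamp;     // preserve presentation time
        frame.close();                         // always close frames you consume
        controller.enqueue(new VideoFrame(canvas, { timestamp }));
      }
    });
  }

  // Chaining is plain Streams plumbing:
  // input.pipeThrough(createOverlayTransform(1280, 720)).pipeTo(output);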

Sending/Rendering final video

Render to display:

VTG ➜ <video>

Or:

JS + <canvas>

... but implementing a full video player in JavaScript is no easy task! (e.g. sync, accessibility, controls)

Send somewhere:

VideoEncoder TS ➜ WTSendStream

Note: RTCDataChannel could be used as well.
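
A rough sketch of the sending path, wrapping a VideoEncoder in a TransformStream and shipping the chunks over WebTransport (the codec configuration, URL, and absence of message framing are all illustrative simplifications):

  // Wrap a VideoEncoder in a TransformStream:
  // VideoFrame in, EncodedVideoChunk out.
  function createEncoderTransform(config) {
    let encoder;
    return new TransformStream({
      start(controller) {
        encoder = new VideoEncoder({
          output: chunk => controller.enqueue(chunk),
          error: err => controller.error(err)
        });
        encoder.configure(config);
      },
      transform(frame) {
        encoder.encode(frame);
        frame.close();
      },
      flush() { return encoder.flush(); }
    });
  }

  // Ship the chunks over a WebTransport send stream.
  const wt = new WebTransport('https://example.org/video');  // illustrative URL
  await wt.ready;
  const sendStream = await wt.createUnidirectionalStream();
  const writer = sendStream.getWriter();

  frameStream                                  // assumed: a stream of VideoFrame
    .pipeThrough(createEncoderTransform({
      codec: 'avc1.42E01E', width: 1280, height: 720       // H.264 Baseline
    }))
    .pipeTo(new WritableStream({
      async write(chunk) {
        const bytes = new Uint8Array(chunk.byteLength);
        chunk.copyTo(bytes);                   // serialize the chunk payload
        await writer.write(bytes);             // no framing: sketch only
      }
    }));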

Demo

Demo: https://tidoust.github.io/media-tests/
Code: https://github.com/tidoust/media-tests/

Measuring jitter

Time stats

Typical run with overlay and H.264 encoding/decoding (times in milliseconds):

Timer             Frames   Min.   Max.   Avg.   Median
overlay           104      4      45     10     7
encoding          104      15     245    20     17
decoding          104      1      23     1      1
queued            104      0      288    26     13
end2end           104      24     338    58     151
displayed during  101      20     59     38     39
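
The measurement mechanism can be as simple as pass-through transforms that record timestamps around each step, keyed by frame timestamp (a hedged sketch, not the exact code behind the table above):

  // Pass-through transforms that record how long a step takes per frame.
  const stats = new Map();   // frame timestamp -> { step: duration in ms }

  function markStart(step) {
    return new TransformStream({
      transform(frame, controller) {
        const entry = stats.get(frame.timestamp) ?? {};
        entry[`${step}Start`] = performance.now();
        stats.set(frame.timestamp, entry);
        controller.enqueue(frame);
      }
    });
  }

  function markEnd(step) {
    return new TransformStream({
      transform(frame, controller) {
        const entry = stats.get(frame.timestamp);
        entry[step] = performance.now() - entry[`${step}Start`];
        controller.enqueue(frame);
      }
    });
  }

  // Usage:
  // input.pipeThrough(markStart('overlay'))
  //      .pipeThrough(overlayTransform)
  //      .pipeThrough(markEnd('overlay'))
  //      .pipeTo(output);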

Key takeaway

Interconnecting the different APIs is not straightforward.

Other takeaways

Related discussions

The Media WG and the WebRTC WG have started to discuss joint architectural considerations for the evolution of the media pipeline on the Web.

Repository:
https://github.com/w3c/media-pipeline-arch

Thanks!