Report

Executive summary

W3C and SMPTE organized a Workshop on Professional Media Production on the Web over the course of October and November 2021. This workshop connected the web platform and the professional media production communities and explored evolutions of the web platform to address professional media production requirements.

This virtual workshop kicked off with the publication of 24 workshop talks in October 2021, covering a wide range of media production topics. These perspectives were carefully evaluated, leading to the creation of about 40 GitHub issues, discussed online in early November. The workshop culminated in a series of 3 live sessions in mid-November 2021 that convened more than 75 experts to discuss specific media production needs for the web platform. Not every topic could be discussed during the live sessions.

The main outcomes are that:

  1. The web platform already provides building blocks to enable core media production scenarios.
  2. These building blocks are not powerful enough to create full-fledged experiences on client devices (see proxy-based and no-proxy architectures).
  3. Most of the gaps raised during the workshop touch on API features in specifications that are already being developed. There is, however, benefit in coordinating efforts to make sure that media production needs are correctly captured and addressed in ongoing standardization activities.

Workshop discussions call for a more in-depth analysis of some of the topics, and workshop participants propose the creation of a Media Production Task Force that the Media & Entertainment Interest Group could host. The Task Force would be scoped to professional media production using the web platform, and responsible for documenting use cases and needs specific to professional media production, quantifying performance issues, promoting proposals to working groups and implementers, and tracking standardization progress and implementations.

Introduction

W3C and SMPTE hold workshops to discuss a specific space from various perspectives, identify needs that could warrant standardization efforts, and assess support and priorities among relevant communities. The Workshop on Professional Media Production on the Web took place in October and November 2021. The main goal of the workshop was to connect the web platform and the professional media production communities and explore evolutions of the web platform to address professional media production requirements. The workshop was held as a virtual event with a combination of pre-recorded talks, online discussions on GitHub, and a series of 3 live sessions to dig into specific media production needs for the web platform.

This report summarizes topics discussed during the live sessions, reviews topics that could not be discussed for lack of time, and proposes next steps.

Setting the context

The web has become a major platform for consuming media. Web technologies at the heart of this revolution (such as media elements in HTML, Media Source Extensions, WebVTT, etc.) are progressively being extended or completed with additional technologies such as WebCodecs to provide web applications with finer-grained control over media experiences.

Meanwhile, storage and processing of movie and TV production assets have moved to the cloud. The web platform provides a natural environment to interact with these assets. Accordingly, there is a growing interest in building web applications that allow end-users to manipulate production assets, e.g. for editing, quality checking, versioning, or timed text authoring.

Professional applications require additional capabilities, including precise timing, high-fidelity timed text, efficient media processing solutions, wide color gamut and high-dynamic range, etc. Exact capabilities depend on the architecture being considered:

  1. In a proxy-based architecture, production assets remain in the cloud. Client devices act as remote controllers for processing operations and operate on lower-resolution versions of the media assets stored in the cloud.
  2. In a no-proxy architecture, processing of media assets happens on the client device, which needs to process the high-resolution media assets directly and accurately.

This workshop explored specific capability requirements for media production in both architectures and evolutions of the web platform to address them.

Topics discussed during the live sessions

WebCodecs

In his introductory talk on WebCodecs, Chris Cunningham asks the media production community about additional encoder options that it may need. Workshop participants suggest a quality control knob that would let applications hint to the browser that they would like to favor encoding quality over encoding latency. This seems doable. Discussion of possible API shapes is ongoing. Stakeholders are encouraged to subscribe and contribute.
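
As an illustration of what such a hint could look like at the application level, here is a minimal sketch assuming the latencyMode option that later WebCodecs drafts expose on VideoEncoderConfig; the exact API shape was still under discussion at the time of the workshop:

```ts
// Sketch only: the exact knob was under discussion at the time of the
// workshop; later WebCodecs drafts expose a `latencyMode` hint.
const encoder = new VideoEncoder({
  output: (chunk: EncodedVideoChunk, metadata?: EncodedVideoChunkMetadata) => {
    // Hand the encoded chunk to a muxer, or upload it.
  },
  error: (e: DOMException) => console.error(e),
});

encoder.configure({
  codec: "avc1.640028",   // H.264 High profile, level 4.0
  width: 1920,
  height: 1080,
  bitrate: 8_000_000,
  latencyMode: "quality", // hint: favor encoding quality over encoding latency
});
```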

WebCodecs gives applications decoding (and encoding) capabilities over the bitstream, but the bitstream is only available after demuxing, which WebCodecs leaves up to applications. A recurring ask from application developers is for a demuxing and muxing API. The browser already handles these steps when applications use Media Source Extensions or the decodeAudioData method in the Web Audio API. Chris Cunningham notes that applications may leverage existing libraries such as MP4Box.js or FFmpeg. That said, these libraries tend to be either too narrow or too broad for common cases, and integrating them into applications is hard. Paul Adenot notes that Firefox already demuxes content using WebAssembly code, with no noticeable performance impact, and that memory copies are hardly a concern for encoded streams. All in all, this space still needs to be explored. A dedicated open source muxing/demuxing library, perhaps based on the libavformat library, may be needed. Alternatively, WebCodecs could perhaps be extended with a demuxing/muxing API if the exploration reveals that demuxing/muxing at the application level is impractical.
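
For illustration, a minimal sketch of application-level demuxing today, combining MP4Box.js (from the mp4box npm package) with a WebCodecs VideoDecoder; the asset URL is a placeholder and codec-specific configuration details are elided:

```ts
import * as MP4Box from "mp4box"; // MP4Box.js, published without TypeScript types

const decoder = new VideoDecoder({
  output: (frame: VideoFrame) => { /* paint or process the frame */ frame.close(); },
  error: (e: DOMException) => console.error(e),
});

const mp4File = MP4Box.createFile();
mp4File.onReady = (info: any) => {
  const track = info.videoTracks[0];
  // Real code must also extract the codec-specific `description`
  // (e.g. the avcC box for H.264) and pass it to configure().
  decoder.configure({ codec: track.codec });
  mp4File.setExtractionOptions(track.id);
  mp4File.start();
};
mp4File.onSamples = (_trackId: number, _user: unknown, samples: any[]) => {
  for (const sample of samples) {
    decoder.decode(new EncodedVideoChunk({
      type: sample.is_sync ? "key" : "delta",
      timestamp: (sample.cts * 1_000_000) / sample.timescale, // microseconds
      duration: (sample.duration * 1_000_000) / sample.timescale,
      data: sample.data,
    }));
  }
};

// Feed the file to the demuxer; "asset.mp4" is a placeholder URL.
const buffer: any = await (await fetch("asset.mp4")).arrayBuffer();
buffer.fileStart = 0; // MP4Box.js needs the byte offset of each appended chunk
mp4File.appendBuffer(buffer);
mp4File.flush();
```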

Professional codecs used in media production workflows differ from those used in media distribution and include formats such as Apple ProRes or JPEG 2000. Could browsers support media production codecs? This seems hard to achieve. The list of codecs supported by a browser in WebCodecs will most likely match the list of codecs it supports for media playback, and browsers take many considerations into account when deciding whether to support a codec format. Alternatively, browsers could perhaps expose hooks to available codecs on the system. At a minimum, for this to be envisioned, a common abstraction layer needs to exist across codec libraries. James Pearce and Paul Adenot also point out that running third-party code in browsers may introduce security risks.

Metadata may appear at different layers (see the Metadata section below). At the codec level, metadata may appear in Supplemental Enhancement Information (SEI) messages. Could SEI messages be exposed by WebCodecs instead of requiring applications to parse the bitstream? Exact use cases, such as access to closed captions and HDR parameters, need to be investigated. The Media & Entertainment Interest Group organized a follow-up discussion in December to review a proposal from Yuhao Fu (ByteDance), and will follow up as part of its Media Timed Events Task Force.
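
To illustrate why leaving this to applications is cumbersome, here is a simplified sketch of locating SEI NAL units (type 6) in an H.264 Annex B byte stream; a real parser would also strip emulation prevention bytes and interpret each SEI payload per the codec specification:

```ts
// Simplified sketch: scan an H.264 Annex B byte stream for SEI NAL units
// (nal_unit_type 6). Emulation prevention bytes are not removed here, and the
// payloads still need to be interpreted (closed captions, HDR metadata, etc.).
function findSeiPayloads(bytes: Uint8Array): Uint8Array[] {
  const seis: Uint8Array[] = [];
  const isStartCode = (i: number) =>
    bytes[i] === 0 && bytes[i + 1] === 0 && bytes[i + 2] === 1;
  let i = 0;
  while (i + 3 < bytes.length) {
    if (!isStartCode(i)) { i++; continue; }
    const nalStart = i + 3;
    let next = nalStart;
    while (next < bytes.length && !isStartCode(next)) next++;
    const nalType = bytes[nalStart] & 0x1f; // 6 = SEI in H.264
    if (nalType === 6) seis.push(bytes.subarray(nalStart + 1, next));
    i = next;
  }
  return seis;
}
```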

Some media files are encoded using a variable bitrate. Nigel Megitt asks whether seeking to a specific time could be better supported in such cases. In general, there is no magic solution at the codec level. Mechanisms that could improve seeking are typically found at the container level. For example, MP3 files may contain a table of contents that applications could parse and use to locate the appropriate chunks right away.

A workshop participant asks about encoding/decoding support for audio samples and video frames that some codecs may require at the start of the media content to bootstrap the decoder, sometimes referred to as priming (in the audio domain) or pre-roll. Such samples and frames need to be decoded but skipped during playback. Paul Adenot explains that WebCodecs acts as a pass-through to underlying codecs APIs and that applications need to know and handle pre-roll requirements of the codecs that they are using. Additional tests would help explore practical implications at the application level.

Yuhao Fu points out that it is sometimes useful to retrieve decoded frames from the video element itself. Paul Adenot explains that, once standardized, the breakout box proposal, adopted by the WebRTC Working Group shortly after the workshop, could be used to construct a MediaStreamTrackProcessor on a MediaStream retrieved through the HTMLMediaElement.captureStream() method, giving applications access to decoded frames. Another option would be to extend the video.requestVideoFrameCallback() method to also return a VideoFrame object (defined in WebCodecs).
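
A rough sketch of the first option, under the assumption that the browser supports captureStream() and the breakout box API (MediaStreamTrackProcessor may still require ambient type declarations):

```ts
// Sketch: access decoded frames from a <video> element via captureStream()
// and a MediaStreamTrackProcessor (breakout box). Browser support varies.
const video = document.querySelector("video")!;
const stream: MediaStream = (video as any).captureStream();
const [track] = stream.getVideoTracks();

const processor = new MediaStreamTrackProcessor({ track });
const reader = processor.readable.getReader();

while (true) {
  const result = await reader.read();
  if (result.done) break;
  const frame = result.value; // a WebCodecs VideoFrame
  // ... process the frame (draw it, analyze it, re-encode it) ...
  frame.close();              // release the frame promptly
}
```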

Web Audio API

Once WebCodecs is widely supported, the decodeAudioData method could in theory be deprecated. That said, the decodeAudioData method has built-in support for demuxing, which is convenient in scenarios that need access to decoded audio samples, and methods rarely disappear from the web platform once they are widely deployed and used. The method should remain part of the web platform for the foreseeable future.
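
For reference, a minimal sketch of that convenience; the asset URL is a placeholder:

```ts
// decodeAudioData demuxes and decodes in a single step, returning PCM samples.
const audioContext = new AudioContext();
const encoded = await (await fetch("stems/dialogue.wav")).arrayBuffer();
const audioBuffer = await audioContext.decodeAudioData(encoded);
console.log(audioBuffer.numberOfChannels, audioBuffer.sampleRate, audioBuffer.duration);
```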

Audio accuracy is critical in professional audio workflows managed by Digital Audio Workstations (DAW), for instance to align content being recorded with content being played back and visualizations rendered on screen. This is easier said than done in the generic case, as it requires that the input latency, the intrinsic latency of audio nodes, and the output latency are all known. Relevant hooks are already specified but not supported in all browsers. Hongchan Choi shares Chrome's intent to ship support for outputLatency and the shape of a render capacity API that should soon be added to the Web Audio API. That said, Paul Adenot points out that a recurring issue is that the numbers reported by the system for input/output devices are often not reliable, making it hard to expose meaningful measures in browsers.
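
A minimal sketch of the relevant hooks as specified today; outputLatency is not available in every browser, hence the fallback:

```ts
// Sketch: estimate the delay between scheduling audio and hearing it.
// Reported values depend on what the underlying system exposes, which is
// precisely the reliability issue raised during the workshop.
const audioContext = new AudioContext({ latencyHint: "interactive" });
const outputDelaySeconds =
  audioContext.baseLatency + (audioContext.outputLatency ?? 0);
console.log(`Estimated output path delay: ${outputDelaySeconds * 1000} ms`);
```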

The Audio Working Group has agreed to expose AudioContext in workers, which would allow DAWs to avoid tying audio processing to the main UI thread. Updates to the Web Audio API specification are ongoing.

Kazuyuki Ashimura wonders about support for synthesized speech in the Web Audio API. Paul Adenot explains that current systems are not well suited for such processing, as browsers may not even see synthesized audio samples before they reach the speakers or headset. This could be discussed during a possible workshop on voice interaction in 2022.

James Pearce asks about DSP format support for custom audio processing. Browsers do not have native support for specific DSP formats, but in practice the processing code may be authored in a variety of languages. Various libraries are available that handle FAUST, Pure Data, or C++, for instance.

Media synchronization

Sacha Guddoy describes use cases where a video player sits alongside an audio level display, and where audio playback needs to be precisely synchronized with video playback and DOM updates. Paul Adenot explains how the output latency exposed by the Web Audio API may be used to delay DOM updates and video frame rendering (through WebCodecs) to synchronize video and audio playback. The HTMLVideoElement.requestVideoFrameCallback() proposal could be used to simplify the synchronization logic in video-related cases.
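
A rough sketch of the kind of per-frame hook that requestVideoFrameCallback provides; updateLevelMeter() is a hypothetical application function:

```ts
// Sketch: update an audio level display in lockstep with presented video frames.
declare function updateLevelMeter(mediaTime: number): void; // hypothetical app code

const video = document.querySelector("video")!;

function onFrame(_now: DOMHighResTimeStamp, metadata: VideoFrameCallbackMetadata) {
  // metadata.mediaTime is the presentation timestamp of the frame just presented.
  updateLevelMeter(metadata.mediaTime);
  video.requestVideoFrameCallback(onFrame); // re-arm for the next frame
}
video.requestVideoFrameCallback(onFrame);
```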

Sacha Guddoy also explains how multiple WebRTC streams need to be synchronized in live vision mixing applications, e.g. when multiple cameras are used. Provided camera clocks are synchronized and the proposal gets more widely adopted, the Absolute Capture Time header extension could be used to stamp RTP packets with an NTP timestamp. Coupled with Harald Alvestrand's breakout box model, adopted by the WebRTC Working Group shortly after the workshop, this would allow applications to delay and synchronize rendering of media streams.

General synchronization between audio/video and metadata remains an open question. For instance, while media streams are synchronized in WebRTC, data channels are not synchronized with media streams. The ability to expose SEI metadata along with decoded frames in WebCodecs could provide useful synchronization hooks.

Synchronization accuracy needs depend on scenarios and influence the synchronization hooks to expose and/or use. Targeted accuracy levels need to be clarified on a case-by-case basis. Audio may have hard real-time requirements, while some video synchronization scenarios may be content with ~100 ms accuracy. Other video scenarios may require rough or precise frame accuracy.

WebRTC

Sergio Garcia Murillo introduces WHIP (WebRTC-HTTP Ingestion Protocol), a proposal to converge on a signaling protocol for WebRTC. The protocol could be integrated in media production hardware to leverage WebRTC out of the box. This would also create a virtuous circle to support and expose additional capabilities needed for media production. Standardization work on WHIP is ongoing at the IETF.
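
A minimal sketch of the WHIP exchange as described in the IETF draft: the client POSTs an SDP offer and applies the returned SDP answer. The endpoint URL is a placeholder, and ICE handling and session teardown (via the returned Location header) are omitted:

```ts
// Minimal WHIP-style ingest sketch; the endpoint URL is a placeholder.
const pc = new RTCPeerConnection();
const media = await navigator.mediaDevices.getUserMedia({ audio: true, video: true });
media.getTracks().forEach((track) => pc.addTrack(track, media));

const offer = await pc.createOffer();
await pc.setLocalDescription(offer);

const response = await fetch("https://ingest.example.com/whip", {
  method: "POST",
  headers: { "Content-Type": "application/sdp" },
  body: pc.localDescription!.sdp,
});
await pc.setRemoteDescription({ type: "answer", sdp: await response.text() });
```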

Additional capabilities for media production include support for production quality codecs at higher frame rates, support for multi-channel audio (surround) or object-based audio, support for High-Dynamic Range (HDR) and Wide Color Gamut (WCG) media encoding, or support for video with transparency.

Also, WebRTC has no proper mechanism for real-time captioning. An RTCDataChannel may be used to stream cues, but the data channel is not synchronized with audio/video tracks (see the Media synchronization section above). The Timed Text Working Group develops a TTML Live Extensions Module for TTML content, but there is no standardized way to stream WebVTT. How can real-time captioning be integrated in WebRTC?
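
As an illustration of the application-level workaround, a sketch that streams cues over a data channel with explicit timing; the cue format is hypothetical, and the receiving side must align cues with media playback itself:

```ts
// Hypothetical application-defined cue format; not a standardized protocol.
interface LiveCue { text: string; startTime: number; endTime: number; }
declare function displayCue(cue: LiveCue): void; // hypothetical rendering code

const pc = new RTCPeerConnection();
const captionChannel = pc.createDataChannel("captions", { ordered: true });

// Sending side: serialize each cue as it is authored.
function sendCue(cue: LiveCue): void {
  captionChannel.send(JSON.stringify(cue));
}

// Receiving side (in a real app the channel arrives via ondatachannel on the
// other peer): the data channel is not synchronized with the media tracks, so
// the application must align each cue with audio/video playback itself.
captionChannel.onmessage = (event: MessageEvent<string>) => {
  const cue: LiveCue = JSON.parse(event.data);
  displayCue(cue);
};
```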

More advanced scenarios need control over the jitter buffer, for instance to prevent audio distortion when WebRTC is used in musical and other professional audio contexts.

With the exception of real-time captioning, WebRTC features are already defined in draft specifications (e.g. in the WebRTC Extensions) or do not require major updates to existing specifications (e.g. support for codecs). The media production industry still needs to weigh in to prioritize features in implementations. More exploratory work is also needed to clarify requirements for some scenarios (e.g. support for object-based audio).

WebAssembly

Kevin Streeter explains how WebAssembly (WASM) can be used to port authoring applications from the desktop to the web. Lots of optimizations have been integrated into native applications over time. Some of these optimizations get lost in the web version due to missing features in WebAssembly, sometimes resulting in workflows that may run 4 to 5 times slower than on native.

The first missing feature is 64-bit support for heap management. WebAssembly has built-in support for 64-bit numbers, which can be used to speed up pixel processing computations. However, 64-bit memory addresses are still being specified and not yet supported by browsers.

Another missing feature is advanced SIMD support. Luke Wagner explains that the initial batch of SIMD support in WebAssembly was the largest intersection that the group could find that was portably fast across a variety of desktop CPUs. The SIMD sub-group of the WebAssembly Community Group meets bi-weekly to develop the next generation of SIMD instructions in WebAssembly, covering three main dimensions: supporting instructions that are platform-specific, allowing instructions that are non-deterministic, and relaxing the vector size. The SIMD sub-group welcomes practical WebAssembly workloads to guide its work.

Last but not least, on the web, media production applications will never be pure WebAssembly applications. They will also leverage GPU computations through WebGL or WebGPU, Web APIs that run on the CPU, and multi-threaded operations through Workers. Memory copies are typically needed whenever memory boundaries get crossed, and media production workflows manipulate a lot of memory, especially when decoded video frames get processed.
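
As an example of the copies involved, a sketch of moving a decoded VideoFrame into WebAssembly linear memory; the alloc and memory exports are hypothetical names for what the WASM module would provide:

```ts
// Sketch: copy a decoded VideoFrame into WebAssembly linear memory.
// `alloc` and `memory` are hypothetical exports of the WASM module.
async function frameToWasm(frame: VideoFrame, instance: WebAssembly.Instance): Promise<number> {
  const byteLength = frame.allocationSize();
  const alloc = instance.exports.alloc as (size: number) => number;
  const memory = instance.exports.memory as WebAssembly.Memory;

  const offset = alloc(byteLength);
  const destination = new Uint8Array(memory.buffer, offset, byteLength);
  await frame.copyTo(destination); // the copy that mmap-style sharing could avoid
  frame.close();
  return offset;                   // pointer for the WASM-side processing code
}
```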

Luke Wagner details solutions envisioned to reduce memory copies across boundaries. In theory, changes to compilers could emerge that would allow applications to reference memory pages that are outside of WebAssembly's linear memory. In practice, this seems far-fetched given the amount of work needed. A more promising solution would be to refine operations that create copies such that, under the hood, browsers may use memory mapping (mmap) whenever possible. This approach may require updating lots of specifications and is hard to implement, but it would not need to be specific to WebAssembly. In situations where the data needs to be transformed, another approach would be to delay the copy so that the copy and transform operations get fused.

The Web Platform Incubator Community Group (WICG) hosts a coordination effort on reducing memory copies. Interested parties are encouraged to join the discussions in its repository.

File system integration

As with WebAssembly, Kevin Streeter discusses common file system integration issues that emerge when native authoring applications get ported to the web. The problems boil down to the need to handle, process, and transport very large file assets, while optimizing the number of I/O operations to improve performance. Marijn Kruisselbrink presents the Origin Private File System, defined in the File System Access API proposal. The API is designed to work better with large files and to read from and write to them with minimal overhead. That said, a copy still needs to happen when external data needs to be imported into the Origin Private File System. Constraints could perhaps be loosened in read-only scenarios. Writing to files outside the Origin Private File System is trickier.
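
A sketch of low-overhead access within the Origin Private File System, using a synchronous access handle from a dedicated worker; the file name is a placeholder:

```ts
// Runs in a dedicated worker: createSyncAccessHandle() is only exposed there.
const root = await navigator.storage.getDirectory();
const fileHandle = await root.getFileHandle("proxy-edit.mp4", { create: true });
const accessHandle = await fileHandle.createSyncAccessHandle();

// Write a chunk at a given byte offset, then read it back.
const chunk = new Uint8Array(1024 * 1024);
accessHandle.write(chunk, { at: 0 });
accessHandle.flush();

const readBack = new Uint8Array(chunk.byteLength);
accessHandle.read(readBack, { at: 0 });
accessHandle.close();
```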

The Origin Private File System may graduate to WHATWG soon. Support for this and additional features depends heavily on browser vendors' interest. The media production industry needs to weigh in to prioritize the feature in implementations.

Metadata

Bruce Devlin outlines a core issue with metadata: metadata that gets produced at the capture phase is easily lost in subsequent media processing steps. People often need to re-create it afterwards, which is at best inconvenient and costly. Metadata may also be lost during transport or be out of reach of applications during playback. How can metadata be preserved?

Bruce Devlin categorizes metadata along two axes: the format axis (text or binary?), and the time axis (isochronous with each frame, or more irregular and lumpy with embedded timing?). Solutions to manage and expose metadata may depend on where the metadata sits along these two axes, and what use cases are envisioned for that metadata. One example use case is adding provenance and authenticity information, which the Coalition for Content Provenance and Authenticity (C2PA) is currently exploring. The provenance information could be visualized during media playback or when the user presses pause. Synchronization between the metadata and the frame is essential in such scenarios.

Standardization efforts could focus on defining APIs that expose the different categories of metadata. For instance, the DataCue API proposal could expose metadata at the container level. Support for SEI metadata in WebCodecs (discussed during the workshop) could expose metadata that sits at the codec level.

Metadata also needs to use standardized vocabularies so that media production workflows can be defined in more abstract transformation terms and be applied broadly to various sources of inputs and outputs. Julian Fernandez-Campon shows how standardized vocabularies can be used to introduce processing steps in workflows that can leverage a variety of tools and services. For media content, the SMPTE ST 2065 (ACES) standard seems like a good option. Brendan Quinn points to the IPTC Video Metadata Hub. Other standards may be used. Should W3C develop a mapping vocabulary between existing standards?

Accessibility

Ed Gray reviews existing accessibility guidelines, notably the Web Content Accessibility Guidelines (WCAG) and Authoring Tool Accessibility Guidelines (ATAG), communities of practice like the International Association of Accessibility Professionals (IAAP) and WebAIM (Web Accessibility In Mind), and self-reporting formats for accessibility such as the Voluntary Product Accessibility Template (VPAT).

Accessibility in media authoring tools affects many layers, ranging from contrast and keyboard navigation considerations to closed caption support and announcing when a camera is plugged in. It is a never-ending task, best addressed when companies invest in dedicated teams and when accessibility measures actually benefit everybody, as developed in the Principles of Universal Design.

Other topics

Pre-recorded talks raised other issues that were not discussed during the live sessions, notably some architectural issues that invite people to take a step back and look at the broader picture:

  • All in all, media production applications need access to low-level features. Could such applications somehow be empowered through a different trust model? (see issue #58)
  • Various synchronization needs were raised and discussed during the live sessions. Should the web platform also expose more precise frame identification primitives than `currentTime`, e.g. using a rational number? (see issue #47)
  • In workers, can some of the APIs expose a synchronous mode, e.g. to ease integration with C++/WASM code running on a synchronous model? (see issue #45)
  • What changes may be needed for the web platform to be able to provide a secure perimeter for audio/video processing workflows that leverage a mix of technologies such as WebCodecs, WebAssembly, WebGPU and/or WebRTC? (see issue #26)
  • On top of exposing low-level primitives such as WebCodecs, should there be some standardization effort on a higher level video editing API that media production applications could leverage directly? Such an API could perhaps take the form of an open source library built on top of WebCodecs. (see issue #55)
  • Given the complexity of Web technologies and the size of professional media production applications codebases, could browser vendors converge on a mechanism, integrated in the release cycle of new versions, that would allow developers to test their app in upcoming versions and report bugs before these new versions get released? (see issue #57)
  • In a proxy-based architecture, the application could also run in the cloud, synchronized with the application running on client devices, making it possible to create multi-user co-browsing experiences. In his talk on Distributed multi-party media-rich content review, Oleg Sidorkin reviews some of the hurdles that such a context creates, including the difficulty of observing all events that need to be propagated on elements such as canvas and custom elements' shadow DOM trees.

In his talk, Patrick Brosset introduces the EyeDropper API, a proposal for providing access to a browser-supplied eyedropper. James Pearce suggests extending the API so that it can be constrained to a particular DOM element, to enable scenarios where users need to pick a color from within a particular video frame.
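
A sketch of the API as currently proposed (the element-scoping constraint suggested above does not exist yet); EyeDropper is still a WICG proposal and may require ambient type declarations:

```ts
// Sketch of the proposed EyeDropper API.
const eyeDropper = new EyeDropper();
try {
  const result = await eyeDropper.open(); // the user picks a pixel on screen
  console.log(result.sRGBHex);            // e.g. "#ff8800"
} catch {
  // The user dismissed the eyedropper without selecting a color.
}
```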

Next steps

Existing technologies

A main takeaway from the workshop is that the web platform already provides suitable building blocks to enable core media production scenarios in a proxy-based architecture: media streaming and rendering technologies (e.g. the video element, canvas-based rendering, MSE, the Web Audio API), transport technologies (Fetch, WebRTC), processing technologies (through JavaScript, WebAssembly, or WebGL) and storage technologies (File API, IndexedDB) are widely supported and used across authoring applications.

It seems clear, however, that the web platform cannot easily accommodate a no-proxy architecture today for professional media production scenarios. Technical gaps raised during the workshop mean that the scenarios can be achieved in web applications but only to some extent. For instance, media authoring applications running on client devices may need to clamp the resolution of videos e.g. to 480x270 when they have to decode and process the video themselves, because they cannot leverage hardware decoders. They may also run into hard-to-solve synchronization issues, jeopardize color fidelity, or may run poorly compared to native applications because they cannot easily leverage optimizations for processing media such as advanced SIMD instructions.

Ongoing standardization efforts

Workshop talks and discussions show that ongoing standardization efforts will bring advanced features and performance improvements that the media production industry needs. These include low-level access to media as exposed by WebCodecs, better latency measurement capabilities in the Web Audio API, enhanced performance in WebAssembly (advanced SIMD, 64-bit memory heap support), smoother UIs when APIs are all available in workers, and production-quality support in WebRTC (multi-channel audio, support for codecs at higher frame rates).

These ongoing standardization efforts span multiple groups including the Media Working Group (e.g. WebCodecs, Media Capabilities), the Audio Working Group, the WebAssembly Working Group, the WebRTC Working Group, the Accessibility Guidelines Working Group (WCAG), the GPU for the Web Working Group (WebGPU), the Timed Text Working Group, or the Web Platform Incubator Community Group (WICG) for pre-standardization efforts (e.g. Origin Private File System).

These groups operate in public and happily take input on their deliverables, usually through issues raised on GitHub repositories. Such an individual approach works well for reporting specific needs, and some workshop participants have already provided feedback on WebCodecs or the Web Audio API.

An individual approach is not always enough. Groups may need more input to evaluate a feature: to confirm that the need is widely shared across the industry, to assess whether leaving the feature up to applications is a reasonable tradeoff between performance and interoperability, or to explore alternative designs. Also, some features only make sense when viewed from a broader media production perspective, which groups do not necessarily have.

To go beyond individual contributions, a coordination effort is needed. Some coordination points already exist, such as the WICG effort on reducing memory copies and the Media & Entertainment Interest Group's Media Timed Events Task Force mentioned earlier.

Towards a Media Production Task Force

Coordination points mentioned above target topics that overlap with those raised during the workshop. There is, however, no coordination effort that looks at professional media production needs on the web as a whole. The workshop was a one-time coordination effort but it seems clear that a more in-depth analysis is needed for some of the discussions started during the workshop. Also, the workshop could not cover all relevant topics for lack of time.

Workshop participants propose that a coordination effort on web-based professional media production be created. The scope of this effort would match that of the workshop: professional media production using the web platform, including editing, quality control, grading/color correction, dailies, visual effects, sound, mastering, translation and servicing. Cloud-based processes and desktop applications that do not use the web platform would be out of scope. Similarly, applications that do not manipulate the content such as file sharing applications would be out of scope.

This coordination effort would be responsible for:

  • Connecting the web platform and the professional media production communities,
  • Documenting use cases and specific needs for professional media production,
  • Quantifying performance needs when a feature can already be achieved at the application level using existing technologies,
  • Prioritizing and promoting proposals to working groups and implementers,
  • Tracking progress and implementations.

Looking at specific topics raised during the workshop, this coordination effort could:

  • Run code experiments on muxing and demuxing at the application level to inform the possible creation of a (de-)muxing API in WebCodecs.
  • Evaluate the quality control tuning knob under discussion in WebCodecs.
  • Gather media production use cases around metadata management, notably on SEI metadata and on the encoding side, to be fed into discussions of the Media Timed Events Task Force and Media Working Group.
  • Document needs for professional codecs and monitor support in implementations.
  • Document real-time captioning requirements for WebRTC, in collaboration with the WebRTC Working Group.
  • Explore synchronization needs and gaps, using code to quantify issues.
  • Gather typical memory workloads to analyze performance issues when memory boundaries (CPU, GPU, WASM) need to be crossed, and help relevant Working Groups adjust their APIs to avoid problematic copies.
  • Document file asset management issues that are specific to media production due to the size of the assets.
  • Make sure that the media production angle is considered in Color on the Web Community Group discussions.
  • Explore needs for standardized vocabularies in media production workflows.
  • Explore more specific needs such as pre-roll/priming, or the ability to get a decoded frame from a <video> element.

For better impact, this coordination effort should gravitate around the Working Groups that develop the APIs and the existing coordination points that already address topics of interest. It is proposed that the W3C Media & Entertainment Interest Group host this coordination effort: it fits its mandate as the steering group for media standardization efforts in W3C, and the envisioned scope does not include development of technical solutions, which would rather find a natural home in the Working Groups responsible for the underlying standards (Media Working Group, Audio Working Group, etc.).

Thank you!

The organizers express deep gratitude to those who helped with the organization and execution of the workshop, starting with the members of the Program Committee and speakers who provided initial support and helped shape the workshop. Huge kudos to Chris Needham and Pierre-Anthony Lemieux for chairing the workshop, and to Adobe for sponsoring it. Many thanks to those who took an active role under the hood, notably the events and business development teams at W3C and SMPTE, Marie-Claire Forgue for editing the videos after the workshop, and all W3C/SMPTE team members who took part in the workshop one way or another. Finally, a big thank you to workshop participants, without whom the workshop wouldn't have been such a productive and inspiring event. Congratulations to all: the first shot was good, and more shots are needed!

Sponsor

Adobe
