Memory access patterns in Web Codecs
Presenter: Paul Adenot (Mozilla)
Duration: 8 minutes
Hi, my name is Paul Adenot. I work at Mozilla on anything related to audio and video in Firefox.
Today, I'd like to talk about a recent API called the WebCodecs API, and in particular about memory access patterns when using this API in conjunction with other APIs on the web platform. We'll talk about some raw performance numbers when accessing video frames, what WebCodecs does today to minimize memory access overhead, and some problems that the API has today where a solution exists but hasn't been implemented. More importantly, we'll talk about a couple of harder problems that we'll tackle in the future, so that using WebCodecs will have the same performance profile as a native app.
Before starting, I'd like to say that simple scenarios, such as decoding and rendering video and audio, are fairly optimized today. We'll be talking about advanced use cases, such as big native apps compiled to WebAssembly that leverage WebCodecs to speed up their encoding and decoding operations, or other programs that make rather intensive use of WebCodecs and want to minimize any inefficiencies.
But first, let me repeat an old quote often heard in free software multimedia development circles: "memcpy is murder". Behind this tongue-in-cheek sentence lies an important fact: to maximize performance, as few memory copies as possible should happen, and the working set of an app needs to be as small as possible, ideally fitting into CPU caches. Fetching memory into caches is slow and caches are small, so no duplication should happen.
Let's start by considering the size of the problem.
A video frame in YUV420, full HD, standard dynamic range is around 4MB. A video frame in YUV420, 4K, standard dynamic range is around 16MB. A video frame in P010, 4K, high dynamic range (either 10 or 12 bits) is around 32MB.
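The round figures quoted here are approximations. As a minimal sketch of the underlying arithmetic, the following computes exact sizes for 4:2:0 frames. It assumes tightly packed planes with no row padding (real decoders often add alignment), and the function name `frameBytes420` is made up for this illustration.

```typescript
// Bytes needed for one 4:2:0 frame, assuming tightly packed planes
// (no row padding or alignment, which real decoders often add).
function frameBytes420(width: number, height: number, bytesPerSample: number): number {
  const lumaSamples = width * height;
  // 4:2:0 halves the chroma resolution in both directions, rounding up,
  // and there are two chroma components (U and V).
  const chromaSamples = 2 * Math.ceil(width / 2) * Math.ceil(height / 2);
  return (lumaSamples + chromaSamples) * bytesPerSample;
}

const toMiB = (bytes: number): string => (bytes / (1024 * 1024)).toFixed(1);

// 8-bit formats such as I420 store one byte per sample;
// P010 stores each 10-bit sample in a 16-bit word.
const sizes = {
  "I420 1920x1080": frameBytes420(1920, 1080, 1),
  "I420 3840x2160": frameBytes420(3840, 2160, 1),
  "P010 1920x1080": frameBytes420(1920, 1080, 2),
  "P010 3840x2160": frameBytes420(3840, 2160, 2),
};

for (const [label, bytes] of Object.entries(sizes)) {
  console.log(`${label}: ${bytes} bytes (${toMiB(bytes)} MiB)`);
}
```

Under these assumptions an 8-bit full HD frame is about 3 MiB and an 8-bit 4K frame about 12 MiB, and a 16-bit-per-sample format doubles both.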
Here are some performance numbers for copying a single frame of each of these sizes, which I've gathered by writing a small C++ program on my extremely high-end Linux workstation. A video frame in YUV420, full HD, standard dynamic range takes about 1.5ms to copy when the cache is hot, which means the source is already in the cache, and 4.5ms if not, when the source needs to be fetched from memory. Two frames easily fit in the cache of this CPU.
A video frame in YUV420, 4K, standard dynamic range takes 6.6ms when in CPU cache and 17ms when not. This only works because the CPU I'm running on has lots of cache, and a single frame still fits.
A video frame in P010, 4K, high dynamic range (either 10 or 12 bits) takes 15ms if the caches are hot and 33ms if not. The CPU I'm running this on has 20MB of L3 cache, so only a partial frame fits.
Considering that the budget for a single frame is 33ms at 30Hz and 16.6ms at 60Hz, we can quickly see that the numbers here are uncomfortable, and that minimizing memory copies is going to be important. I'll repeat that standard scenarios in WebCodecs, such as playback, don't make copies. I'm talking about advanced use cases here, such as processing.
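To make the comparison concrete, here is a small sketch that expresses each copy time quoted in the talk as a share of the per-frame budget. The helper names `frameBudgetMs` and `budgetShare` are illustrative, and the timings are simply the ones given above for one particular workstation.

```typescript
// Time available to produce one frame at a given rate, in milliseconds.
function frameBudgetMs(framesPerSecond: number): number {
  return 1000 / framesPerSecond;
}

// Fraction of the per-frame budget consumed by a single copy.
function budgetShare(copyMs: number, framesPerSecond: number): number {
  return copyMs / frameBudgetMs(framesPerSecond);
}

// Copy timings quoted in the talk, in milliseconds (hot and cold cache).
const copyTimesMs = [1.5, 4.5, 6.6, 17, 15, 33];

for (const copyMs of copyTimesMs) {
  const at30 = Math.round(budgetShare(copyMs, 30) * 100);
  const at60 = Math.round(budgetShare(copyMs, 60) * 100);
  console.log(`${copyMs} ms copy: ${at30}% of a 30Hz frame, ${at60}% of a 60Hz frame`);
}
```

A single 17ms copy already exceeds the whole 60Hz budget, and a 33ms copy uses essentially the entire 30Hz budget before any processing has happened.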
Additionally, GPU to CPU copies (readback) and CPU to GPU copies (texture uploads) are also quite expensive. It's best to carefully consider when and how they need to be done, and to minimize transfers where possible. WebCodecs has been carefully designed to make it easy to leave VideoFrames on the GPU, and it makes all copies explicit.
But sometimes it really is necessary to perform copies. Here are three scenarios when this is unavoidable.
Custom processing of video frames that are on the CPU; processing via WASM, where the data then needs to be copied into the WASM heap; or working with other web APIs that require copies.
For example, playing audio data with an AudioWorklet will require copying into the AudioWorklet output buffer.
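The WASM case can be sketched as follows. This is an illustration under assumptions, not code from the talk: `copyIntoWasmHeap` is a made-up helper, the destination offset is chosen by hand (a real application would normally get it from the module's own allocator), and the source bytes stand in for decoded data.

```typescript
// Copy `src` into a WebAssembly linear memory at `offset`, growing the
// memory first if it is too small. Returns a view over the copied bytes.
function copyIntoWasmHeap(
  memory: WebAssembly.Memory,
  src: Uint8Array,
  offset: number,
): Uint8Array {
  const PAGE_SIZE = 65536; // WebAssembly pages are 64 KiB
  const needed = offset + src.byteLength;
  if (needed > memory.buffer.byteLength) {
    const missingPages = Math.ceil((needed - memory.buffer.byteLength) / PAGE_SIZE);
    memory.grow(missingPages);
  }
  // Create the view after any grow(): growing detaches the old ArrayBuffer.
  const heap = new Uint8Array(memory.buffer);
  heap.set(src, offset); // this is the copy the talk refers to
  return heap.subarray(offset, needed);
}

const memory = new WebAssembly.Memory({ initial: 1 });
// Hypothetical stand-in for decoded data living outside the WASM heap.
const samples = new Uint8Array(100_000).fill(7);
const inHeap = copyIntoWasmHeap(memory, samples, 1024);
console.log(inHeap.byteLength, memory.buffer.byteLength);
```

Code compiled to WASM can only address its own linear memory, which is why data produced elsewhere has to be copied in before it can be processed.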
As said, WebCodecs has been designed with copy minimization in mind. The memory is not directly visible to script, and authors need to call a method called "copyTo" to copy the data into an ArrayBuffer that can be directly manipulated. This "copyTo" method can also take care of conversion in certain cases. When calling "clone" on a VideoFrame or AudioData object, the underlying resources are referenced a second time instead of being copied, so a single frame can be used efficiently in different contexts. Doing a deep copy is still possible.
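A minimal sketch of the explicit copy might look like this. `allocationSize()`, `copyTo()` and `close()` are existing VideoFrame methods; the `CopyableFrame` interface and the `readPixels` helper are assumptions introduced here so that the sketch is self-contained outside a browser.

```typescript
// Minimal structural type covering the VideoFrame members used here, so the
// sketch type-checks outside a browser. A real VideoFrame satisfies it.
interface CopyableFrame {
  allocationSize(): number;
  copyTo(destination: Uint8Array): Promise<unknown>;
  close(): void;
}

// Copy a frame's pixel data into script-visible memory, then release the frame.
async function readPixels(frame: CopyableFrame): Promise<Uint8Array> {
  // Ask the frame how many bytes the copy needs before allocating.
  const pixels = new Uint8Array(frame.allocationSize());
  try {
    await frame.copyTo(pixels); // the one explicit copy
  } finally {
    frame.close(); // release our reference to the underlying resource promptly
  }
  return pixels;
}
```

In a decoder output callback this could be used as `const pixels = await readPixels(frame)`. Nothing is copied until the script asks for it, which is what keeps simple playback paths copy-free.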
Now, let's see a number of copies that exist today in WebCodecs.
First, the compressed input of a decoder is currently copied. This is not that problematic because the input is a lot smaller than the output. It could be optimized.
More importantly, there is currently no way to become the owner of the memory behind a VideoFrame or an AudioData when it's regular, non-GPU memory.
Finally, the API for now does a lot of allocations and deallocations, thrashing CPU caches unnecessarily.
Here are two simple design proposals to fix the last two points in this list.
First, a method called "detach" could return an ArrayBuffer and close the video frame in one call, skipping a copy when possible, for example when the frame hasn't been cloned, which is fairly common. Similarly, we can add this method on AudioData.
Next, we can limit the native allocation and deallocation pressure by passing a buffer to the decode method, into which the decoded data is going to be written, and getting the input buffer back in the output callback, to reuse it. This will matter a lot for audio, where it's a lot easier for buffers to fit in CPU caches, but where there are a lot more of them.
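The decode-into-a-buffer API described here is only a proposal, so it cannot be shown directly. The application-side half of the idea, recycling buffers instead of allocating a new one per packet, can be sketched with a small pool. `BufferPool` and its members are hypothetical names for this illustration, and the buffer size is an arbitrary example.

```typescript
// A tiny pool of equally sized buffers: acquire() reuses a released buffer
// when one is available and only allocates when the pool is empty.
class BufferPool {
  private free: ArrayBuffer[] = [];
  private readonly byteLength: number;
  allocations = 0;

  constructor(byteLength: number) {
    this.byteLength = byteLength;
  }

  acquire(): ArrayBuffer {
    const reused = this.free.pop();
    if (reused !== undefined) return reused;
    this.allocations++;
    return new ArrayBuffer(this.byteLength);
  }

  release(buffer: ArrayBuffer): void {
    if (buffer.byteLength !== this.byteLength) {
      throw new RangeError("buffer does not belong to this pool");
    }
    this.free.push(buffer);
  }
}

// Example size: 1024 stereo frames of 32-bit float audio.
const pool = new BufferPool(1024 * 2 * 4);
for (let packet = 0; packet < 1000; packet++) {
  const buffer = pool.acquire();
  // ... a decoder would write into `buffer` and the app would consume it ...
  pool.release(buffer);
}
console.log(pool.allocations); // 1
```

With one buffer in flight at a time, a thousand packets cause a single allocation, and the same memory stays warm in the cache.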
Now let's talk about some of the harder problems faced on the web platform today.
First, the necessity to copy to and from the WASM heap becomes problematic when the data to copy is big enough, such as video frames. Decoding into the WASM heap would be a welcome feature, but this needs some work.
Then there is the problem of passing views over a SharedArrayBuffer to APIs. SharedArrayBuffer is often a good solution to limit copies, but it's often unclear what those APIs do to the memory, and whether they work with concurrent writes to the memory region that has been passed to them. Having read-only memory ranges on SharedArrayBuffer could be a solution here, but it's a complex problem.
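The aliasing at the heart of this problem can be shown in a few lines. This is a single-threaded sketch, so the "concurrent" writer is simulated by the same thread; the names `producerView` and `consumerView` are illustrative only.

```typescript
// Two views over one SharedArrayBuffer alias the same memory: handing a view
// to another API or thread shares the bytes instead of copying them.
const shared = new SharedArrayBuffer(4 * Int32Array.BYTES_PER_ELEMENT);
const producerView = new Int32Array(shared);
const consumerView = new Int32Array(shared);

// Value seen by a hypothetical consumer that assumes the data is stable.
const before = Atomics.load(consumerView, 0);

// A writer changes the data in place (here simply the same thread).
Atomics.store(producerView, 0, 42);
const after = Atomics.load(consumerView, 0);
console.log(before, after); // 0 42

// A defensive consumer has to copy to get a stable snapshot, which brings
// back the very cost that sharing was meant to avoid.
const snapshot = consumerView.slice();
Atomics.store(producerView, 0, 7);
console.log(snapshot[0], Atomics.load(consumerView, 0)); // 42 7
```

An API that receives such a view cannot tell whether another thread will write to it mid-operation, which is why read-only ranges are mentioned as a possible direction.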
An effort has been started at the WICG to document the problems with copies on the platform, and I encourage interested parties to voice their opinions and contribute their experience to the discussion. Reports from the field are always welcome and very important to inform future developments.
In conclusion, while WebCodecs can and will be improved when it comes to memory access, it's already extremely powerful, and common use cases are already quite fast. Making it perfect is certainly possible and will probably happen in the future. Thanks.