
Memory copies & zero-copy operations on the Web

Facilitator: François Daoust

Speakers: Bernard Aboba

In scenarios such as real-time processing of media frames with ML algorithms, memory copies made along the processing pipeline may account for a non negligible part of the overall performance budget of the said processing. See Memory copies discussion in the Web and Machine Learning workshop for context. Could the Web do better?

Slides

Minutes (including discussions that were not audio-recorded)



Transcript

Fantastic.

So to give you a little bit of context, at the origin of this breakout session, there is the recent workshop on Web and Machine Learning that Dom organized in September.

And during this workshop, various speakers raised efficiency issues when they try to process media, in particular, in real time with machine learning.

And so for instance, Bernard Aboba specifically made a reference to the impact of memory copies in a WebRTC environment.

And another speaker, Tero Parviainen, I think from Counterpoint, shared that in one of the experiments they have been running, which is around generating music from video frames, the cost of moving the data around is as high as the cost of the machine learning inference itself.

So actually moving the bytes around takes as much time as doing the actual processing of these bytes, which is not fantastic in practice.

So the issue is not restricted to machine learning.

So, I know that for instance, the GPU for the Web Working Group has been discussing the impact of copies in the context of how to use the content of a canvas or a video element as a GPU texture.

Last week there has been an extensive discussion about that.

More importantly, that seems to be the kind of issue that spans multiple groups and that as a result may not have a clear owner.

So Dom and I thought that it could be useful to convene people here from different perspectives to discuss.

And that's exactly why we're here.

So I would like to take 10 to 15 minutes to introduce the situation as I understand it.

And since we're recording the talk, I'll go through it and perhaps it's better if you keep the questions for later, or, as Dom was saying, you can chime in, but remember that you are going to be recorded.

And then the goal is to discuss.

What I think would be a useful outcome for this breakout session is to clarify whether things are already under control, meaning that people are actually discussing, each group is doing what it should be doing, and there's no reason to worry about memory copies in general.

Or whether actually there is a need to worry about memory copies in general.

And we should be looking for some kind of coordination effort, somewhere somehow with some people to make sure that things are being handled correctly.

So to try to reason about this, I thought I would try to represent the memory and the different components that may trigger a memory copy, that may explain why we do memory copies.

So this visualization is going to be extremely rough, incomplete and even possibly wrong, but that's okay, it's there to trigger discussion.

So that's fine if it's not the right way to think about it.

So anyway, this first block is the memory.

This block is actually divided into two, there is the CPU memory and the GPU memory.

In most systems, they are physically distinct.

If you need to process a CPU-based buffer with the GPU, you will have to upload the buffer first to the GPU memory before you can process it on the GPU, and vice versa obviously.

And then I try to represent the browser in there.

And the browser needs to deal both with the CPU memory and the GPU memory.

And it actually has to manage different internal isolated runtimes.

I don't know if runtime is the right term for what I wanted to represent, so I used land here.

So in the browser, you're going to have the JavaScript land, you have the Web Assembly land, and you'll have the WebGL and WebGPU land.

The first two, the JavaScript land and the WebAssembly land, are tied to the CPU memory.

The WebGL and WebGPU land is tied to the GPU memory.

Going down a bit further, the JavaScript land is actually divided into further regions, thanks to workers.

And browsers leverage the available hardware to provide inputs, outputs, and run specific processing algorithms.

That's what appears on the right of this diagram there.

For I/O, we're talking about network cards, microphones, cameras, speakers, and screens. For processing, we're talking about media encoders and decoders, which are often hardware-based on embedded devices.

I didn't exactly know where to draw these hardware boxes. It seems to me that, at least theoretically, it's doable to tie them to either the CPU memory or the GPU memory, but I'm happy to be wrong, so feel free to correct me.

So all of these blocks need to communicate with the browser acting as mediator.

So in a non-optimized version of a browser, memory copies are going to happen whenever one of these boundaries gets crossed.

For instance, when some JavaScript code needs to exchange data with a WebAssembly module, a copy is going to be made.

When some I/O gets processed by the browser and handed over to JavaScript, a copy needs to be made.
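To make the JavaScript/WebAssembly case concrete, here is a minimal sketch (the offset and buffer contents are made up for illustration) showing that bytes have to be copied into, and back out of, WebAssembly linear memory today:

```javascript
// WebAssembly linear memory: 1 page = 64 KiB.
const memory = new WebAssembly.Memory({ initial: 1 });

// A buffer produced on the JavaScript side (e.g. pixels from a canvas).
const jsBytes = Uint8Array.from([10, 20, 30, 40]);

// Copy IN: a Wasm module cannot see `jsBytes` directly; the bytes
// must be written into its linear memory at some agreed-upon offset.
const wasmView = new Uint8Array(memory.buffer);
const offset = 0;
wasmView.set(jsBytes, offset);            // copy #1

// ... a Wasm function would now process memory[offset .. offset+4) ...

// Copy OUT: reading results back into a JavaScript-owned buffer
// is a second copy.
const result = wasmView.slice(offset, offset + jsBytes.length); // copy #2

console.log(Array.from(result)); // [ 10, 20, 30, 40 ]
```

Each crossing of the boundary is an explicit pass over the bytes, which is exactly the kind of copy being discussed here.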

But that's in a non-optimized version.

So if I now use that as a way to reason about memory copies, I think that the main reasons why memory copies are made are the ones that appear on this slide.

There are the transfers across memory boundaries we just saw.

Some of them are physical, meaning there's no way we can remove them.

So that's the transfer between CPU memory and GPU memory.

And others are more artificial, but isolation may be needed, for instance for security reasons.

Copies can also be triggered by the need to transcode internal structures.

That may be the case between JavaScript and WebAssembly: there is some difference, a buffer doesn't look the same in a JavaScript environment and in a WebAssembly environment.

Or browsers may need to transcode, for instance, a decoded video frame that is in RGBA into YUV or vice versa.

More generically, there is often a need to process streams: to decode them, to encode them, to decrypt, to encrypt.

And these operations are likely to trigger copies that are probably unavoidable.

There may be other reasons why copies are made.

One that I think has happened, at least in the past, or that still continues to be done this way: you probably want some APIs to guarantee that whenever the application hands over a buffer to an API function, taking for granted that this API function is going to do the work asynchronously, the API function is going to start by making a copy of the bytes, so that if the app changes the buffer just after the call, it doesn't affect the outcome of the API function.

So we provide some guarantee as to what's going to happen.
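That defensive-copy pattern can be sketched as follows; the `processAsync` function and its behavior are invented for illustration, but they show how an API can guarantee a stable outcome at the cost of one extra copy:

```javascript
// Hypothetical async API that defensively copies its input so callers
// can reuse the buffer immediately after the call returns.
function processAsync(input) {
  // Defensive copy taken synchronously, before returning to the caller.
  const snapshot = input.slice();
  return new Promise((resolve) => {
    // The actual work happens later; it sees the snapshot,
    // not the live buffer the caller still owns.
    setTimeout(() => resolve(snapshot.reduce((a, b) => a + b, 0)), 0);
  });
}

const buffer = Uint8Array.from([1, 2, 3]);
const pending = processAsync(buffer);    // sum 1+2+3 = 6 is "locked in"
buffer.fill(0);                          // caller mutates right away

pending.then((sum) => console.log(sum)); // 6, thanks to the copy
```

Without the `slice()`, the result would depend on when the work actually runs relative to the caller's mutation, which is exactly the hazard the copy is there to prevent.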

That may explain why some copies are made too.

More generically, it may actually be easier for developers to have APIs that somehow require memory copies, because they're going to be easier to use.

The last reason, of course, is that the cost of copies may seem negligible, and there may be good reasons to do copies because it's easier to implement, for instance. In that case, it's just an implementation issue, and there's probably not much that we need to do from a specification perspective.

So let me now look at what it would take to reduce memory copies.

Looking back at my fantastic diagram, to me that means getting direct communication lines across boundaries.

So for instance, in a media processing scenario, if media decoding is going to take place in the GPU memory, then perhaps the ideal processing pipeline would be to do everything on the GPU, from the initial fetch to the final rendering to the screen.

So again, some copies cannot be avoided.

So for instance, here, with the WebGL land and the WebGPU land, you cannot expect not to have copies: physically, you have to go from the CPU memory to the GPU memory or vice versa. But you may hope to have some reference, for instance, to the different buffers.

And in practice there are already mechanisms in place to get these direct communication lines.

SharedArrayBuffer is an example; Transferable objects are another.
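A minimal sketch of both mechanisms as they exist in JavaScript today; `structuredClone` with a transfer list is used here for brevity, `postMessage` to a worker behaves the same way:

```javascript
// 1. Transferring: ownership of the underlying bytes moves; no copy.
const original = new ArrayBuffer(1024);
const transferred = structuredClone(original, { transfer: [original] });

console.log(original.byteLength);    // 0: the source buffer is detached
console.log(transferred.byteLength); // 1024: same bytes, new owner

// 2. Sharing: a SharedArrayBuffer is visible from several agents at
// once, so no copy is needed in the first place, at the price of
// needing synchronization, e.g. via Atomics.
const shared = new SharedArrayBuffer(16);
const view = new Int32Array(shared);
Atomics.store(view, 0, 42);
console.log(Atomics.load(view, 0));  // 42
```

In both cases the bytes themselves never move, which is the point: the boundary is crossed by handing over a reference or ownership, not by copying.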

And of course there are plenty of Ad-hoc opaque interfaces that have been designed especially to deal with streams.

And they've been designed as opaque on purpose so that they don't expose the bytes.

And this allows browsers to optimize memory handling.

So that's true for MediaStreamTrack, that's true for all streams, the media element, nodes in the Web Audio API. And I believe that there is an ongoing plan to expose opaque frames in WebCodecs.

Opaque decoded frames in WebCodecs for the ones that are GPU backed.

But WebCodecs folks may want to chime in there at some point, maybe after the talk.

So I thought it would be useful.

I don't know if Bernard has arrived, I thought it would be useful to.

I'm here.

Okay, wonderful.

It might be useful to quantify what gains we might expect from reducing memory copies.

And Bernard has been working on a few scenarios there that shed some light on this.

So Bernard, if you want to introduce the use case, the floor is yours.

The use case is Conferencing 'Gallery View', which has become a popular video conferencing feature, particularly in education, where the teacher may want to keep an eye on each student in the class, with a K-through-12 class often exceeding 25 students, and college classes supporting much larger class sizes.

There's been a demand for larger and larger galleries.

The native applications such as Zoom and Teams are offering seven by seven gallery views.

But I believe that web applications are currently limited to about four by four.

In this scenario the teacher typically only sends one or maybe two streams if they're doing both video and screen share, but they receive many streams.

The use case is well within the capabilities of today's cable internet services.

It can typically be done for less than 30 megabits per second, but it pushes the boundaries of the receive, decode and display path, due to the large volume of raw video that needs to be displayed.

So we implemented this natively with a zero copy, all GPU pipeline.

So on receive the data will go directly from the network interface to the GPU.

The decode is done in the GPU, and then you go from the GPU directly to rasterization.

And if you do all of that, then you can get a seven by seven gallery at 30 frames per second.

With the resolution depending on the device.

But each additional copy in the receive pipeline, because that is the bottleneck, will reduce the maximum gallery size.

So if you have one additional copy that will reduce the maximum gallery size to five by five.

If you have two copies, it'll go to four by four, which is where web applications are today.

And if you have three copies, which is what we have in current WebTransport implementations: that would be one copy purely on the receive side, which will probably go to zero copy over time; another one for process separation; and another copy for rasterization. Then you will be limited to three by three.

So this is a use case where, because the memory operations are the bottleneck, each copy directly reduces performance.
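The linear scaling just described can be sketched with back-of-envelope arithmetic; the tile resolution, pixel format and frame rate below are illustrative assumptions, not figures from the talk:

```javascript
// Each extra copy in the receive pipeline is one more full pass over
// the raw decoded video, so memory traffic grows linearly with copies.

// Assumed per-tile parameters for a gallery of small videos.
const width = 320, height = 180, fps = 30;
const bytesPerPixel = 1.5;          // e.g. NV12 / 4:2:0 planar layout

function rawBytesPerSecond(tiles) {
  return tiles * width * height * bytesPerPixel * fps;
}

// Traffic for a 7x7 gallery, per pass over the data.
const perPass = rawBytesPerSecond(7 * 7);

// Total memory traffic as a function of copies; one pass is
// unavoidable, since the decoder has to write each frame out once.
function totalTraffic(copies) {
  return perPass * (1 + copies);
}

console.log((perPass / 1e6).toFixed(1));         // 127.0 MB/s per pass
console.log((totalTraffic(3) / 1e6).toFixed(1)); // 508.0 MB/s with 3 copies
```

With a fixed memory-bandwidth budget, halving the number of passes roughly doubles the number of tiles you can afford, which matches the gallery-size steps Bernard describes.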

I'd also add that there's no machine learning going on here on the receive path; that occurs only on the send path.

So for example, if you're doing a background blur or custom backgrounds or anything like that, that happens on the Send-Side.

So it isn't really a consideration here.

So that's the use case for 'Gallery View'.

Thanks a lot, Bernard.

So I think that's really illustrative of the practical need and the practical impact of memory copies.

And again, in this case it's not a machine learning issue, it's not related to machine learning at all; it's broader than that.

I've been looking around, and I've seen various proposals under discussion in various groups which are directly or indirectly related to reducing memory copies.

The ones listed there are the ones that crossed my radar.

They touch on streams, WebTransport, WebAssembly, WebGPU, WebCodecs and media in general, and WebRTC, of course.

Some of them got started a long time ago.

And so it may be useful to review their status after all this time.

There may be other additional ideas that may be worth discussing.

And I have no idea whether they've already been discussed, whether they're bad ideas, or whether they should be discussed.

So for instance, the ability to fetch directly into the GPU memory; maybe it's already possible, I don't know.

Or the ability for an application to be explicit about the media processing pipeline that it's going to follow.

If you knew what the different processing steps you're going to apply to a media stream are, and if you could tell that in advance to the user agent, there may be a way to optimize things, I don't know, or maybe not.

So that's it, essentially for the presentation.

I believe it would be useful to look at possible scenarios such as the one that Bernard presented.

Understand how many boundaries in my diagram these scenarios actually cross, and ask ourselves the question: are these scenarios being addressed somewhere?

Is there a group that actually tracks them?

For instance, it may be that most interesting scenarios only cross one boundary, already have a clear owner, and are being addressed through one-to-one exchanges between the two groups. And if that's the case, then maybe this breakout is a (indistinct) about nothing and we can just go back to work.

Or it may also be that most of the problems are implementation-specific, so maybe there's nothing that we should be working on at the specification level.

Or perhaps there is a need for more coordination, and given the participation here, there seems to be at least some suggestion that there is some need. So the question becomes: how do we do it?

So let's discuss. I think we can stop the recording now and start the discussion.

