W3C

– DRAFT –
Memory copies & zero-copy operations on the Web

26 October 2020

Attendees

Present
Adam_Rice, Anita_Chen, Anne_van_Kesteren, Anssi_Kostiainen, Austin_Eng, Ben_Smith, Ben_Wagner, Bernard_Aboba, Carine_Bournez, Chai_Chaoweeraprasit, Chris_Cunningham, Dan_Sanders, Daniel_Ehrenberg, Dominique_Hazael-Massieux, Elad_Alan, Florent_Castelli, Francois_Daoust, Geunhyung_Kim, Guido, Harald_Alvestrand, Jan-Ivar_Bruaroey, Keith_Miller, Ken_Russell, kim_wooglae, Mehmet_Oguz_Derin, MikeSmith, Myles, Paul_Adenot, Riju, Shuangting_Yao, Stefan_Holmer, Takio_Yamaoka, Tzviya_Siegman, Youenn, Yutaka_Hirano, Yves_Lafon
Regrets
-
Chair
tidoust
Scribe
dom

Meeting minutes

Slides

Francois: at the origin of this breakout session, there was a machine learning workshop organized by Dom in September
… the topic of efficiency issues with real-time media processing was raised by several speakers
… Bernard Aboba in particular mentioned the cost of memory copies in that context
… Likewise, Tero mentioned that in the context of music processing with ML, moving bytes around takes as much processing time as doing the actual processing
… This is not a ML-specific issue
… the GPU for the Web group had similar conversations on the topic last week
… this issue spans multiple groups
… as a result, it may not have a clear owner
… which is why we're convening this conversation
… would like to start by introducing the situation as I understand it
… then I want us to discuss, with a goal of clarifying whether everything is already under control or if instead we need some coordination effort somewhere
… to reason about this, I thought I would start by trying to represent the different components involved in memory copies
… a very rough, incomplete, and possibly wrong visualization
… we can split memory between CPU & GPU (generally physically different)
… which means that data needs to go from one to another depending on which unit does the processing
… if you add the browser to this landscape, it manages JS & WASM on the CPU, while the GPU is under the control of WebGL / WebGPU
… in JS, there are various threads (incl via workers)
… and then browsers will interact with various pieces of hardware and external devices, incl hardware encoders/decoders
… all of these blocks need to communicate with the browser as a mediator
… which means memory copies as soon as one of the boundaries gets crossed
… in a non-optimized version at least
… Memory copies are needed to transfer across boundaries (some physical, some not but may be linked to security checks)
… copies may be needed due to difference of structures (e.g. buffers in JS vs WASM, RGBA vs YUV)

Francois: sometimes, there is a need for a copy to preserve a behavior invariant
… sometimes, a copy leads to a better API design
… sometimes, a copy doesn't matter from a performance perspective
… What would it take to reduce memory copies? This would require enabling direct access in a given pipeline
… e.g. allow full media processing in the GPU
… there are already mechanisms in place to help: SharedArrayBuffer, transferable interfaces
… and a number of opaque interfaces where bytes aren't exposed, allowing browsers to optimize memory handling (e.g. MediaStreamTrack, opaque frames in WebCodecs, nodes in Web Audio)
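
[A minimal sketch of the transfer mechanism Francois mentions: an ArrayBuffer handed to a worker via postMessage with a transfer list, which moves the memory instead of copying it. The worker script name is hypothetical.]

    // Main thread: hand the buffer to a worker with zero copy.
    // After postMessage, "buf" is detached (its byteLength becomes 0).
    const worker = new Worker('process.js'); // hypothetical worker script
    const buf = new ArrayBuffer(1024 * 1024);
    worker.postMessage({ frame: buf }, [buf]);

    // Worker (process.js): ownership of the buffer now lives here.
    self.onmessage = (e) => {
      const view = new Uint8Array(e.data.frame);
      // ... process bytes in place; no copy was made on the way in ...
    };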

Francois: giving the floor to Bernard to share an example where memory copies show up

Bernard: the use case I wanted to highlight is the gallery view that has become popular in teleconferencing, esp in the context of education
… native apps go up to 7x7 in gallery views, where web apps are limited to 4x4
… the bandwidth is not the issue - the bottleneck is in the receive -> display path
… implementing this natively, we've been able to use a full-GPU processing pipeline from reception onwards
… this enables 7x7 gallery at 30FPS
… each copy added in the pipeline reduces the gallery size - 1 copy → 5x5, 2 copies → 4x4, 3 copies (which is what we have with WebTransport today) → 3x3
… here the memory operation has a direct impact on performance
… this involves no ML processing
… (background blur would happen on the sending side, not receiving)

Francois: this illustrates a very practical impact of memory copies
… There are various discussions & proposals linked to this topic across various groups
… linked to Streams, WebTransport, WebAssembly, WebGPU, WebCodecs, WebRTC
… some started a long time ago - may be worth reviewing
… there may be other ideas to consider - e.g. allow fetching directly into GPU memory? Allow declaring a media pipeline to enable memory optimization by the UA
… this concludes my presentation - I think it would be useful to identify scenarios, figure out whether they're being addressed or not
… it may be that the most interesting scenarios only cross one boundary and can be addressed on a group-to-group basis
… or maybe that's just implementation considerations
… or maybe we need more coordination - which the participation here seems to suggest

ChrisC: thank you for that overview - want to share a WebCodecs perspective here
… the ML issues that were the seed of this conversation - WebCodecs will help
… getting frames out of video elements and the need to convert back via canvas, with RGB conversion... WebCodecs makes all of this better and allows skipping canvas altogether, as well as the RGB->YUV conversion
… WebCodecs is facing hard problems with WASM copies - WebCodecs has a VideoFrame primitive
… which allows copying the planes out into an ArrayBuffer, which for WASM means wrapping it in a heap and copying it
… due to security concerns
… if you can mutate the data, this creates risks for codecs that don't expect mutations
… could we have some interface for a buffer that we would read into but that, once done, cannot be modified?
… We've been told this is very challenging both in JS & WASM worlds
… and so probably not coming immediately
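
[A sketch of the copy Chris describes, assuming the VideoFrame.copyTo() method: planes are read into a view over WASM linear memory, a copy that is currently unavoidable because the WASM heap stays writable afterwards. Offsets and sizing are simplified.]

    // "frame" is a WebCodecs VideoFrame, e.g. from a VideoDecoder output callback.
    const memory = new WebAssembly.Memory({ initial: 256 }); // 256 x 64KiB pages
    async function frameIntoWasm(frame, byteOffset) {
      const size = frame.allocationSize();
      const dest = new Uint8Array(memory.buffer, byteOffset, size);
      const layout = await frame.copyTo(dest); // one full copy of the pixel data
      frame.close();
      return layout; // plane offsets/strides within dest
    }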

KenRussell: (Chrome team) I confirm this is a very hard problem - it needs memory-protecting a segment of the WASM memory
… this would require rearchitecting the WASM engines; slicing up WASM memory to make parts read-only is hard, and OS-dependent
… ArrayBuffers are transferable, by design, to allow zero-copy across Web workers
… the recycling path would have to be redesigned

BenSmith: in WASM, there is access to one type of memory, which serves several purposes: it is the memory for the language being run (C++, Rust)
… adding another memory to WASM would mean adding support in the underlying language
… Accessing directly through a static memory index is one way, but there are other ways
… it's possible that you could access that memory as a dynamic memory object
… not sure there is a way to do that
… a third way is to take that one memory, use it as address space which can then be transferred
… but that has complexity as well
… part of it is complexity of implementation, part of it is architecture

Yutaka: the ArrayBuffer given to the buffer reader is allocated as a shared memory buffer, not a local memory buffer - we need some specification that allows optimization (?)

Adam_Rice: (Google), working on streams with Yutaka - if we support SharedArrayBuffer, it will be visible to developers, and therefore needs standardization work
… but it's difficult to do safely
… by safely I mean protecting the C++ code from data races
… (the C++ browser code)

Francois: how much is an implementation problem vs specification?

Chai: (WebNN API) I have a question wrt WebCodecs and its applicability to ML
… from the ML perspective, esp on GPU, the way data is consumed through GPU buffers, not necessarily through textures
… incl for historical reasons, with the different kinds of swizzling (?) patterns done in their own hardware
… GPU buffers are the currency of ML data going into the compute engine
… I heard about WebCodecs implying that the process of decode can be done into the memory texture
… I'm wondering whether that can also be done into GPU buffers, because ML processing of video streams or frames can require many kinds of transforms
… (color space, formats, ...)
… ML typically uses normalized floats
… if the conversion is not done the right way, it will cause many copies
… also, depending on the destination, this can require more copies
… e.g. for computer vision
… What is the destination? How does it do it? What are the thoughts around producing the data into reusable forms for ML?

Dan_Sanders: (working on WebCodecs)
… the current implementation of WebCodecs in Chrome uses GPU buffers
… we don't have a way to expose them in a convenient way
… we haven't figured out how to do that yet
… the leading proposal is a texture-based approach, although I realize that's limiting
… the relevant issue is linked from the slide
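
[For illustration, a sketch of what a texture-based path can look like with the WebGPU upload API, which accepts a VideoFrame as a copy source; this is not necessarily the proposal in the linked issue.]

    // "device" is a GPUDevice, "frame" a WebCodecs VideoFrame.
    const texture = device.createTexture({
      size: [frame.displayWidth, frame.displayHeight],
      format: 'rgba8unorm',
      usage: GPUTextureUsage.TEXTURE_BINDING |
             GPUTextureUsage.COPY_DST |
             GPUTextureUsage.RENDER_ATTACHMENT,
    });
    device.queue.copyExternalImageToTexture(
      { source: frame },
      { texture },
      [frame.displayWidth, frame.displayHeight]
    );
    frame.close();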

PaulAdenot: to reiterate things mentioned at previous TPACs and the games workshop, one key API pattern for playing nicely with memory and real-time processing
… is the concept of memory ownership
… in native, if you take the multimedia framework, you see an API where you pass memory in, which is then written into
… not great ergonomics, but the most sensible way to do it with an unopinionated approach
… it would be good to take a similar approach for the Web
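
[A minimal sketch of the ownership pattern Paul describes, as it already exists in the Streams BYOB ("bring your own buffer") reader: the caller passes memory in, the API writes into it, and the same allocation is recycled. The process() function is hypothetical.]

    async function pump(readable) {
      const reader = readable.getReader({ mode: 'byob' });
      let buffer = new ArrayBuffer(64 * 1024);
      while (true) {
        // Ownership of the buffer moves into the read and comes back
        // (as a new view over the same bytes) when the read resolves.
        const { value, done } = await reader.read(new Uint8Array(buffer));
        if (done) break;
        process(value);        // hypothetical per-chunk processing
        buffer = value.buffer; // recycle the allocation for the next read
      }
    }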

<baboba> Question: Is there a way for WebTransport to support receive into a GPU buffer or send from a GPU buffer? This would improve performance when WebTransport is used in concert with WebCodecs.

PaulAdenot: memory copies add up pretty quickly
… for the audio part, it's not so much that we have big objects, but we have an extremely high number of them
… so we touch memory very often, which also piles up
… one thing is to have lower level APIs without fancy syntactic sugar
… and carefully check where the memory is coming from, and whether it can use Float32 / WASM buffers
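
[A sketch of what that careful memory handling can look like in an AudioWorkletProcessor: a Float32Array view over WASM memory is created once and reused for every 128-sample callback. The wasmMemory binding and the call into WASM are assumed to be set up elsewhere.]

    class CopyInProcessor extends AudioWorkletProcessor {
      constructor() {
        super();
        // One view allocated up front; no per-callback allocation.
        this.heap = new Float32Array(wasmMemory.buffer, 0, 128);
      }
      process(inputs, outputs) {
        const input = inputs[0][0]; // first channel, 128 samples
        if (input) {
          this.heap.set(input);         // one copy into WASM memory
          // ... call into WASM to process in place ...
          outputs[0][0].set(this.heap); // one copy back out
        }
        return true;
      }
    }
    registerProcessor('copy-in-processor', CopyInProcessor);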

Jan-Ivar: (WebRTC, WebTransport) +1 to Paul
… we've been talking about sources and sinks for memory copies
… but we need to look at the full pipe chain
… decode, modify, play - each of the nodes needs to access the memory
… do you build the API around the memory optimization path?
… or do you build a declarative pipe chain and leave it to the browser to optimize?
… WebCodecs is not using streams, whereas WebTransport, WebRTC are
… the Streams spec has pipeTo, which allows processing to happen in-parallel-but-not-really
… not clear whether it does allow for the clean API we would like
… if WebCodecs doesn't participate in the declarative API, will the whole approach still work?
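
[A sketch of the declarative style Jan-Ivar contrasts with imperative access: the app declares the pipe chain and the browser is, in principle, free to optimize the buffer handoff between stages. The receiveStream, modify() and decoderWritable names are placeholders.]

    receiveStream                        // e.g. incoming bytes from WebTransport
      .pipeThrough(new TransformStream({
        transform(chunk, controller) {
          controller.enqueue(modify(chunk)); // hypothetical per-chunk step
        },
      }))
      .pipeTo(decoderWritable);          // hypothetical sink wrapping a decoder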

Francois: so we've looked at the technical issues
… I'm hearing we're confident there are scenarios where this needs to be addressed
… do we need a specific coordination to make progress on this? or is it already happening on an ad-hoc basis?

Paul: pair-wise / opportunistic approaches are what we've been doing so far through overlap of participation
… the solution might not be always the same
… but learning from what has been done in other groups has translated well in the past
… there would be value in better API consistency, incl for ease of use for developers
… more could be done

Ken: I've sat in on WebGPU-WebCodecs discussions - they indeed tend to happen pair-wise
… the pattern of passing memory in - it remains to be seen if it applies to all the use cases we have
… would it make sense to create a CG to host these cross-group discussions?

<MikeSmith> https://github.com/WICG/proposals is one possible place

Keith: is there some kind of new primitive, a JS object of some kind, around which we could coordinate?
… may still be difficult

DanS: +1 - getting clarity from other groups beyond happenstance would be great

MikeSmith: one possible place is the one whose URL I pasted on IRC - we have a repository for proposals under the WICG organization
… one lightweight way would be to raise an issue there and to use that as a coordination place
… I don't know if that's the best fit, but that's one of the motivations for this repo

Francois: assuming this would work wrt "where", "who" would be willing to contribute?

<Chai> +1

[ChrisC, Paul, Ken, DanSanders, Chai volunteer]

Ken: more discussion sounds useful - not sure we're at the stage where we can get to a single primitive
… Austin has looked a lot into zero-copy paths into the GPU
… a single discussion place sounds great, with roads toward new ideas / designs

Francois: I agree I haven't heard a silver bullet solution, but interest in exchanging on scenarios

Mike: what the WICG proposals repo expects is problem statements rather than solutions - so despite the name, this would be a good fit there

Francois: summarizing: multiple needs, no easy solution, cross-group collaboration to share ideas and align API designs is needed
… I'll follow up with some of you on how to move forward with this
… thanks a lot for attending

<MikeSmith> I will create a WICG/reducing-memory-copies repo

Minutes manually created (not a transcript), formatted by scribe.perl version 123 (Tue Sep 1 21:19:13 2020 UTC).
