Meeting minutes
<skyostil> Hi, nice to meet you all! Dialing in from Google
Model Loader API
anssik: welcome to the second WebML CG call of 2022! We adjusted the agenda a bit to not overlap with Australia Day.
… We agreed to move to a monthly cadence after this call.
anssik: At our previous meeting on 12 Jan, Honglin delivered an update on the Model Loader API
WebML CG Teleconference – 12 Jan 2022 - meeting minutes
anssik: this presentation included a list of open questions. The plan for today is to continue discussing those
Honglin: updates since the previous meeting; thanks to reviewers of the prototype, we've tried to accelerate IPC by having the renderer talk to the backend directly
… after discussion with Google's security teams, shared buffer is security sensitive and there are some concerns with it per their early review
… guidance is to avoid using shared buffer and use direct IPC
… the delay is about 1.3 ms; moving forward with direct IPC
… most open questions still apply so good to discuss them today
… some float16 issues identified too, need WebNN folks' input
… some resource restrictions also discussed, i.e. what resources the ML service should be able to use
RafaelCintron: question, what did the security people say about using separate process?
… Why not use another thread in the same process?
Honglin: no concerns with process isolation
Honglin: Security didn't request it. We decided to. If we run in the same process, we have to harden TF Lite even more.
Honglin: the backend process can be safer than the renderer, we can disable more system calls
… we think separate process is a reasonable starting point
RafaelCintron: Does the ML service have the same security constraints as the renderer?
Honglin: We have one process per model. Even if a hacker takes over, it's isolated.
… We could put everything in the renderer. It might be more work.
RafaelCintron: In Windows, the renderer is the most secure process.
… There's overhead to having a separate process.
Honglin: It's an open question.
… We have a mature ML service for TF Lite.
… It's maybe less work to use the existing service
RafaelCintron: What if 2 websites load the same model?
Honglin: There are still 2 processes.
RafaelCintron: If you keep everything from the same domain in one process, it should be sufficient. A bad actor can't load 100 models and create 100 processes.
Honglin: We can restrict the number of processes, RAM, and CPUs
Anssik: If you can move more of the security discussion into the public, more people can look at the comments as evidence. Discuss on GitHub or in public if possible.
Benchmark discussion
Anssik: Let's discuss and identify next steps.
Honglin: Not much to share. Improved a lot.
… Model Loader performance is very similar to the raw backend, and much faster than the TF Lite WASM runner, even using the same XNNPACK backend
… for MobileNet image classification
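For context, a rough sketch of what running a MobileNet classification through the Model Loader API might look like. This is a hypothetical illustration: the names (navigator.ml.createContext, MLModelLoader, load, compute), the tensor names, and the model URL are assumptions based on the explainer discussion, not a settled API.

```typescript
// Hypothetical ambient declarations for the in-flux Model Loader API surface.
declare class MLModelLoader {
  constructor(context: unknown);
  load(model: ArrayBuffer): Promise<MLModel>;
}
declare class MLModel {
  compute(inputs: Record<string, Float32Array>): Promise<Record<string, Float32Array>>;
}

// Classify an image with a MobileNet TFLite model via the Model Loader API.
// The model URL, context options, and 'input'/'output' tensor names are placeholders.
async function classify(pixels: Float32Array): Promise<Float32Array> {
  const modelBuffer = await (await fetch('mobilenet_v2.tflite')).arrayBuffer();

  // The browser compiles and runs the model in its ML service process
  // (per the prototype discussed above), not in the renderer.
  const context = await (navigator as any).ml.createContext({ devicePreference: 'cpu' });
  const loader = new MLModelLoader(context);
  const model = await loader.load(modelBuffer);

  // Inputs and outputs are named tensors.
  const outputs = await model.compute({ input: pixels });
  return outputs['output']; // class scores
}
```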
RafaelCintron: General question: since everything runs on the CPU, why not all with WASM? What's missing from WASM?
… Is it just a matter of more or wider SIMD instructions?
… Or is there something preventing WASM from doing better?
Honglin: We talked to the WASM team and haven't figured it out for sure. There are CPU instructions that can't be supported in WASM. There are also compilation issues.
… When WASM is going to run, it has a JIT step, even if it's precompiled.
… That's the best guess. Not sure of exact reason
Ningxin: Honglin's assessment is good. WASM SIMD instructions support 128-bit widths. Honglin's test platform supports 256-bit widths, which may explain the difference.
RafaelCintron: If we had 256-bit SIMD in WASM, would that close the gap?
Ningxin: Flexible SIMD instructions in WASM will support wider instructions. We can check with them with the early prototype for their performance numbers.
Hardware acceleration
Anssi: What will the interface look like when hardware acceleration is supported? What if data is on a GPU or TPU? Should users know what accelerators are available? Performance vs privacy tradeoff.
… WebNN is thinking through the same issues. We can share learnings.
Honglin: Not too many updates. We have platform experts joining this group. Maybe they have updates.
jimpollock: I work with Sami and Honglin on hardware acceleration. Mostly on deployment: taking web ML code and plumbing it through to hardware acceleration.
… We use the WebNN API on the backend. Targeting media devices with GPU. Starting with CPU to get the flow working.
… How do we expose acceleration to the developer?
… A few options: do we allow partitioned execution, where the accelerator runs what it can and delegates the rest to the CPU? That could be less optimal than doing it all on the CPU.
… Or do we give more info to the user, like what accelerators are supported, or even what ops are supported? But there are fingerprinting concerns with that.
… Thinking through design choices. Working on plumbing to get benchmarks.
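To make the two design options above concrete, here is a hypothetical sketch of how they might surface to web developers. Neither shape exists in any spec: the devicePreference/powerPreference options and the getCapabilities() method are made up for illustration.

```typescript
// Option A: "transparent" partitioning. The page only states a preference and
// the browser decides which ops run on an accelerator and which fall back to
// the CPU. Nothing about the hardware is exposed to the page.
const ctx = await (navigator as any).ml.createContext({
  devicePreference: 'auto',      // hypothetical: let the implementation decide
  powerPreference: 'low-power',
});

// Option B: explicit discovery. The page can ask which accelerators (or even
// which ops) are available, which helps it pick a model variant but exposes
// fingerprintable hardware details. getCapabilities() is entirely made up.
const caps = await (navigator as any).ml.getCapabilities?.();
if (caps?.accelerators?.includes('npu')) {
  // e.g. choose a model variant whose ops the accelerator supports end to end
}
```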
Anssi: WebNN is also discussing similar issues.
Chai: Is WebNN also available in ChromeOS?
jimpollock: Yes. Rather than coming up with a new paradigm, we decided to use WebNN for some stability against breaking changes in TF Lite.
… It's been ported from Android. We disabled some features like hardware buffers. A full port.
Chai: For the purposes of this prototype, you're using the CPU backend for the API, but are thinking of GPU?
jimpollock: There's no driver for the GPU. Just using CPU to prove the integration. Then will run on accelerators with full end-to-end solution.
<skyostil> Minor correction: Jim is referring to NNAPI (i.e., the Android ML hardware adaptation layer) instead of WebNN
jimpollock: We'll run the same benchmarks as Honglin and see.
Honglin: We can partition the graph with heuristics.
… But there may be fingerprinting issues if we let the developer or web apps decide how.
… Transparent vs explicit is the tradeoff.
RafaelCintron: For accelerators you're targeting, Jim, how does data transfer work?
… Say you're getting 2 MB. Can you give a pointer to CPU memory? Do you have to copy it before you can use it? Do you have to copy it back before looking at it?
jimpollock: It varies from accelerator to accelerator.
… It runs in the TF runtime in the ML Service binary executable process. The NNAPI delegate walks the graph and calls the methods to build the NNAPI representation, including tensors.
… Then it crosses the boundary to the vendor implementation. Depends. Can be memory copy or shared representation.
… Some have onboard memory. Examples of both. Vendors keep some details secret.
… Not a single answer. Could happen in any permutation depending on vendor driver implementation.
RafaelCintron: You give them a pointer to memory and it's up to them to operate directly or copy.
jimpollock: Yes. We have some of the same security issues Honglin mentioned.
… We need to make sure we won't have shared-memory concerns.
… We may still need memory copies depending on what the security team says.
Chai: Trying to understand how we achieve hardware portability across different accelerators.
… You're testing with NNAPI. Drivers are CPU. How does that work across different hardware?
… For WebNN, the whole thing is about defining the interface that anyone can implement on any accelerator in the platform
… If you run the same model and the same content, regardless of hardware you run on, you can accelerate
… Is it because we want to understand through the benchmark before taking the next step? How does it connect with hardware portability on different systems?
jimpollock: For Chrome OS or more general?
Chai: Looking for education. That's why asking about NNAPI on Chrome OS. The driver might be different. Does it come with Chrome OS?
… Obviously different PCs have different hardware. How does it target different vendor GPUs?
jimpollock: Consider different platforms. On a simple system with no accelerators, NNAPI wouldn't be enabled. There'd be no delegation.
… If a system has an accelerator and the vendor has provided a driver, we'd enable NNAPI.
… Then when a model is processed by TFLite or the inferencing system, it queries the delegate, which reports which parts of the model it can process.
… Then partitioning comes up: do we support only full models, or partitioned delegation?
… Portability comes from the combination of hardware availability and an installed driver; the inferencing engine takes that into account when setting up the interpreter and delegates as directed.
… More complex: multiple accelerators.
… This gets specific to NNAPI. It can be done; you can partition across multiple accelerators. In practice we don't see it often; it usually gives less performance.
… Marshaling data back and forth between kernels is slow.
… It's a dynamic system: it looks at what's available and installed, and decides what to process where based on the caller's constraints for partitioned execution.
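As a schematic of the delegation flow Jim describes (the delegate reports which nodes it can handle, the runtime partitions, and unsupported nodes stay on the CPU), here is an illustrative sketch. The types and helper names are made up and are not the TF Lite or NNAPI APIs.

```typescript
// Illustrative partitioning flow, loosely following the NNAPI-delegate model.
interface Node { id: number; op: string; }
interface Delegate {
  name: string;
  supports(node: Node): boolean;   // e.g. constrained by op type or mode
}

function partition(graph: Node[], delegate: Delegate | null) {
  const accelerated: Node[] = [];
  const cpu: Node[] = [];
  for (const node of graph) {
    // No driver installed => no delegate => everything runs on the CPU.
    if (delegate && delegate.supports(node)) accelerated.push(node);
    else cpu.push(node);
  }
  // A real runtime would also weigh the cost of marshaling tensors back and
  // forth between CPU and accelerator kernels before accepting a partition.
  return { accelerated, cpu };
}
```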
Chai: Makes sense. The solution is specific to Chrome OS, not the PC, where you can't determine what accelerator is running.
… For instance, a PC running a Chrome browser. That's a common environment. Most PCs have accelerators, a GPU.
… If the solution is specific to Chrome OS, vendors may put together different drivers.
jimpollock: I'm giving the specifics for Chrome OS because I know it. There's an abstraction layer like OpenGL: your inferencing engine speaks OpenGL and the vendor library implements it.
… NNAPI plays that role in Chrome OS
… NNAPI is like OpenGL, a common interface. There are various extensions. They can be used if available.
… I don't know the Microsoft equivalent, but I expect it's similar.
Chai: The purpose of WebNN as a spec is to define the abstraction for ML that's higher than OpenGL.
… That's why the spec was written that way. Then we can accelerate ML differently than regular shaders.
… The abstraction for the NNAPI solution is what I'm trying to understand
… To build something that works for Chromium, not only on Chrome OS, the abstraction would look very much like WebNN, not NNAPI.
… In a sense they're the same. Operator-based abstraction.
… You've answered the question. NNAPI is used to understand how to layer.
jimpollock: I agree NNAPI doesn't exist everywhere. Not on Windows. DirectML might be the equivalent arbitrator
… Each platform would have its equivalent.
Chai: The way we think about DirectML (I run the DirectML group at Microsoft)
… Our goal is to have an abstraction that's not specific to Windows. Model Loader could sit on top of WebNN, so an OS implementation could sit underneath.
Honglin: If only using GPU, that can be done. We want to support more accelerators, which is harder.
… We have to start somewhere.
… NNAPI is a good starting point.
RafaelCintron: Yes, it's true WebNN accepts WebGL or WebGPU. That doesn't mean it has to be only GPU.
… We could add accelerator adaptors later
… Then you can instantiate, and some features might not be available. We can add overloads that support different kinds of buffers. They can be accelerator specific.
… There's still a lot of value in looking at it.
Honglin: A lot of PCs have powerful GPUs. On mobile, the ML accelerator may be more valuable.
… If the ML accelerator buffers can be used by WebNN, Model Loader can also use them
… What do you think?
RafaelCintron: We could do it a couple of ways: the same overloads, or new ones for other kinds of buffers, whether accelerator buffers or array buffers.
… Run in a special process instead of CPU. Could use NNAPI under the covers.
… When Ningxin and Chai specified the operators, they looked at other platforms, not only Windows. They made sure the ops could be implemented, including on NNAPI.
… If not, we can revise them.
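A sketch of the overload idea Rafael describes, where the same compute call could accept plain array buffers or accelerator-resident buffers. MLAcceleratorBuffer and both signatures are hypothetical, purely to illustrate the shape.

```typescript
// Hypothetical: the same model object accepts either CPU-side buffers or
// opaque accelerator buffers, so acceleration can be added later without
// changing the rest of the API. MLAcceleratorBuffer is a made-up placeholder.
interface MLAcceleratorBuffer {
  readonly byteLength: number;   // the data itself stays on the device
}

interface MLModelCompute {
  // CPU path: inputs are copied (or shared) from ArrayBuffer-backed views.
  compute(inputs: Record<string, Float32Array>): Promise<Record<string, Float32Array>>;
  // Accelerator path: inputs already live on the device, avoiding copies.
  compute(inputs: Record<string, MLAcceleratorBuffer>): Promise<Record<string, MLAcceleratorBuffer>>;
}
```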
Chai: When we started WebNN, the goal was to define the abstraction for ML accelerators. We're not setting out to build something specific to CPU or GPU.
… It's a portable abstraction across hardware.
… We acknowledge that the web stack is sitting on an ecosystem with a lot of different hardware.
… MacOS and Windows are moving all the time. We want to avoid defining something only valid for today's hardware.
… DirectML is an implementation at one point in time on Windows. We want to support more accelerators in the ecosystem.
… The abstraction is separate from the implementation. The abstraction is what we aim for.
Honglin: Great point. For Model Loader, we leave the portability work to the backend; it's the backend's responsibility. TF has already put a lot of effort into making it portable.
… What I mean is, when the accelerator is available to WebNN, it's the same that's available to Model Loader.
… We're facing the same issue. For TF and ONNX, the model format is very similar to what the WebNN API is doing: defining the computational graph and the tensors.
Chai: Not exactly. On the surface, WebNN has convolution, just as ONNX and TF do.
… WebNN is designed to be the backend, and not the frontend interface.
… If NNAPI is an implementation that's on the move, that's fine. NNAPI drivers are not available everywhere.
… For testing and proving the API, it's fine
… As we do more, let's keep in mind that we want an abstraction like WebNN.
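For contrast with handing the browser a whole model file, here is a minimal WebNN-style graph built op by op, roughly following the MLGraphBuilder shape in early WebNN drafts. The operand descriptor fields and the build call are approximate and may differ from the final spec.

```typescript
// Approximate WebNN-style graph construction: the graph is defined at the
// operator level, and the backend (DirectML, NNAPI, or a CPU kernel library)
// maps each op to whatever the hardware supports.
declare const MLGraphBuilder: any; // placeholder for the real interface

const context = await (navigator as any).ml.createContext();
const builder = new MLGraphBuilder(context);

const input = builder.input('input', { type: 'float32', dimensions: [1, 224, 224, 3] });
const filter = builder.constant(
  { type: 'float32', dimensions: [32, 3, 3, 3] },
  new Float32Array(32 * 3 * 3 * 3) // weights would come from the model
);
const conv = builder.conv2d(input, filter);
const relu = builder.relu(conv);
const graph = await builder.build({ output: relu });
```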
Honglin: ML accelerators are in rapid development. They may change a lot in the future.
… It's hard to cover all cases. GPUs are more mature than ML accelerators.
… We're aware NNAPI is not available everywhere. That's where we're starting for Chrome OS.
Ningxin: Regarding NNAPI, I agree.
… For the WebNN spec, in the working group, we took NNAPI as an implementation backend option when checking each operator's compatibility.
… We test against DirectML and NNAPI on Android or ChromeOS.
… I have an ask for help: for some issue discussions, we have questions about how ops can be supported by NNAPI.
… We'd like to be connected to the right people.
… Can you help?
jimpollock: You're looking for understanding of whether a WebNN op spec would be supported by NNAPI? Yes, we can help review and see what the mapping is like.
… A lot of ops have constraints. There may be certain modes that are supported, and others not
Ningxin: We also have the webnn-native project for evaluation.
Ningxin: We have a PR to add NNAPI backend. We have questions about implementability.
… We can summarize and ask for your input.
<ningxin_hu> webnn-native nnapi backend PR: https://
anssik: HUGE thanks to Jonathan for scribing, great job!
… we'll defer versioning, streaming input and model format discussion topics to our next call
… we agreed to have our next call one month from now, but Jonathan and Honglin please share if you'd like to have our next call sooner, two weeks from now