W3C

– DRAFT –
WebML CG Teleconference – 9 Feb 2022

09 February 2022

Attendees

Present
Anssi_Kostiainen, chai, Chai_Chaoweeraprasit, Ganesan_Ramalingam, Honglin_Yu, Jim_Pollock, Jon_Napper, Jonathan_Bingham, Ningxin_Hu, qjw, Rafael_Cintron, Sam_Bhattacharyya, Sami_Kyostila, Saurabh, Tom_Shafron, tyuvr
Regrets
-
Chair
Anssi
Scribe
Anssi, anssik, Jonathan, Jonathan_Bingham

Meeting minutes

<skyostil> Hi, nice to meet you all! Dialing in from Google

Model Loader API

anssik: welcome to the second WebML CG call of 2022! We adjusted the agenda a bit to avoid overlapping with Australia Day.
… We agreed to move to a monthly cadence after this call.

anssik: At our previous meeting on 12 Jan, Honglin delivered an update on the Model Loader API

WebML CG Teleconference – 12 Jan 2022 - meeting minutes

anssik: this presentation included a list of open questions. The plan for today is to continue discussing those

Honglin: updates since the previous meeting, thanks to the reviewers of the prototype; we've tried to accelerate IPC by having the renderer talk to the backend directly
… after discussion with Google's security teams, a shared buffer is security sensitive and their early review raised some concerns about it
… the guidance is to avoid using a shared buffer and use direct IPC instead
… we measured about a 1.3 ms delay, so we're moving forward with direct IPC
… most open questions still apply, so it's good to discuss them today
… some float16 issues were identified too; we need input from the WebNN folks
… some resource restrictions were also discussed, i.e. what resources the ML service should be able to use

RafaelCintron: question, what did the security people say about using a separate process?
… Why not use another thread in the same process?

Honglin: no concerns with process isolation

Honglin: Security didn't request it. We decided to. If we run in the same process, we have to harden TF Lite even more.

Honglin: the backend process can be safer than the renderer, we can disable more system calls
… we think separate process is a reasonable starting point

RafaelCintron: Does the ML service have the same security constraints as the renderer?

Honglin: We have one process per model. Even if a hacker takes over, it's isolated.
… We could put everything in the renderer. It might be more work.

RafaelCintron: On Windows, the renderer is the most secure process
… There's overhead to having a separate process.

Honglin: It's an open question.
… We have a mature ML service for TF Lite.
… It's maybe less work to use the existing service

RafaelCintron: What if 2 websites load the same model?

Honglin: There are still 2 processes.

RafaelCintron: If you keep everything from the same domain in one process, it should be sufficient. A bad actor can't load 100 models and create 100 processes.

Honglin: We can restrict the number of processes, RAM, and CPUs

Anssik: If you can move more of the security discussion into the public, more people can look at the comments as evidence. Discuss in GitHub or in public if possible.

Benchmark discussion

Anssik: Let's discuss and identify next steps.

Honglin: Not much to share. Improved a lot.
… Model Loader performance is very similar to the raw backend, and much faster than the TF Lite WASM runner, even when using the same XNNPACK backend
… for MobileNet image classification
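For illustration, here is a rough TypeScript sketch of how a per-inference latency comparison like this could be timed in the browser; the model and runner objects are hypothetical placeholders, not the actual Model Loader or TF Lite WASM interfaces.

```ts
// Hypothetical benchmarking helper: averages the latency of an async
// inference call. `performance.now()` is a real browser API; the runner
// passed in (e.g. a Model Loader model or a TF Lite WASM interpreter) is
// a placeholder for whatever is being measured.
async function meanLatencyMs(
  runInference: () => Promise<unknown>,
  iterations = 50
): Promise<number> {
  await runInference(); // warm-up, so one-time compilation isn't counted
  const start = performance.now();
  for (let i = 0; i < iterations; i++) {
    await runInference();
  }
  return (performance.now() - start) / iterations;
}

// Usage (names hypothetical):
//   const loaderMs = await meanLatencyMs(() => model.compute(inputs));
//   const wasmMs = await meanLatencyMs(() => wasmRunner.run(inputs));
```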

RafaelCintron: General question: since everything runs on the CPU, why not all with WASM? What's missing from WASM?
… Is it just a matter of more or wider SIMD instructions?
… Or is there something preventing WASM from doing better?

Honglin: We talked to the WASM team and haven't figured it out for sure. There are CPU instructions that can't be supported in WASM. There are also compilation issues.
… When WASM is going to run, it has a JIT step, even if it's precompiled.
… That's the best guess. Not sure of exact reason

Ningxin: Honglin's assessment is good. WASM SIMD instructions support 128-bit widths. The platform Honglin tested on supports 256-bit widths, which may explain the difference.

RafaelCintron: If we had 256-bit SIMD in WASM, would that close the gap?

Ningxin: Flexible SIMD instructions in WASM will support wider vectors. We can check with that team, using the early prototype, for their performance numbers.
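As a side note, pages can already feature-detect the current fixed-width (128-bit) WASM SIMD at runtime. The sketch below uses the wasm-feature-detect library; flexible/wider SIMD is still a proposal, so there is no corresponding check for 256-bit vectors yet.

```ts
// Sketch: choose a code path based on whether the engine supports 128-bit
// WASM SIMD, using the wasm-feature-detect npm package.
import { simd } from 'wasm-feature-detect';

async function pickWasmBackend(): Promise<'wasm-simd' | 'wasm-scalar'> {
  return (await simd()) ? 'wasm-simd' : 'wasm-scalar';
}
```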

Hardware acceleration

Anssi: What will the interface look like when hardware acceleration is supported? What if data is on a GPU or TPU? Should users know what accelerators are available? Performance vs privacy tradeoff.
… WebNN is thinking about the same issues. We can share learnings.

Honglin: Not too many updates. We have platform experts joining this group. Maybe they have updates.

jimpollock: I work with Sami and Honglin on hardware acceleration. Mostly on deployment: taking WebML code and plumbing it through to hardware acceleration.
… We use the WebNN API on the backend. Targeting media devices with a GPU. Starting with CPU to get the flow working.
… How do we expose acceleration to the developer?
… A few options: do we allow partitioned execution, where the accelerator runs what it can and delegates the rest to the CPU? That could be less optimal than doing everything on the CPU.
… Or do we give more info to the user, like what accelerators are supported, or even what ops are supported? But there are fingerprinting concerns with that.
… Thinking through design choices. Working on plumbing to get benchmarks.
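For concreteness, here is a hypothetical TypeScript sketch of the two design options Jim outlines; none of these interface or property names exist in the Model Loader or WebNN proposals.

```ts
// Option A: transparent partitioning. The caller only states a preference
// and the browser decides which ops run on an accelerator and which fall
// back to the CPU. (All names here are hypothetical.)
interface TransparentLoadOptions {
  devicePreference?: 'cpu' | 'accelerator' | 'auto';
}

// Option B: explicit capabilities. The page can enumerate accelerators and
// possibly the ops they support, which is more expressive but raises the
// fingerprinting concerns mentioned in the discussion.
interface AcceleratorInfo {
  kind: 'cpu' | 'gpu' | 'npu';
  supportedOps?: string[];
}
interface MLCapabilities {
  enumerateAccelerators(): Promise<AcceleratorInfo[]>;
}
```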

Anssi: WebNN is also discussing similar issues.

Chai: Is WebNN also available in ChromeOS?

jimpollock: Yes. Rather than coming up with a new paradigm, we decided to use WebNN for some stability against breaking changes in TF Lite.
… It's been ported from Android. We disabled some features like hardware buffers. A full port.

Chai: For the purposes of this prototype, you're using the CPU backend for the API, but are thinking of GPU?

jimpollock: There's no driver for the GPU. Just using CPU to prove the integration. Then will run on accelerators with full end-to-end solution.

<skyostil> Minor correction: Jim is referring to NNAPI (i.e., the Android ML hardware adaptation layer) instead of WebNN

jimpollock: Run with the same benchmarks as Honglin and see.

Honglin: We can partition the graph with heuristics.
… But there may be fingerprinting issues if we let the developer or web apps decide how.
… Transparent vs explicit is the tradeoff.

RafaelCintron: For accelerators you're targeting, Jim, how does data transfer work?
… Say you're getting 2 MB. Can you give a pointer to CPU memory? Do you have to copy it before you can use it? Do you have to copy it back before looking at it?

jimpollock: It varies from accelerator to accelerator.
… It runs in the TF runtime in the ML Service binary executable process. The NNAPI delegate walks the graph and calls all the methods for the NNAPI representation including tensors.
… Then it crosses the boundary to the vendor implementation. Depends. Can be memory copy or shared representation.
… Some have onboard memory. Examples of both. Vendors keep some details secret.
… Not a single answer. Could happen in any permutation depending on vendor driver implementation.

RafaelCintron: You give them a pointer to memory and it's up to them to operate directly or copy.

jimpollock: Yes. We have some of the same security issues Honglin mentioned.
… We need to make sure we won't have shared memory concerns.
… May still need mem copies depending on what the security team says.

Chai: Trying to understand how we achieve hardware portability across different accelerators.
… You're testing with NNAPI. The drivers are CPU. How does that work across different hardware?
… For WebNN, the whole thing is about defining the interface that anyone can implement for any accelerator in the platform
… If you run the same model and the same content, regardless of hardware you run on, you can accelerate
… Is it because we want to understand through the benchmark before taking the next step? How does it connect with hardware portability on different systems?

jimpollock: For Chrome OS or more general?

Chai: Looking for education. That's why asking about NNAPI on Chrome OS. The driver might be different. Does it come with Chrome OS?
… Obviously different PCs have different hardware. How does it target different vendor GPUs?

jimpollock: Consider different platforms. On a simple system with no accelerators, NNAPI wouldn't be enabled. There'd be no delegation.
… If a system has an accelerator and the vendor has provided a driver, we'd enable NNAPI.
… When a model is processed by TFLite or another inferencing system, it queries the delegate, which reports which parts of the model it can process.
… Then partitioning comes up. Do we support only full models, or partitioned delegation?
… Portability comes from the combination of hardware availability and an installed driver; the inferencing engine takes that into account when setting up the interpreter and delegates as directed.
… More complex: multiple accelerators.
… This gets specific to NNAPI. It can be done; you can partition across multiple accelerators. In practice we don't see it often; it usually gives worse performance.
… Marshaling data back and forth between kernels is slow.
… It's a dynamic system: it looks at what's available and installed, and decides what to process where based on the caller's constraints for partitioned execution.
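A conceptual TypeScript sketch of the partitioning decision described here; this is illustrative only, not the actual TFLite/NNAPI delegate API.

```ts
// Illustrative partitioning: the delegate reports which ops it can handle;
// unsupported ops stay on the CPU. If partitioned execution is not allowed,
// the whole graph falls back to the CPU unless the delegate supports all ops.
interface Op {
  name: string;
}

function partitionGraph(
  graph: Op[],
  delegateSupports: (op: Op) => boolean,
  allowPartitioned: boolean
): { accelerator: Op[]; cpu: Op[] } {
  const accelerator = graph.filter(delegateSupports);
  const cpu = graph.filter((op) => !delegateSupports(op));
  if (!allowPartitioned && cpu.length > 0) {
    return { accelerator: [], cpu: graph };
  }
  return { accelerator, cpu };
}
```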

Chai: Makes sense. The solution is specific to Chrome OS, not to PCs, where you can't determine what accelerator is running.
… For instance, a PC running a Chrome browser. That's a common environment. Most PCs have accelerators, a GPU.
… If the solution is specific to Chrome OS, vendors may put together different drivers.

jimpollock: I'm giving the specifics for Chrome OS because I know it. There's an abstraction layer like OpenGL: your inferencing engine speaks OpenGL and the vendor library implements it.
… NNAPI plays that role in Chrome OS
… NNAPI is like OpenGL, a common interface. There are various extensions. They can be used if available.
… Don't know the Microsoft equivalent, expect it's similar.

Chai: The purpose of WebNN as a spec is to define the abstraction for ML that's higher than OpenGL.
… That's why the spec was written that way. Then we can accelerate ML differently than regular shaders.
… The abstraction for the NNAPI solution is what I'm trying to understand.
… To build something that works for Chromium, not only on Chrome OS, the abstraction would look very much like WebNN, not NNAPI.
… In a sense they're the same. Operator-based abstraction.
… You've answered the question. NNAPI is used to understand how to layer.

jimpollock: I agree NNAPI doesn't exist everywhere. It's not on Windows. DirectML might be the equivalent arbitrator there.
… Each platform would have its equivalent.

Chai: The way we think about DirectML (I run the DirectML group at Microsoft)
… Our goal is to have an abstraction that's not specific to Windows. Model Loader could sit on top of WebNN, with an OS implementation underneath.

Honglin: If only using GPU, that can be done. We want to support more accelerators, which is harder.
… We have to start somewhere.
… NNAPI is a good starting point.

RafaelCintron: Yes, it's true WebNN accepts WebGL or WebGPU. That doesn't mean it has to be only GPU.
… We could add accelerator adaptors later
… Then you can instantiate and some features might not be available. We can add overloads that support different kinds of buffers. They can be accelerator specific.
… There's still a lot of value in looking at it.

Honglin: A lot of PCs have powerful GPUs. On mobile, the ML accelerator may be more valuable.
… If the ML accelerator buffers can be used by WebNN, Model Loader can also use them
… What do you think?

RafaelCintron: We could do it a couple ways. The same overloads, or new ones for other kinds of buffers. Accelerator buffers. Or array buffers.
… Run in a special process instead of CPU. Could use NNAPI under the covers.
… When Ningxin and Chai specified the operators, they looked at other platforms, not only Windows. They made sure the ops could be implemented, including on NNAPI.
… If not, we can revise them.
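For reference, a rough sketch of the operator-level graph building Rafael and Chai describe, in the general shape of the WebNN API draft; exact method names and signatures have changed across drafts, so treat this as approximate rather than normative.

```ts
// Minimal ambient declarations, since WebNN types are not in lib.dom.
declare const MLGraphBuilder: any;

// Build a tiny conv + relu graph. Input and output buffers are plain typed
// arrays today; accelerator-specific buffer overloads could be added later,
// as discussed above.
async function buildTinyGraph() {
  const context = await (navigator as any).ml.createContext({ devicePreference: 'gpu' });
  const builder = new MLGraphBuilder(context);

  const input = builder.input('input', { type: 'float32', dimensions: [1, 3, 224, 224] });
  const filter = builder.constant(
    { type: 'float32', dimensions: [32, 3, 3, 3] },
    new Float32Array(32 * 3 * 3 * 3)
  );
  const conv = builder.relu(builder.conv2d(input, filter));
  return builder.build({ output: conv });
}
```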

Chai: When we started WebNN, the goal was to define the abstraction for ML accelerators. We're not setting out to build something specific to CPU or GPU.
… It's a portable abstraction across hardware.
… We acknowledge that the web stack is sitting on an ecosystem with a lot of different hardware.
… MacOS and Windows are moving all the time. We want to avoid defining something only valid for today's hardware.
… DirectML is an implementation at one point in time on Windows. We want to support more accelerators in the ecosystem.
… The abstraction is separate from the implementation. The abstraction is what we aim for.

Honglin: Great point. For Model Loader, we push the portability question down to the backend. It's the responsibility of the backend. TF has already put a lot of effort into making it portable.
… What I mean is, when the accelerator is available to WebNN, it's the same that's available to Model Loader.
… We're facing the same issue. The TF and ONNX formats are very similar to what the WebNN API is doing: defining the computational graph and the tensors.

Chai: Not exactly. On the surface, WebNN has convolution, and so do ONNX and TF.
… WebNN is designed to be the backend, and not the frontend interface.
… If NNAPI is an implementation that's on the move, that's fine. NNAPI drivers are not available everywhere.
… For testing and proving the API, it's fine
… As we do more, let's keep in mind that we want an abstraction like WebNN.

Honglin: ML accelerators are in rapid development. They may change a lot in the future.
… It's hard to cover all cases. GPUs are more mature than ML accelerators.
… We're aware NNAPI is not available everywhere. That's where we're starting for Chrome OS.

Ningxin: Regarding NNAPI, I agree.
… For the WebNN spec, in the working group, we took NNAPI as an implementation backend option when checking each operator's compatibility.
… We test against DirectML and NNAPI on Android or ChromeOS.
… I have an ask for help. For some issue discussions, we have questions about how ops can be supported by NNAPI.
… We'd like to be connected to the right people.
… Can you help?

jimpollock: You're looking for understanding of whether a WebNN op spec would be supported by NNAPI? Yes, we can help review and see what the mapping is like.
… A lot of ops have constraints. There may be certain modes that are supported, and others not

Ningxin: We also have the webnn-native project to evaluate this.

WebNN-native project

Ningxin: We have a PR to add NNAPI backend. We have questions about implementability.
… We can summarize and ask for your input.

<ningxin_hu> webnn-native nnapi backend PR: https://github.com/webmachinelearning/webnn-native/pull/178

anssik: HUGE thanks to Jonathan for scribing, great job!
… we'll defer versioning, streaming input and model format discussion topics to our next call
… we agreed to have our next call one month from now, but Jonathan and Honglin please share if you'd like to have our next call sooner, two weeks from now

Minutes manually created (not a transcript), formatted by scribe.perl version 185 (Thu Dec 2 18:51:55 2021 UTC).
