Meeting minutes
Repository: webmachinelearning/webnn
anssik: please welcome Jan Williams representing TPGi
… and Sushanth Rajasankar representing Microsoft Corporation to the WebML WG!
… I made a few agenda tweaks:
Announcements
Utilities for spec editors
anssik: editing web specs is not an easy task, so I'm happy to announce Joshua Bell has contributed utilities to help editors and contributors in authoring the WebNN spec and reviewing spec changes
… the tools are currently purpose-built for WebNN, but could be generalized in the future
… thanks Josh! Quoting the readme:
… - reformat-js.py applies clang-format to JavaScript blocks in the spec
… - lint.mjs analyses the spec Bikeshed source and generated HTML to look for common errors like duplicate words or unclosed links, and helps enforce the coding conventions
… Do you want to briefly introduce these tools to the group? Any caveats or areas where contributions would be particularly welcome?
Josh: I don't want to make the same mistake twice, so I wrote these tools to help with that :)
Implementation status update
https://
anssik: we have exciting updates to the implementation status, thanks everyone for your hard work!
… - XNNPACK backend was replaced by TensorFlow Lite across Windows, ChromeOS, Android and Linux
… (the earlier MLService data was merged into the newly added TFLite column)
… - added Core ML implementation status
… - updated ONNX Runtime Web support info for version 1.18.0
<gb> MERGED Pull Request 78 Update implementation status (TFLite + Core ML) (by ibelem)
<gb> MERGED Pull Request 80 Add WebNN Core ML implementation status (by ibelem)
Awesome WebNN updates
anssik: I'd like to also share some awesome updates from our community members
anssik: Onnx2Text converts an ONNX ML model protobuf from/to text, by Dwayne
anssik: many developers are familiar with Netron so Belem created a fork that integrates WebNN support status info into the tool
Netron with WebNN support status info
anssik: I think this Netron enhancement could perhaps be upstreamed to Netron in the near future
anssik: also, Joshua Bell has been working on NNotepad, a browser-based playground for experimenting with WebNN expressions without boilerplate code
Josh: this was a hackathon output
Other Awesome WebNN updates include WebNN samples for CPU, GPU and NPU
<gb> MERGED Pull Request 7 Add NNotepad sample and some tools (by ibelem)
Hybrid AI exploration update
anssik: I've asked the Hybrid AI exploration team to present key findings informed by the caching prototype, security and privacy considerations, and possible solutions for discussion
… a WebML Community Group repo was recently created for these discussions
anssik: I also want to acknowledge work by our Google participants who have explored the broader problem space around caching models in the browser
… and recently published an article exploring how to use the Cache API, Origin Private File System API, IndexedDB API, and File System Access API for caching large AI models, with a recommendation to use the Cache API (a sketch follows below)
Cache AI models in the browser
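A minimal sketch of the Cache API pattern the article recommends; the cache name and function are illustrative, not taken from the article:

    // Fetch a model, serving it from the Cache API when possible.
    // The 'ai-models' cache name here is illustrative.
    async function fetchModelWithCache(modelUrl) {
      const cache = await caches.open('ai-models');
      let response = await cache.match(modelUrl);
      if (!response) {
        response = await fetch(modelUrl);
        // Store a clone so the original body can still be consumed below.
        await cache.put(modelUrl, response.clone());
      }
      return response.arrayBuffer();
    }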
anssik: Mike, you have ~10 minutes, go!
Slideset: https://
Mike: Outline
… - Model size and download times
… - Security and privacy considerations
… - Caching requirements
… - Possible solutions: no silver bullet, only tradeoffs
Mike: Key points from offline discussion
… - some models are too large to download during a session
… - focus on use cases over specific models
… - adapters and variants are challenging: adapters "baked into" models, model variants that differ in quantization, etc.
Mike: Security and Privacy Considerations
… - Current browsers implement per-origin local HTTP caches
… - Cross-site privacy risk based on cache timing analysis
… - Tolerable for non-AI web resources, images, script libraries
… - BUT models are large and potentially shared
Mike: Possible Mitigations
… 1. Disallow use of WebNN in third-party contexts by default; this is done
… 2. Generate keys based on actual model content to avoid data exfiltration, but possibly not tracking
… 3. Limit the number of built models and cache checks to prevent probing via repeated model-existence checks
Mike: Caching Desired Properties:
… - Reduce Latency
… - Reduce Bandwidth
… - Reduce Storage
… cross-site, across implementations, model consolidation?
… - Preserve Privacy
… observation: it is hard to achieve 1 (latency) and 4 (privacy) together; 2 and 3 are easier
Mike: Proposal Define New Model-Aware Caches
… key ideas:
… 1. Use "fake misses" (delays) to avoid redundant downloads
… 2. Progress model load/timers only when requesting page is inactive
… 3. Identify cache items by content-dependent hashes (see the sketch after this list)
… 4. Use deduplication to avoid redundant storage
… Some alternatives:
… 1. Use existing APIs/caches perhaps with extensions
… 2. Use the File System API + Background Fetch
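A minimal sketch of key idea 3, deriving a content-dependent cache key from the model bytes; the SHA-256 choice and key format are illustrative, not from the prototype:

    // Derive a cache key from model content so identical models share
    // one cache entry regardless of URL. Key format is illustrative.
    async function modelCacheKey(modelBytes) {
      const digest = await crypto.subtle.digest('SHA-256', modelBytes);
      const hex = [...new Uint8Array(digest)]
        .map((b) => b.toString(16).padStart(2, '0'))
        .join('');
      return `model-${hex}`;
    }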
Mike: Prototype Status
… Implemented a node cache with hashes as keys, using an external Redis service for storage
… Model cache seems to be more generally useful
… Next steps:
… - Implement model cache
… - Base on Service Worker Cache, Background Fetch if possible
… - Three implementation options:
… 1. Capture/replay graph building (shim + extension)
… 2. Modify Chromium implementation (best for perf)
… 3. Cache an existing model serialization and use model loader
anssik: thanks Mike, any quick questions?
… comments and questions are also welcome via the dedicated GH repo:
<dwayner8> Mike, thanks for raising these pain points.
Josh: we're in touch with Mike on this topic
NPU support
anssik: issue #623 and PR #696
<gb> Pull Request 696 Add MLDeviceType npu (by fdwr)
anssik: thanks for the reinvigorated interest and discussion for this topic everyone!
… I'd like to try to pull it all together:
… We agreed to start with the simplest design, option 1: a deviceType: "npu" enum with system-decided fallback; this is what we have in PR #696 (a usage sketch follows the pros and cons below)
… Pros:
… + Very simple API
… + Least to test
… + Affords backends the most control for fallback, since only the primary device preference is specified
… Cons:
… - App cannot specify the fallback preference, as the system instead decides any fallback devices
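A minimal sketch of option 1 in use; deviceType "npu" is per PR #696, the powerPreference hint is an existing MLContextOptions member shown for completeness, and the wrapper function is illustrative:

    // Request an NPU-preferring context per option 1 (PR #696).
    // The system may fall back to another device; under this design
    // the app cannot specify the fallback order.
    async function createNpuContext() {
      return navigator.ml.createContext({
        deviceType: 'npu',            // primary device preference
        powerPreference: 'low-power'  // existing MLPowerPreference hint
      });
    }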
… in our past discussions, this option 1 received the most support:
… - Phillis commented "I actually like the simplicity of option 1. As long as we make it clear in the spec that the system may decide to fallback to other devices."
webmachinelearning/
<gb> Issue 623 WebNN should support NPU and QDQ operations (by wchao1115) [v2] [opset] [feature request]
anssik: - In our 2024-05-02 meeting the WG agreed to start with option 1 reserving the option to potentially expand if implementation experience shows need, thumbs up from Ningxin and MikeW
webmachinelearning/
anssik: - on 2024-05-30 MikeW asked "Why is MLDeviceType even necessary and shouldn't it be a browser implementation decision to choose the most appropriate processor given the existing MLPowerPreference?"
webmachinelearning/
anssik: - Zoltan responded "that could be possible even with the current API shape [...] if we can spec a way implementations could disregard user preferences/hints, telling if hints were overridden, or a fallback happened"
webmachinelearning/
anssik: - JoshB shared past discussion (from Feb 2024) on the use cases for explicit device type in #302
<gb> Issue 302 API simplification: context types, context options, createContext() (by zolkis) [v2]
webmachinelearning/
<gb> Issue 302 API simplification: context types, context options, createContext() (by zolkis) [v2]
webmachinelearning/
anssik: at that time "coordinating ML and other workload across devices" was identified as a use case, and the proposal suggested back then by Josh was to add additional hints such as "accuracy vs performance"
… later in that issue #302 Chai pointed out the "delegating it to the OS" approach sounds good "until you realize that the OS can make mistakes you cannot correct so be careful what you wish for"
… Chai also noted his preference is a device type "npu" paired with a programmable fallback device (either "gpu" or "cpu"), a design that he says is "likely to produce a better result with higher efficiency at a cost of better predictability" versus a design where "any operator can fail with an unsupported failure"
webmachinelearning/
anssik: on 2024-06-02 MikeW notes "WebGPU, which can be implemented purely in software without a physical GPU" and asks "why is WebNN specifying a physical device type and not leaving this up to the implementation?" and proposes "It is the browser implementation which ensures WebNN computations are consistent across any physical hardware device it runs on. In that scenario, MLDeviceType should be removed from the WebNN API."
webmachinelearning/
anssik: Ningxin shared use cases that require specifying a device type:
… - compute offloading: a game engine may want to run ML tasks on CPU to avoid interfering with the GPU time budget. See WebNN API in gaming scenarios discussion of WG 2022/04/07.
WebML WG 2022-04-07 telcon: WebNN API in gaming scenarios
… - op fallback: an ML framework may want to create a CPU context with a fallback-to-Wasm option, which would avoid expensive cross-device tensor data copies between WebNN graph inference and Wasm operator execution (a sketch follows the linked issues below). Custom operations #6 and Support CPU - WebAssembly scenario of the op level execution use case #156
<gb> Issue 6 Custom operations (by dsmilkov) [v2]
webmachinelearning/
<gb> Issue 623 WebNN should support NPU and QDQ operations (by wchao1115) [v2] [opset] [feature request]
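A minimal sketch of that op-fallback use case, assuming the deviceType option as currently drafted: a CPU context keeps WebNN tensors on the same device as Wasm kernels.

    // Create a CPU context so WebNN tensors and Wasm operator buffers
    // stay CPU-resident, avoiding cross-device copies.
    const cpuContext = await navigator.ml.createContext({ deviceType: 'cpu' });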
anssik: I see valuable use cases and I also see great discussion and questions. We've spent a good amount of time discussing the design, and I have reason to believe the group would benefit from wider feedback. We've learned that an effective way to get such feedback is to let users and developers explore and play with the API.
… I'd propose we update PR #696 with a prominent in-line issue block on top of the MLContextOptions spec section with text that clarifies the status of this feature, proposal:
<gb> Pull Request 696 Add MLDeviceType npu (by fdwr)
"ISSUE: MLContextOptions is under active development and the design is expected to change informed by further implementation experience and new use cases from the wider web community. The Working Group is considering additional API surface to allow definition of a fallback device, multiple devices in a preferred order, or an exclusion of a specific device. Other considerations under discussion include error handling, ultimate
fallback and quantized operators. Feedback is welcome on any of these design considerations from web developers, library authors, OS and hardware vendors, and other stakeholders via GitHub issue #623."
anssik: I believe that with this in-line issue added to clarify the status of this feature we can merge the PR and start gathering valuable wider feedback from API users
… I expect the group to revisit the issue to review wider feedback and adjust the design accordingly, to ensure we ultimately standardize on a design that solves the right problems faced by API users
… any comments, suggestions?
jsbell: thanks for the summary Anssi, agree with your proposal for moving forward
… the current device type is required at the current level of prototyping and testing, to make sure we exercise all the backends, I really like your approach Anssi!
zkis: In the issue I said with the current API we can satisfy Mike's ask
Dwayne: we do need this design, if only for testing purposes; I like the prose Anssi proposed and will integrate it
MikeApple: we understand the use cases for NPU devices and agree with Josh that we can now experiment with it and find out how it works; our concern is that it maps to hardware that exists today, while the spec will live long
anssik: I hear consensus to merge PR #696 after adding the note to MLContextOptions section
<gb> Pull Request 696 Add MLDeviceType npu (by fdwr)
MLBuffer
Direct buffer sharing proposal
anssik: issue #688
<gb> Issue 688 [MLBuffer] Support interop with WebGPU (by bbernhar) [webgpu interop]
anssik: Last call for comments for the direct buffer sharing proposal before we share it with the WebGPU group
… we've received feedback from Rafael (thanks!), please feel free to share more
webmachinelearning/
<gb> Issue 688 [MLBuffer] Support interop with WebGPU (by bbernhar) [webgpu interop]
Rafael: I agree with the overall proposal; implementation experience is key
… we can work on the details with Bryan and then talk to WebGPU group
Bryan: all good
Rafael: the Microsoft and Apple people are the same for WebNN and WebGPU
MLGraphBuilder and MLBuffer construction alignment proposal
anssik: issue #697
<gb> Issue 697 Inconsistency between MLGraphBuilder and MLBuffer construction (by reillyeon) [webgpu interop]
anssik: I'd also like to discuss the MLGraphBuilder and MLBuffer construction alignment proposal from Reilly, issue description:
"An MLGraphBuilder and an MLBuffer are both associated with an MLContext however MLGraphBuilder is created with a normal constructor while MLBuffer is created by the createBuffer() method on MLContext."
… proposal "We should consider removing createBuffer() and defining a normal constructor for MLBuffer with the same semantics as MLGraphBuilder (or add a createBuilder() method, but I'd prefer the former option)."
… good feedback from Rafael and Zoltan (a sketch of the two construction shapes follows)
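A sketch contrasting the two shapes under discussion; the MLBuffer constructor signature and descriptor contents are hypothetical, reflecting Reilly's proposal rather than any specced API:

    // Current shape: factory method on the context (descriptor illustrative).
    const buffer = context.createBuffer({ size: 4096 });

    // Proposed shape (hypothetical signature), aligned with MLGraphBuilder:
    const builder = new MLGraphBuilder(context);
    const buffer2 = new MLBuffer(context, { size: 4096 });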
jsbell: I think Rafael and Reilly are making the point that if you're calling a thing, passing the appropriate context implicitly or explicitly to a constructor, and getting a thing back synchronously,
… you use a constructor
… if construction is async that's a different question; would like to hear Bryan's perspective on behind-the-scenes async action; for a sync API returning an object the preference is a constructor
Bryan: related to initialization, in WebGPU we go to initialize buffer data, which presumably is not a sync operation, so init could be async
… we rely on the implementation, though that's not always the case; when init is so complex, it may not be sync
… more obvious to developers that this is just a normal GPU command; defining timelines is another question
… train of thought: objects can be initialized in fairly complex ways
jsbell: would like to hear more from Rafael
Rafael: this is not something I feel super strongly about
… other Web APIs that have contexts, WebGL/GPU, Canvas, all have this "create a thing" flavour
… WebGL mirrored Khronos spec, maybe not index too much on that particular API
… design principles for the web say prefer constructors if returning an object right away
… you can use a factory method with a promise e.g. when creating a bitmap
… if the group decides factory is preferred fine with that, preference to do with constructor
asully: talked with Reilly, that seems in line with his thoughts
… the dispatch method, e.g., happens on the context timeline but does not need to be on the context itself
… if you drop the assumption that anything involving the context must be a method on the context, that means even more in favour of a constructor
… slight preference for a constructor; will also follow up on moving other methods off the context
zkis: want to add to my comment: because initialization does not change the object, we can return an object right away and initialize the next time we interact with it
… if there's async behaviour it should return a promise
Bryan: this is an expensive function to call and could be turned async in the future
… need to come up with timelines
… may not stay sync for very long
asully: I guess regarding async, I think the synchronously returned buffer can be used as a handle when you call dispatch, so from the script's perspective things look sync but happen on this deferred timeline
… can you explain this in the issue perhaps?