W3C

– DRAFT –
WebML WG Teleconference – 13 March 2025

13 March 2025

Attendees

Present
Anssi_Kostiainen, Dwayne_Robinson, Etienne_Noel, Joshua_Bell, Laszlo_Gombos, Michael_McCool, Mike_Wyrzykowski, Ningxin_Hu, Rafael_Cintron, Reilly_Grant, Tarek_Ziade, Thomas_Steiner, Winston_Chen, Zoltan_Kis
Regrets
-
Chair
Anssi
Scribe
Anssi, anssik

Meeting minutes

Repository: webmachinelearning/webnn

anssik: Please welcome Deepti Gandluri from Google to the WebML CG and Brent Zundel from mesur.io to the WebML WG!

Announcements

WebML WG Charter 2025-2027 approved

Call for Participation: Web Machine Learning Working Group Charter approved; Join the Web Machine Learning WG

anssik: key takeaways:
… The group is chartered through April 30, 2027, and it is possible to recharter mid-term to add new deliverables
… The Working Group's mission remains the same, focus on WebNN API, we'll also work closely with the Community Group to ensure alignment with new incubations and share expertise
… thanks to the positive feedback from W3C members, the new charter is operational ahead of schedule
… thank you all for your support!
… Current participants are not required to rejoin, but new participants are welcome, please pass the news to your internal teams who may be interested

WebNN Samples Test Framework

anssik: a new automation test framework for testing W3C WebNN Samples has been contributed to the WebML CG

webmachinelearning/webnn-samples-test-framework

anssik: thank you Ning & Belem for the initial contribution!
… this CLI tool automates running webnn-samples

webmachinelearning/webnn-samples

anssik: supported OSes include Windows, Linux and macOS; supported browsers are Chrome (all channels) and Edge (Canary only)
… this test framework is in scope of the CG "Test Suites and Other Software" deliverable

Web Machine Learning Community Group Charter - Test Suites and Other Software

Incubations summary

anssik: I again asked Etienne to share a summary of the CG's 12 March telcon, thanks Etienne for taking notes

https://github.com/webmachinelearning/meetings/blob/main/telcons/2025-03-12-cg-minutes.md

anssik: as a reminder, minutes are drafts and corrections to the minutes are welcome
… before sharing the summary, I'd like to acknowledge we've received feedback from EMEA participants that a more EMEA-friendly time slot would be preferred
… we're looking into alternating the time of the CG telcons to better serve our global community

[Etienne shares a summary of the 12 March CG meeting, for details see the meeting minutes]

Operator set Wave 3 and spin-offs

anssik: PR #805 has been merged

<gb> MERGED Pull Request 805 Operator set wave 3 (by fdwr)

anssik: congrats to the WG and massive thanks to Dwayne for all the updates to the PR, and to Ningxin, Josh, and others for your in-depth review comments and suggestions!
… as shared earlier, I'll initiate TAG review soon and this op set Wave 3 is a key piece of that review scope
… I closed the big transformers issue #375 that we've referred to for all transformers-related work

<gb> CLOSED Issue 375 Support for transformers (by dontcallmedom) [opset]

anssik: the group now satisfies its requirements for well-known transformer models as outlined in its charter scope
… incremental improvements are expected and we will track future improvements in separate smaller issues now that the foundations are in place

<dwayner> Thank you Joshua and Ningxin for the careful eye 🙏 and the 243 comments! :)

anssik: I want to thank you all again, and convey feedback we received from Julien Chaumond, Hugging Face CTO: "This is insanely impactful work you've been doing, thank you"

<jsbell> Always happy to help, Dwayne!

<ningxin> Thanks to Dwayne, this is a huge contribution!

anssik: Keep up the great work!

dequantizeLinear emulation improvement

anssik: the first spin-off from PR #805

webmachinelearning/webnn#779 (comment)

<gb> CLOSED Issue 779 Support block-wise quantization (by huningxin) [operator specific]

anssik: Dwayne realized dequantizeLinear emulation could be improved significantly if expand() was augmented to accept any from-shape that was an integer multiple of the to-shape

https://www.w3.org/TR/webnn/#api-mlgraphbuilder-expand

anssik: given issue #779 was closed by PR #805, I'd support tracking this as a separate issue

<gb> CLOSED Issue 779 Support block-wise quantization (by huningxin) [operator specific]

<gb> MERGED Pull Request 805 Operator set wave 3 (by fdwr)

anssik: Dwayne, feel free to create a new issue for this improvement

Dwayne: good summary, this would enable cleaner composability
… I will open a new issue for that
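
[A rough illustrative sketch of the improvement under discussion: block-wise dequantizeLinear emulation, assuming expand() is relaxed as proposed so that shapes related by an integer multiple can be expanded. The relaxed expand() behavior and the function and shape names below are hypothetical, not current spec:]

  // dequantizeLinear computes (input - zeroPoint) * scale; for block-wise
  // quantization, scale and zeroPoint hold one value per block and must be
  // repeated block-wise up to input's shape before the element-wise math.
  function emulateDequantizeLinear(builder, input, scale, zeroPoint, shape) {
    // These two calls assume the proposed relaxation of expand();
    // today expand() only broadcasts dimensions of size 1.
    const s = builder.expand(scale, shape);
    const z = builder.expand(zeroPoint, shape);
    const x = builder.cast(input, 'float32');
    return builder.mul(builder.sub(x, builder.cast(z, 'float32')), s);
  }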

Gather multiaxis operator proposal

anssik: a second spin-off from PR #805

webmachinelearning/webnn#767 (comment)

<gb> CLOSED Issue 767 Request the decomposition for gatherElements, scatterElements and scatterND (by fujunwei) [operator specific]

anssik: Dwayne has explored a way to express a gathering operation more generically, rather than as three distinct operators: gather elements, gather blocks, gather ND blocks
… problem statement: "Is there a more fundamental expression of a gathering operation that is more generic while also being simpler to document and implement?"

anssik: Dwayne's draft Gather multiaxis operator proposal to more generically express a gathering operation has a ton of important details:

https://github.com/fdwr/MachineLearningOperators/blob/master/Multigather.md

anssik: I invite folks to review the draft for:
… - operator equivalence classes
… - proposed multigather operator IDL
… - a prototype implementation of gatherMultiaxis() in JS
… - operator mapping

anssik: Could this become another standalone issue? I think the group may need some extra time to look into this

Dwayne: a separate issue makes sense, this is still a rough draft and it takes a while to digest all the details
… all three ops use the same shader under the hood
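
[For context, the three gathering operators as they exist in the spec today, which the draft proposes to subsume under one multiaxis primitive; a sketch with hypothetical input/indices operands:]

  // Current WebNN spec: three distinct gathering operators.
  const slices = builder.gather(input, indices, { axis: 0 });            // gather whole slices
  const elements = builder.gatherElements(input, indices, { axis: 0 });  // gather single elements
  const ndBlocks = builder.gatherND(input, indices);                     // gather N-D blocks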

anssik: this is a pioneering effort

jsbell: wanted to ask about uint4/int4, rough timeline?

Dwayne: next week

Query mechanism for supported devices

anssik: issue #815

<gb> Issue 815 Query mechanism for supported devices (by anssiko) [device selection]

anssik: thanks everyone for your contributions in the issue
… I'm happy to see active discussion and new perspectives, ideation
… as a reminder, the approach I'd prefer us to take is to extract real-world use cases first, then assess implementability, and only then go deep into the solution space
… I'll recap the recent feedback received, and then we will open the discussion and folks can fill in the blanks
… Mike's comment "a processing chip is best described in terms of capabilities"
… received a response from Reilly: "this is best in principle however the reality of all the frameworks we have been prototyping WebNN against is that none of them expose which capabilities a processing chip supports"
… and continued "the capabilities we end up exposing through the opSupportLimits() method are the capabilities of the framework in general rather than any of the processors it can target."
… Mike suggested listing opSupportLimits() per processor
… Zoltan suggested passing requiredCapabilities (dataTypes, maximumRank, operators) to createContext() so that it would automatically select a device
… Mike shared "in the WebGPU specification we discussed this a bit as many of the same privacy consideration exist [...] the UA can choose to bucket / report any specific limits / features / operations they wish to mitigate the potential privacy concerns."

https://www.w3.org/TR/webgpu/#privacy-machine-limits

anssik: continued: "perhaps a browser wants all models to run on a class of devices with different hardware support, so it reports the lowest common set supported across all devices even on the higher end devices"
… Reilly suggested defining the processor types in terms of abstractions exposed by the web platform:
… - The "cpu" is the processor where JavaScript and WebAssembly execute
… - The "gpu" is the processor where WebGL and WebGPU programs execute
… - All other processor types are unnamed
… ("cpu" and "gpu" are placeholder names, could be "foo" and "bar")
… Reilly's proposal reads: "The control I propose adding is the ability to request an MLContext with the capacity to invoke ML graphs using the capacity of either the "cpu" or "gpu" processor, or that explicitly does not invoke ML graphs using the capacity of either the "cpu" or "gpu" processor. Implementations are free to ignore this request."
… "Additionally, I propose adding a property which informs the developer which processors were involved in a previous ML graph dispatch task."
… capabilities != capacity; capacity answers the question "can this perform adequately?"
… the two use cases extracted from Reilly's proposals:
… - UC1: "allow the developer to provide a hint to the platform about the workloads it intends to run (e.g. please don't use the GPU for this ML task, there is WebGPU work I intend to schedule as well)"
… - UC2: "allow the developer to understand the performance of their application based on how the browser ended up being able to schedule their workloads"
… Zoltan's refined proposal for includeDevice + excludeDevice as hints to createContext()
… Josh proposed "Maybe [MLContext.]dispatch() could return a Promise that resolves to a dictionary with details of the inference, e.g. what compute units it used, and in the future additional diagnostics like timing?"

anssik: to summarize, we received new use cases (e.g. UC1 and UC2), feedback on implementability (capabilities of frameworks vs. chips), and many proposed solutions for the possible API shape (perProcessorSupportedLimits, "cpu-like" & "gpu-like", includeDevice & excludeDevice hints, dispatch() resolving with inference details)
… I'd suggest Zoltan extract the use cases from this issue and add them to the explainer
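
[To make the two use cases concrete, a rough sketch of the proposed shapes; every option and property name below is hypothetical, drawn from the proposals summarized above, and nothing here is in the spec:]

  // UC1: hint at context creation which processors to avoid (hypothetical option).
  const context = await navigator.ml.createContext({
    excludeDevice: ['gpu']  // e.g. "don't use the GPU, WebGPU work is scheduled too"
  });

  // UC2: learn where a dispatch actually ran; per Josh's suggestion dispatch()
  // would resolve with inference details (today it does not return them).
  const details = await context.dispatch(graph, inputs, outputs);
  console.log(details.devices);  // e.g. ['cpu'] (placeholder names)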

anssik: I'd like to open the floor for discussion, please queue up

Mike: if we want to expose capabilities of each chip, we have more features in WebGPU than in WebNN, so we could mitigate the privacy issues

Reilly: I'm supportive of Zoltan's and Mike's proposals on how to expose processor capabilities in a more nuanced way; I'm completely supportive of adding that to the API
… my only hesitation is that the current frameworks do not support that, we're not able to query Core ML, TensorFlow etc. for which processors to use
… my proposal is mainly focused on what information is useful for developers, given those constraints
… I don't want to explicitly name every processor type, as the ecosystem will shift and we want future-proofing
… I can justify calling them "something-cpu" and "something-gpu" because they tie to Wasm and WebGL/WebGPU
… the capacity of the system to execute JS comes from "something-cpu"
… that reflects the feedback we get from developers, they have a portfolio of features
… you want to hint the browser ahead of time whether a certain placement of the work is a good idea

Zoltan: I was happy to see Reilly's take to define cpu and gpu in terms of the browser implementation; my hunch is they're not mutually exclusive with Mike's proposal, and it is also a simple API that is easier to implement
… OTOH, Mike's proposal has been out for a while and might take more work to implement; we could do both
… Mike, any concerns about implementing the cpu-like and gpu-like concepts?

Mike: no direct concerns with the proposal, but it's not immediately clear whether cpu-like and gpu-like are useful if the frameworks don't expose this information

Mike: need to read Reilly's proposal again

jsbell: on accumulating inference results on the objects, I would prefer a more direct way to get at that information

Reilly: wanted to clarify that the two proposals have two different use cases, or developer needs
… 1) tailoring for capabilities, 2) the capacity question, the amount of compute resources the system has to execute XPU workloads
… I propose we tackle the capacity question first

<zkis> +1 to reilly's point, I also see them cater to different developer flows, both seem valid

Reilly: the capability question is also interesting; we'd need to hard-code the capabilities, which is easy if only a small number of chips are targets

Zoltan: we can wait on Mike, it is important to get his feedback; Reilly has defined the use case clearly, the question is whether we can implement that differently

RafaelCintron: question for Reilly on the capacity proposal, is capacity a moment in time?
… or something that changes over time

Reilly: dispatch returning some information about where inference actually executed is one part
… the other is an ahead-of-time hint: based on what I know, this works on this class of devices, so try to run the workload here
… two signals: 1) where you should put the workload and 2) where the system actually ran it
… I expect developers to test on machines many users have, a representative sample of devices, to understand their characteristics

RafaelCintron: how do they know the browsers run on those systems?

reillyg: for fingerprinting reasons, we don't share excess information

RafaelCintron: WebGPU has a device class
… what if the web developer learns "CPU and GPU are off limits" and the other available devices do not have the required ops, what should they do?

reillyg: hint back to developer, "here's where it actually ran" is a useful signal
… potentially there's room for a device-class-style hint for WebNN that answers the question "do you have a non-GPU-style accelerator?"

Caching mechanism for MLGraph

anssik: issue #807

<gb> Issue 807 Caching mechanism for MLGraph (by anssiko) [question] [feature request]

anssik: no new feedback in the issue since we discussed this last time
… first, I wanted to check if anyone has investigated implementation issues with a model cache within the browser sandbox, and possible solutions for safely overcoming the sandbox restrictions
… second, I would like to gather further feedback on implementation-defined vs. explicit caching APIs, but would like to understand the implementability story first

Reilly: I can speak to both; we haven't had a chance to look at implementation yet, but we believe it is feasible to implement caching in our current Chromium prototype for the TFLite and Core ML backends
… sandboxing is not an issue, because this is an origin-scoped cache
… the reason for explicit caching API has to do with weights
… the situation is different between WebGPU and WebNN: for WebGPU, caching shaders makes sense, they are smaller
… weights in WebNN can be reorganized in the compilation stage, and a shader can modify the weights

McCool: I think we have to understand the use cases, saving download time or compile time
… single-origin vs. cross-origin caches

Reilly: this is totally implementable, but it is work; moreover, looking at ORT and TFLite, they don't have a concept of a cached model
… it requires implementation work on both the browser side and the framework side

<reillyg> I need to go. Losing my room.

<dwayner> Reilly: ORT has a Session option to dump the optimized file to an .ort file format which reloads more quickly. We might want to ask ORT to add a way to get it via a memory blob rather than file path.

ningxin: one comment regarding frameworks, I'm aware of an ORT concept, EPContext, that allows embedding a compiled binary blob into an ONNX model, encapsulated as a special operator
… worth investigating if this could help with model caching

<McCool> (regarding my comment, my point was an implicit cache cannot save model download time - you need an explicit cache)
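
[As a strawman for the explicit-API direction discussed above, a sketch of an origin-scoped cache; loadGraph() and saveGraph() are hypothetical names invented for illustration, no caching API exists in the spec:]

  // Hypothetical explicit caching API, origin-scoped and keyed by a string.
  let graph = await context.loadGraph('my-model-v1');  // hypothetical
  if (!graph) {
    // Cache miss: fetch weights, build the graph, then cache it so a later
    // visit can skip both the weight download and the compile step.
    graph = await builder.build(outputs);
    await context.saveGraph('my-model-v1', graph);     // hypothetical
  }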

Minutes manually created (not a transcript), formatted by scribe.perl version 244 (Thu Feb 27 01:23:09 2025 UTC).


Maybe present: anssik, Dwayne, jsbell, McCool, Mike, ningxin, RafaelCintron, Reilly, reillyg, Zoltan

All speakers: anssik, Dwayne, jsbell, McCool, Mike, ningxin, RafaelCintron, Reilly, reillyg, Zoltan

Active on IRC: anssik, dwayner, jsbell, lgombos, McCool, ningxin, RafaelCintron, reillyg, tomayac, Winston, zkis