WebML WG Teleconference – 25 September 2025

Meeting minutes

Anssi: we'll start by welcoming our latest new participant:
… please welcome to the WebML WG, Jason Mayes from Google!

Anssi: Jason is Google's Web AI lead and a familiar face to many of us, looking forward to working with you in this group!
… also warm welcome to Fabio Bernardon from NVIDIA to the WG!

Fabio: my background is on system software, want to make sure our solutions are aligned with this group's goals and provide end-users an ideal user experience
… and ensure that users get the full benefit of their platforms

Incubations

Repository: webmachinelearning/charter

Anssi: a debrief on the recent WebML Community Group developments

WebML CG Teleconference – 18 September 2025

Anssi: first, call for review of the WebML CG Charter received unanimous support, thank you for your support and contributions
… the new Community Group charter is operational as of today, adds WebMCP as a new deliverable

Web Machine Learning Community Group Charter

Anssi: this means the work on WebMCP can advance from its explainer phase to a spec drafting phase when appropriate
… I'm very pleased to see a diverse group of experts working on this proposal right from the start using experimental implementations to validate early design proposals
… keep up the great work!

Repository: webmachinelearning/webmcp

Anssi: second, we had a productive WebMCP API brainstorming session to discuss:
… - use cases and we resolved that WebMCP focuses on human in the loop use cases initially
… - core design principles, we resolved WebMCP will provide a layer of abstraction between the browser and MCP client aka "SDK option"
… - Naming discussion, discussion still active in issue #24

<gb> Issue 24 Bikeshedding the global name (by bwalderman)

Anssi: - API design, we resolved to continue use the provideContext design and explore adding support for registerTool(options) and unregisterTool(name) to complement the API informed by implementation experience
… - Declarative API, comments welcome via issue #22 and reviews for the explainer PR #26

<gb> Pull Request 26 add explainer for the declarative api (by MiguelsPizza)

<gb> Issue 22 Declarative API Equivalent (by EisenbergEffect)

Anssi: questions comments?

New features and operator specific issues

Repository: webmachinelearning/webnn

[operator specific] issues

[feature request] issues

Support dynamic tensor resizing for slice and resample2d

Anssi: issue #885

<gb> Issue 885 Support dynamic tensor resizing for slice and resample2d (by Honry) [feature request] [operator specific]

Anssi: in the latest comment Wanming explains how he eliminated dynamic nodes in the decoder model from Segment Anything
… I wanted to discuss generalizability of this solution proposed by Wangming
… and understand whether a proper layer where to address this performance issue would be in frameworks, at the WebNN EP level at construction time?
… Dwayne, Ningxin, you wanted to have a chat with Wanming to understand if the WebNN EP construction could be delayed similarly to DML EP so that WebNN EP could resolve patterns of cast/gather/unsqueeze at construction time?

Dwayne: that approach works for that model and operator, a workaround, we don't need this issue specifically if we have the superset issue that's coming next

<ningxin> +1

Anssi: I'd propose we close this issue and redirect to #883 as a general solution

<gb> Issue 883 Support flexible input sizes (by huningxin) [feature request] [operator specific]

Flexible input sizes

Anssi: issue #883
… this proposed new feature is a significant change, so we wanted to gather feedback from key customers before proceeding
… we had identified ONNX Runtime and Transformers.js as WebNN API customers with this requirement
… at our last meeting we received "a massive +1" from Joshua of Transformers.js
… Reilly noted this must be working on ORT WebGPU EP now, and Phillis confirmed Core ML supports dynamic shapes on CPU

Mike: correct
… Anssi: I think the remaining open was to check with Guenther for ORT Web and WebNN EP perspective
… do we have new information, or should we defer this for later?

Dwayne: no new information from Guenther yet

Reilly: this has been on my todo list, even the question that Core ML only support this on CPU is relevant, want to use what semantics frameworks use

<Zakim> anssik, you wanted to ask

Core operator set

Anssi: issue #573

<gb> Issue 573 Core operator set (by philloooo) [question] [opset]

Anssi: I wanted us to revisit our core operator set effort that aims to identify current primitive gaps by mapping compositional fundamentals to WebNN operators
… the current WebNN op set contains ~92 operators and is informed by requirements of popular models
… both low-level and certain high-level fused operators are included, and all high-level ops can be decomposed into low-level ops
… in non-normative blocks in the spec we include code that demonstrates implementability of these decompositions
… a few examples of high-level ops whose decompositions are more elaborate include e.g.

lstm()

lstmCell()

gru()

Anssi: if you scroll to the bottom of these sections and search for "can be generically emulated" you'll find the emulation code
… I linked to the publicly shared speadsheet Dwayne put together some time ago:

Machine Learning Operator Mapping

Anssi: there's 3 tabs: Operators Rearranged, All Raw Operators, Linalg comparison
… if we look at the "All Raw Operators" tab, it includes the following compared against all WebNN operators:

Anssi: this is a lot of data, thanks Dwayne for sharing this publicly
… the high-level aim of this exercise is to make sure we have the right compositional fundamentals, aka low-level ops
… that allow us to decompose all high-level ops included into these low-level ops
… our current thinking is some of these low-level ops may be only useful for composition, but not as is, and this is OK
… we have added the following new primitives based on this research:

- rounding operators (issue #817, PR #859)

<gb> CLOSED Issue 817 Rounding operators (by fdwr) [feature request] [interop]

<gb> MERGED Pull Request 859 Add `roundEven` operator (by fdwr)

- isNaN and isInfinite (PR #858)
… blocked by lack of platform support is:

<gb> MERGED Pull Request 858 Add `isNaN` and `isInfinite` operators (by fdwr)

- bitwise ops (and, or, xor, left shift, right shift, not), Core ML does not currently support, proposed as an optional extension
… to be researched for feasibility:
… - modulus/remainder, flooring divide
… - sumPool/minPool
… - random number generation
… - expand to support multiples of block sizes
… - relax dimension limitations (e.g. support conv1d and conv3d too, not just conv2d)

Dwayne: I don't have notes written beyond the latest comment, would be informative to have input from people who worked on TOSA and PyTorch prims and others
… if no additional issues, I'll tackle the todo items

Reilly: wanted to say some of the pressure on this issue has been taken off by opSupportLimits work, we can express what ops are supported in which implementation
… one aspect of this problem, we want to make sure people using WebNN know what it can support on the given system
… the next bit that's valuable is both checking which ops we could add to WebNN because they're supported and expand the scope of ops, and understand ops that frameworks expect to exists and make sure those are expressible in WebNN
… this is where feedback from Joshua of HF is very helpful

Dwayne: agreed

Reilly: would like to pull in NVIDIA folks, a valuable contributions would be to understand places where the underlying HW has fast paths, places where we might have compositions, but what exists in current gen HW may need something different
… many decompositions can be refused, but requires a compiler that does not exists yet, ML model compiler state of the art is immature cf. other compilers
… so we need to balance against that reality

Ningxin: I'd second Reilly, from the particular model perspective, when we look at SLMs Phi mini, TinyLlama and ONNX models running through WebNN EP, very specialized high-level operators e.g. various attentions, GQA, MatMulNBits
… we could do some research on those language models, because they are highly optimized, to see how to decompose those ops to primitives, and identify any op gaps, and investigate performance impact

<DwayneR> Ningxin: Your WebNN function idea from TPAC (composed minigraph representation with a decomposition) is useful here.

Ningxin: what is the performance impact of decomposed form, fusing ops back

Ningxin: we have been doing some investigation for those language models, we can gather some data and share in the issue

Fabio: I will get back to the group after talking with the NVIDIA team

WebNN-WebGPU interop

Anssi: I wanted us to refresh the "webgpu interop" triaged spec issues in the light of new implementation experience from ML-GPU interop work in Chromium

"webgpu interop" issues

Anssi: I believe some participants are interested in making progress in this space toward the end of the year, so want to understand which issues to bump in priority
… a lot of this is under-the-hood implementation work to get backends interop in a performant manner
… any WebNN API shape implications or possible changes suggested, or is the current MLTensor abstraction working well?
… there's some POC work to enable WebGPU-WebNN interop with ORT on UMA devices, interested in any learning from this

https://chromium-review.googlesource.com/c/chromium/src/+/6962138

Reilly: the prototyping work in Chromium seems to be going well, we have a working implementation of export on DirectML and Core ML, would like to add this to TFLite backends, work in progress
… we seem to be successful in implementing MLTensor as proposed
… the other piece is, chatting with Phillis, we designed the export method to be async based on early exploration in implementation feasibility and we think it might be possible to make it sync, and that might be the next change to happen to the current proposal, if we succeed in that implementation exercise

Anssi: sync benefit?

Reilly: performance, when moving from CPU to GPU must schedule a new GPU task, supporting sync would allow full pipeline to get sent in a one go

Ningxin: implementation experience from context creation would be beneficial, creating from WebGPU device vs. from NPU device
… in this context to interop with WebGPU, there's interest in application developers to allow NPU context interop with WebGPU

Rafael: the answer to Ningxin's questions, enabling people to use inference on NPU, the browser needs to take care of copying the data between the two adapters

Ningxin: thanks Rafael, my questions is if the developer needs to know if this path is efficient, is it zero copy or not?

Rafael: we have to experiment and look at the results

Reilly: looking at frameworks, initially DirectML was the only that tied context to GPU and other devices, other frameworks worked on higher level to span across CPU, GPU, NPU
… as Rafael said, implementation figures where to keep the buffer based on the given hint
… also adjust that based on real-world usage, buffer could be moved around

Anssi: it would be interesting to see an explainer for various approaches to UMA

Privacy considerations

Anssi: issue #886 and PR #890

<gb> Pull Request 890 Revise privacy considerations (by anssiko)

<gb> Issue 886 Revise privacy considerations (by anssiko) [privacy-tracker]

Anssi: I submitted a PR with a first stab at revising the privacy considerations section in response to privacy review we received
… the suggested changes are the following:
… - Add a new introduction section
… - Add Fingerprinting subsection, revise and expand content: note design principles wrt operators, MLContextOptions, opSupportLimits()
… - Add Execution Time Analysis subsection
… - Add WebGPU Comparison
… I requested initial review from Reilly because he had the context but welcome review from others as well
… as you see, fingerprinting is the key consideration so it got its own subsection to highlight the mitigations
… also Execution Time Analysis, the known privacy issue inherent to any compute API, got its own section and welcomes further input from experts
… lastly, we carve out discussion on WebGPU comparison into its own subsection

Reilly: didn't have a chance to look yet

MLGraph Cache

Explainer

Anssi: this topic was included to check any feedback from Apple, MikeW?

Mike: looked at this briefly, not completely clear why this does not happen implicitly
… or is this a detail I missed

<Zakim> reillyg, you wanted to explain.

Reilly: I swear this was in one of the explainers, Zoltan was in the implicit caching camp
… the example here is an API that does implicit caching, Wasm and WebGPU do that, not specified per se, but that's how Wasm does it when a module is loaded, we stick the compiled version in the same bucket so if it is compiled we reuse that instead
… in WebGPU folks reuse compiled shaders
… the difference to WebNN is the size of input, in Wasm the module size and WebGPU shaders, these are small, Wasm 10-20 MB, whereas ML models, some small, but can be from 100 MB to +1GB
… the GPU shader model does not work
… that doesn't work here very well, would be inefficient
… with a model loader approach we could possibly follow the Wasm approach, but that does not work with a model builder design
… caching is asking the developer "give me a name for the model", and the model is keyed on that

<gb> Issue 807 Caching mechanism for MLGraph (by anssiko) [question] [feature request]

Mike: I see that as per-origin, no privacy concerns with sharing models

Mike: everything looks good to me, thanks

Query supported devices

Anssi: this problem space is split in two: before and after graph compilation
… and we need to decide whether we want to proceed with "before" case, "after" case, or with both

Before graph compilation

Anssi: issue #815 and PR #884

<gb> Pull Request 884 Update explainer with new proposal for simple accelerator mapping (by zolkis)

<gb> Issue 815 Query supported devices before graph compilation (by anssiko) [device selection]

Anssi: the PR was updated by Zoltan (thanks again!) based on feedback from the previous call to a simplied boolean-returning context.accelerated API

Summary of changes

<gb> Issue 815 Query supported devices before graph compilation (by anssiko) [device selection]

Anssi: to make the latest proposal easy to review and comment on, here's the proposed IDL change from the explainer:

partial dictionary MLContextOptions {
  boolean accelerated = true;
};
partial interface MLContext {
  boolean cpuFallbackActive;
};

Zoltan: that's the minimal proposal, only have this property, because context option only makes sense if the UA cannot have acceleration to satisfy the use case raised by Reilly in the issue
… if we don't want to deal with the error, the UA will always try to accelerate and will always default to true
… only difference is how we want to handle explicit error at context creation if the platform cannot support acceleration
… another consideration, do we want to expose this as an event or want a property
… also opSupportLimits by device is another possible design, sharing with the post-graph compilation

Rafael: I think boiling it down to booleans seems fine, would like to understand event and property on context, is this due to context getting lost?

Zoltan: for app to know when there's CPU fallback, wanted to figure out how to spec events, property would be easier to integrate into existing algorithms

Rafael: that'd be aligned with WebGPU/GL, would like to hear from Markus if this satisfies his requirements

<jason> lol

<jason> i hear it

Markus: couple of use cases

[ use cases x 1000 ]

Markus: power preferences plus this sounds OK, for large models that need to have acceleration
… if you want low latency, maybe the framework only supports CPU backend

Zoltan: NPU could be an option
… could add "low-latency" to context options
… you only check if CPU fallback is active?

Markus: for audio processing you have latency going back and forth, wait for completion, in those cases we still want to use the benefit of WebNN CPU acceleration, define that as accelerated == false

Zoltan: you'd want context creation with accelerated == false, do you want an error in that case so the UA would know that's not possible?

Markus: getting access to MLContext and reading that would be fast, error not needed in this case

Zoltan: thanks this would be simple

Markus: if the system is running the model and falls back, that'd be interesting to know, if we have a property then we'd need to poll that property
… we can poll the property on every frame

Zoltan: I will move forward finishing this PR and put up a spec PR for comments

Open PRs

Anssi: PR #857

<gb> Pull Request 857 Support rankRange for op output tensors in opSupportLimits (by huningxin)

Anssi: this PR adds rankRange for graph input, constant and output in opSupportLimits()
… we have two approvals, editors, please feel free to merge at your convenience -- thanks for your work on this!

Anssi: the long-standing PR #770 was also merged this week to bake in the rename pool2d roundingType -> outputShapeRounding

<gb> MERGED Pull Request 770 Rename pool2d MLRoundingType - Simplify the operand layout support of conv2d and pooling 2d operations (by fdwr)

Anssi: thank you again editors and reviewers and all contributors
… great to see the PR queue get shorter

– DRAFT –
WebML WG Teleconference – 25 September 2025

25 September 2025

Attendees