Meeting minutes
<DenisD> Remote + Denis_DIDIER
<gb> https://github.com/webmachinelearning/meetings/issues/35
Repository: webmachinelearning/webnn
Welcome
Anssi: welcome to the W3C Web Machine Learning WG F2F at TPAC 2025, this is our second physical F2F
… I'm Anssi Kostiainen, Intel, the chair of the WG
… with me is Dom, Dominique Hazael-Massieux, W3C Staff, helping run the meeting smoothly
… again great to see so many folks here in person, including new people outside the usual WG participants, and guests representing Japanese W3C members and organizations
… Arigato gozaimasu!
… this WG has continued to grow rapidly since last year, we have all major browser vendors on board and new folks are joining
… the YoY growth is around +30% in both organizations and participants, for both this WG and its sister CG
… a few new members who joined the WG since last F2F:
… Hugging Face
… Qualcomm
… NVIDIA
… ARM
… Shopify
… we are working at the intersection of Web & AI/ML technologies during this time of exponential growth in AI and we're lucky to have such a diverse group of experts onboard:
… all browser vendors, OS vendors, major semiconductor companies invested in AI, major platform providers, ISVs, distinguished researchers from academia, individuals, and more
… if you registered as a WG participant, please join us at the table
… observers are welcome to join the table too subject to available space
Anssi: we use Zoom for a hybrid meeting experience, please join using the link in the meeting invite
Anssi: we use IRC for official meeting minutes and for managing the speaker queue
… please join the #webmachinelearning IRC channel, link in the meeting invite and agenda:
https://
webmachinelearning/
<gb> Issue 35 WebML WG/CG F2F Agenda - TPAC 2025 (Kobe, Japan) (by anssiko)
Anssi: to put yourself on the queue type in IRC "q+"
… during the introductions round, we'll try to record everyone's participation on IRC with:
… Present+ Firstname_Lastname
… please check that your participation is recorded on IRC so we're able to acknowledge your presence in the meeting minutes
Intros
Anssi: since we're many again, we'll do a quick round of introductions, 15 seconds each, full name, affiliation and key interest
Dwayne: WebNN spec editor, with a focus on new operators, at Microsoft
Rafael: also at Microsoft, on Edge browser, working on all things AI and graphic rendering
MikeW: at Apple, involved in WebNN and also WebGPU
Denis: involved in Sustainable Web Guidelines
Phillis: at Google, working with Reilly on WebNN implementation on Chromium
<DenisD> Introduction : Denis DIDIER, from France - Company ITHENKA, Contributor to W3C Sustainable Web Guidelines, and Sustainable AI with french non-profit Institute for Sustainable IT.
Reilly: implementing WebNN in Chromium
Shushan: at Microsoft, built-in AI
ErikA: manager on edge browser team
Ugur: working on AI solutions for the construction industry as chief AI officer
AndrewW: at ARM, involved in our open source and standards strategy team
Tarek: from Mozilla on the Firefox AI team
Markus: from NVidia, devtech supporting ISVs integrating ML in their apps, getting involved in standardization to make their lives easier
Ningxin: co-editor of WebNN spec at Intel
Dom: staff contact for the group and looking at impact of AI on the Web
@@@: at Google looking at integration of WebNN in Chromium
MarkF: at Google, working on Chrome on AI & agentic features, involved on WebMCP
Kenji: chrome, built-in AI
Ben: Chrome, similar to Mark on agentic AI
@@@2: Chrome team, WebMCP
Thomas: devrel at Chrome
Ali: program manager in Google, supervising ML/GPU work
<kbx> @@@2 is Rob Kochman.
DavidEzell: Connexus - excited by this group; we're a standards body hoping to turbocharge retail vendors with our standards
Brian: @@@
YutaHagio: working for NHK, Japanese broadcaster
ChiaraCerretti: @@@
GuidoU: Google, WebRTC APIs in Chrome, exploring application of AI
MarkusH: Google Meet, interested in AI/WebRTC
Diogo: Brazilian W3C Office
SamGoto: Google Chrome, platform APIs
@@@: Meta browser
Masao: @@@
Agenda bashing
Anssi: the F2F agenda was built collaboratively with you, the WG participants and is published on GH:
webmachinelearning/
<gb> Issue 35 WebML WG/CG F2F Agenda - TPAC 2025 (Kobe, Japan) (by anssiko)
Anssi: any last-minute proposals or updates?
<kbx> Meet after the meeting for meat or no meat.
Charter orientation
Anssi: we have two groups, Web Machine Learning Working Group (WG) and Community Group (CG)
… WG standardizes Web APIs for on-device inference using CG incubations as its seeds
… deliverables: WebNN API, Ethical Principles
Anssi: we're looking to make the Ethical Principles as a joint deliverable with the proposed Web & AI Interest Group
… this informative document is a reference from the WebNN API spec
… CG is a group where new ideas are discussed, explored and incubated before formal standardization
… past CG spec incubations include e.g. WebNN, Model Loader
… since last year, we've expanded the scope of the CG to built-in AI APIs and agentic web capabilities
Anssi: current CG deliverables:
… Prompt API
… Writing Assistance APIs
… Translator and Language Detector APIs
… Proofreader API
… WebMCP API
… the CG technical scope is higher-level task-based APIs and the agentic web feature WebMCP
… while the WG technical scope is the lower-level WebNN API, the graph builder abstraction
Anssi: the WG and CG work closely together and coordinate with other W3C groups, for example:
… - WebGPU WG/CG for WebNN-WebGPU interop
… - Wasm CG
… - WebRTC for media processing-related integrations
… - AI Agent Protocol Community Group for agentic protocols
… - And with horizontals: privacy, security, a11y, also emerging sustainability and ethics
dom: we operate under W3C CoC and antitrust guidance for W3C, under W3C Patent Policy
Spec orientation
Anssi: we have scheduled time before our first break to do a triage pass through open issues
… the plan is to collaboratively look at our backlog of issues and PRs to:
https://
Anssi: - focus on breaking changes
… - check priorities
… - set next steps for the issues
… let's use IRC to queue proposals, examples:
<gb> Issue 573 Core operator set (by philloooo) [question] [opset] [Agenda+]
<gb> Issue 883 Support flexible input sizes (by huningxin) [feature request] [operator specific] [Agenda+]
<gb> Issue 861 Evaluate sustainability impact (by anssiko) [tag-needs-resolution] [Agenda+]
Anssi: this is your last-minute opportunity to influence today's agenda
… we'll first record triage results on IRC during the first ~15 mins, then review as a group and continue to discuss and refine on the hallway track with coffee/tea
<gb> Issue 226 Integration with real-time video processing (by dontcallmedom) [use case]
<gb> Issue 6 Custom operations (by dsmilkov) [v2] [device selection]
<gb> Issue 763 Request standards positions from Mozilla and WebKit (by reillyeon) [process]
<gb> Issue 807 Caching mechanism for MLGraph (by anssiko) [question] [feature request]
<Zakim> anssik, you wanted to propose next steps for #573 and to discuss priority of #883 and to bump the priority of #861
<gb> Issue 861 Evaluate sustainability impact (by anssiko) [tag-needs-resolution] [Agenda+]
<gb> Issue 573 Core operator set (by philloooo) [question] [opset] [Agenda+]
<gb> Issue 883 Support flexible input sizes (by huningxin) [feature request] [operator specific] [Agenda+]
<Zakim> reillyg, you wanted to discuss the priority of #226 versus WebGPU interop work and to discuss whether we have actionable next-steps on #6 and to discuss (likely in the context of the OT discussion) next steps on #763 and to discuss #807 ahead of prototyping work that I believe will be starting soon and to discuss defining a process for the WG to accept or modify operators.
<gb> Issue 226 Integration with real-time video processing (by dontcallmedom) [use case]
<gb> Issue 6 Custom operations (by dsmilkov) [v2] [device selection]
reillyg: identified a couple of issues worth discussing, incl introducing future work
… issue #6 on custom operator support
… we're getting close to an origin trial in Chromium, so we should request standards positions from WebKit and Mozilla
… without discussing operator support in deep detail in this meeting, it might be useful to discuss a process for adopting new operators or modifying existing operators
<ningxin> This is what we have for op change process: https://
New features
WebNN Small Language Model (SLM) Performance Optimization Case Study
Slideset: https://
Anssi: I've asked Ningxin to present a WebNN Small Language Model (SLM) Performance Optimization Case Study conducted by a group from our engineering team
… thank you Yuheng, Wei, Wanming, Jonathan, Ningxin for producing this case study to inform WebNN new features discussion with considerations for topics such as:
… - WebNN SLM support and challenges, operator fusion, on-device KV cache, tensor binding, dynamic shapes
… after this case study, we will proceed with discussion
ningxin: looking at how to optimize performance for a small language model, based on a Qwen small model
… case study conducted by a Team at Intel
Ningxin: we reused the model from the native ORT-GenAI project
… in this experiment, we focused on the WebGPU EP
… and ran the same model in the webnn-based stack using the WebGPU EP as well
Ningxin: looking at the contribution to inference time of key operators, starting with the native stack
Ningxin: the ONNX macro ops get decomposed in WebNN operators
… GroupQueryAttention requires 24 WebNN operators
Anssi: thank you Ningxin & team for this insightful case study
… access to optimized macro ops is key for SLM performance
… MatMulNBits, GQA, etc.
… Support in spec or fusion in implementation?
… Support dynamic input shapes?
… Allow same tensor for input and output?
… Decouple tensor binding and graph dispatch?
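For illustration, a minimal sketch of how MLTensor binding and dispatch work today, and where the "same tensor for input and output" question arises; the graph and its pastKv/presentKv names are hypothetical:

// Sketch: bind tensors and dispatch a compiled WebNN graph (MLTensor API).
// Assume `graph` was built and compiled elsewhere with MLGraphBuilder.
const context = await navigator.ml.createContext();
const kvCache = await context.createTensor(
    {dataType: 'float16', shape: [1, 32, 1024, 64], readable: true, writable: true});
// Binding the same tensor as both an input and an output (for in-place
// KV cache updates) is the open question noted above:
context.dispatch(graph, {pastKv: kvCache}, {presentKv: kvCache});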
reillyg: the biggest question on all these API change proposals is doing the research on what's possible on the current WebNN implementation
<Zakim> reillyg, you wanted to discuss next steps based on SLM presentation.
reillyg: we investigated dynamic shape, tensor binding (backends have support for this)
… this would be useful also for real-time application
… same-tensor in input/output: the main question is whether the graph supports it
… maybe a binding step could be used for validation
Dwayne: impressive speed up identified in the case study
Rafael: +1 on the next steps identified
Ningxin: we could look at another backend, like LiteRT
… and compare it to using ONNX
… both would be based on the same PyTorch model
… MatMulNBits allows setting an accuracy level which the underlying implementation can use to accelerate inference
… it's not clear how to encode this when doing fusion
Core operator set
Anssi: #573
<gb> Issue 573 Core operator set (by philloooo) [question] [opset] [Agenda+]
Anssi: in the case study discussion we learned many of the SLM building blocks are key for performance:
… - MatMulNBits
… - GroupQueryAttention (GQA)
… - SkipSimplifiedLayerNorm / SimplifiedLayerNorm
… - RotaryEmbedding
… the question to the group is, should we support them in spec or with fusion in implementation?
Anssi: the NVIDIA team reported they're collecting all the ops that'd benefit from being in the set; one class is various attentions, also gathers, MoE, TopK, and they're looking for other ops that'd benefit from being built in rather than composed
Markus: operator proliferation - some operators can't be decomposed into existing operators
… part of what we need to consider is whether operators we expose in the spec are going to remain useful for long enough
<DwayneR> There's a 3rd possibility too (not just built-in operator vs recognized fusion) - support subgraph composition, so that these complex operators (they are really entire graphs) are not permanently baked into the API, but still can be recognized easily and passed through to the backends.
Markus: can we enhance WebNN to be multi-layered, so that decomposition can happen at the discretion of the browser or the backend, carrying optimizations down to the hardware as necessary
… which operators are complex enough that they can't be decomposed into existing operators?
… how do we expose compound operators in WebNN?
Dwayne: TopK feels like a primitive operator that should be added to WebNN
… MatMulNBits on the other hand is a subgraph
Anssi: Dwayne had done some extensive research on this topic and re-raised the concept of aggregate operators via subgraphs
webmachinelearning/
<gb> Issue 573 Core operator set (by philloooo) [question] [opset] [Agenda+]
Dwayne: another possibility is to support the concept of subgraphs that can be referred to as operator later, as Ningxin described at TPAC last year
… we don't have a specific issue for that atm
… I'll open one
Anssi: Markus, could you capture your feedback on github? maybe on a meta issue around optimization
Markus: we fully agree with what has been described so far
Markus: if we had subgraphs, it would also reduce the number of nodes and speed up computation, as Ningxin was describing
RESOLUTION: open a new issue for aggregate operators via subgraphs
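For illustration, a hypothetical sketch of aggregate operators via subgraphs; builder.subgraph() and builder.invoke() are invented names, not part of the WebNN spec:

// Hypothetical: define a decomposition once, then reference it as a composite op.
// A backend could recognize the subgraph and map it to a fused kernel,
// or fall back to executing the decomposition as-is.
const rmsNorm = builder.subgraph((b, x, scale) => {
  // SimplifiedLayerNorm-style decomposition: x * scale / sqrt(mean(x^2))
  const meanSquare = b.reduceMean(b.mul(x, x), {axes: [3], keepDimensions: true});
  return b.mul(b.div(x, b.sqrt(meanSquare)), scale);  // assumes a 4-D input
});
const normalized = builder.invoke(rmsNorm, [input, weights]);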
Support flexible input sizes
Anssi: issue #883
<gb> Issue 883 Support flexible input sizes (by huningxin) [feature request] [operator specific] [Agenda+]
Anssi: this issue was initiated because many models, e.g. vision models, require flexible input sizes that are determined at inference time
… a few scenarios mentioned:
… - input images with different resolutions for vision models
… a concrete example would be MODNet, a model for real-time portrait matting
https://
Anssi: - transformers with arbitrary input lengths
… when using KV cache, need to increase the KV cache length by 1 per each inference
… - speech recognition
… example Whisper encoder with arbitrary input lengths
… decoder increases the KV cache length by 1 per inference
… - LLMs
… for example Qwen2.5-0.5B-Instruct also increases KV cache length at inference time
… "Lack of the support for flexible input sizes increases the complexity of using WebNN for those models."
… - complexity reduction: currently one needs to modify the model and fix the input size before compiling
… - at inference time: currently one needs to resize or pad the image input
… - native support: flexible input sizes are already supported by native frameworks
Anssi: Dwayne responded with considerations:
webmachinelearning/
<gb> Issue 883 Support flexible input sizes (by huningxin) [feature request] [operator specific] [Agenda+]
Dwayne: [voicing his thoughts written up in webmachinelearning/
… a step between binding and building the graph would help
Anssi: per our latest discussion, we're now exploring the following as a group:
anssik: reillyg raised questions around how this would get implemented on existing backends and the performance implications
Anssi: - how this will be implemented by backends
… - what is the role of WebNN in this decision, the framework could build multiple graphs
… - understand performance bottlenecks (of multiple graphs)
reillyg: all the backends we target support some form of dynamic shapes
… two API parts: where are the dynamic shapes identified in the graph? (either arbitrary, or among a set of well-defined dimensions)
… what API is used to switch to dynamic?
… then figure out how to translate that to the backends while considering the performance impact
Anssi: MarkusT shared there are three different types of dynamic shapes usually used in neural network models:
… - Completely Unknown Sizes
… - Symbolic Sizes
… - Tensor-Derived Sizes
TensorRT: Working with Dynamic Shapes
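For illustration, a hypothetical sketch of "symbolic sizes" in WebNN; the string dimension, the bounds option and the per-dispatch shape map are all invented, not part of the spec:

// Hypothetical symbolic dimensions for a transformer input.
const tokens = builder.input('tokens', {
  dataType: 'int32',
  shape: [1, 'seqLen'],  // 'seqLen' is a symbolic dimension, resolved at dispatch
  // Optional bounds could let backends pre-optimize, as TensorRT profiles do:
  // bounds: {seqLen: {min: 1, max: 4096}}
});
// At dispatch time the concrete value would be supplied, e.g.:
// context.dispatch(graph, {tokens: t}, {logits: out}, {seqLen: 128});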
Markus: [voicing webmachinelearning/
<gb> Issue 883 Support flexible input sizes (by huningxin) [feature request] [operator specific] [Agenda+]
Markus: Symbolic Sizes feels like the most interesting for WebNN
… it might be useful to have a way to define the ranges of sizes to optimize backend preparation
… does anyone feel symbolic sizes wouldn't work for them?
Ningxin: if we define symbolic size, does that include calculation on these symbols?
Markus: that would need more discussion; the more maths we allow, the more complexity
Ningxin: so we should start with the simplest approach and iterate as we identify the needs
Markus: dispatch would be the phase where the dimensions would be updated
Ningxin: this would match Chromium's implementation
reillyg: I wonder if in most cases the need to express complex mathematical functions for shapes goes away
… with the risk that in some intermediate nodes it would not be possible to express the computed bounds of an operator
… there are two separate pieces: API shape and validation
… the WebNN API only has developers provide shapes for inputs
… the API then provides back to developers the shapes of intermediate nodes, computed based on the operators
… with static shapes, this can be computed statically at the time of graph building
… if we move to dynamic, we either have to say "we don't know" or express it with symbols
Markus: just saying "it's dynamic" feels reasonable
reilly: we use the computed shape of a graph as part of the validation internally
… I would have to check whether, with dynamic shapes, this would have to be done by the developers themselves
DwayneR: we should distinguish flexible model size and dynamic shape
Markus: that matches "tensor-derived sizes" in my taxonomy
Ningxin: some prototyping & study would usefully inform this feature
reillyg: looking at the existing graph validation code and how much more complicated it becomes with dynamic shape
RESOLUTION: study more backends and do prototyping before more formally specifying solution
Ningxin: once we have that prototype in the browser, we should look at whether it makes it easier to deploy existing language models without modification
API to Separate Graph Building from Weight Loading to Reduce Peak Memory Usage
Anssi: proposal issue #901 by Markus
<gb> Issue 901 Proposal: API to Separate Graph Building from Weight Loading to Reduce Peak Memory Usage (by mtavenrath)
Anssi: a proposal to reduce peak memory usage
… using Stable Diffusion as a reference, CPU memory used during graph building is more than 3x the actual model size using dGPU, also high on iGPU
… the proposal is to introduce an API that splits graph creation from weight passing, i.e. loading the constant data
… to enable streaming weights directly into the graph during initialization
… do all current WebNN backends support weightless graph creation, where all tensor shapes and data types are known, but the actual weight data is not provided until a later step?
… using dGPU, this would limit the peak CPU memory overhead to 1x-2x the size of the largest single tensor
… using iGPU, no temporary CPU-side storage would be needed for the "upload" as memory is shared, reducing the total peak CPU memory consumption down to roughly Model Size + Max Single Tensor Size
Markus: my experience is that even desktops/notebooks still have limited memory, typically 16GB
… even more so on mobile devices
… right now, models in WebNN can't make use of all available memory due to the memory used for loading them
reillyg: model caching is related to this: how do we get the model weights the developer provides to the underlying framework as efficiently as possible?
… the frameworks want to see the weights to repack them to match memory layout
… we are constrained to having weights available during graph building
… but clearly graph building isn't memory efficient for now
<ningxin> Would constant tensors solve this issue? https://
reillyg: more an implementation issue I believe
… some of the changes we need for caching would help address this performance issue
… maybe from an API perspective, an improvement would be for situations with a very large constant we wouldn't want to load into memory at all
… which could be improved by streaming the constant
Markus: do we have the list of operators that would require the constant to be known at build time?
reillyg: we can get that list of operators
… it's a constraint we get from backends we wish we didn't have
Markus: maybe this is something we can reach out to backend developers to change
ningxin: would the constant tensor help with this? https://
… the issue is that some underlying AI runtimes don't support this; with DirectML we can do this, but with ONNX, AFAIK there is no way, we have to put everything on CPU at session creation time
… in terms of frameworks, ONNX runtime web needs all the weights to be on CPU
… we can have the API shape, but it needs adjustment both in underlying runtimes and frameworks
Markus: right, this is an ecosystem effort in which WebNN is at the center
Rafael: in the browser, this is a multi-process architecture; the model MUST run in a different process for security
… being able to share memory across processes would be good for performance, but challenging for security
… maybe not insurmountable, but there was discomfort
Markus: zero-copy would be the dream, but reducing from 10 copies to much fewer is critical to make this usable in real-world conditions
reillyg: looking at the implementation sketch Markus put together, that almost matches how we're doing it: when a dev gives us a constant, we create a handle to it, we start uploading that constant to the backend, and let the developer continue building the graph
… the problem right now is not that the API doesn't allow you to keep only roughly the largest tensor's worth of memory - it does
… but all the implementations on the browser side and on the JS side aren't handling this very well
… ONNX runtime web requires everything to be in JS memory
… similarly today when we create a constant, we keep it in the memory in the browser - but we don't have to
… what Markus describes is how we intend the current API to work, but that's not how implementations exist today
Anssi: is there any spec change that we should make out of this conversation? any normative change for WebNN to enable this optimization?
reillyg: the only two things we might change: if we have an issue with very large constants, we may want to add a streaming constructor for constants
… and a feedback mechanism to let developers know when they can start loading the next one, with a backpressure mechanism to manage peak memory
Dom: any cooperation we should facilitate with backends/frameworks?
reillyg: I assume the ONNX Runtime Web team is aware of the memory issue
Ningxin: the main issue I think is on the backend side, and whether this would work with the various hardware EPs
… on the JS framework, we can probably have a solution
reillyg: adding a streaming constructor would also open the door to the backpressure feature I was describing
RESOLUTION: explore streaming constructor for constants
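For illustration, a hypothetical sketch of a streaming constant constructor with backpressure as discussed above; streamingConstant(), the ready promise, weightManifest and pumpBytes() are all invented names:

// Hypothetical: stream weights into the graph to cap peak CPU memory.
const response = await fetch('model-weights.bin');
const reader = response.body.getReader();
for (const desc of weightManifest) {  // e.g. [{name, dataType, shape, byteLength}, ...]
  const constant = builder.streamingConstant(desc);
  await pumpBytes(reader, constant, desc.byteLength);  // copy one tensor at a time
  await constant.ready;  // backpressure: wait until the backend consumed the bytes
}
// Peak CPU memory stays near the largest single tensor instead of the whole model.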
RafaelCintron: +1 on the importance of getting this fixed, to be clear
… I know the ONNX runtime is trying to fix very similar issues
… this will also be needed for WebGPU interop
Device selection, state of the union
Anssi: a bag of issues
… we have explored API surface-level enhancements for both "before graph compilation" tied to the MLContext and "after graph compilation" tied to the MLGraph object
… recently we reached consensus to add a simple accelerator selection mechanism
… issue #815 was addressed by PR #895
<gb> #895
<gb> #815
Anssi: the minimal design the group landed on is an `MLContext.accelerated` boolean:
interface MLContext {
    undefined destroy();
+   readonly attribute boolean accelerated;
    readonly attribute Promise<MLContextLostInfo> lost;
};
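For illustration, a minimal usage sketch of the accelerated flag; the model URLs are placeholders:

// Adapt model choice to whether GPU/NPU acceleration is available.
const context = await navigator.ml.createContext({powerPreference: 'high-performance'});
const modelUrl = context.accelerated
    ? 'models/whisper-base-f16.onnx'  // accelerated: larger, faster model
    : 'models/whisper-tiny-q8.onnx';  // CPU only: smaller, quantized model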
Anssi: a corresponding explainer update was made in #884
Anssi: we spun off issues for further discussion:
… #897 to define the "underlying execution device" concept
<gb> Issue 897 Define "underlying execution device" concept (by anssiko) [device selection]
Anssi: #900 for CPU fallback hint
<gb> Issue 900 CPU fallback hint (by anssiko) [device selection]
Anssi: #902 for usecase-driven scenarios
<gb> Issue 902 Device selection criteria for usecase-driven scenarios (by fdwr) [device selection]
Anssi: we also have a spec issue #836, PR #854 and prototype implementation for `MLGraph.devices` API
<gb> Pull Request 854 define graph.devices (by philloooo) [device selection]
<gb> Issue 836 Get devices used for a graph after graph compilation (by philloooo) [device selection]
Anssi: the latest on this is MarkusH and MikeW are exploring use cases with this design
… privacy is the key concern with this proposed API enhancement
Anssi: issue #759
<gb> Issue 759 MLOpSupportLimits should be opt-in with base functionality (by mwyrzykowski) [device selection]
Anssi: this proposal from MikeW for providing an API for listing operator support limits is informed by a similar API in WebGPU:
Anssi: the proposed MLOpSupportLimits API returns all available devices with their operator support limits
… using this information, the web app can choose one of them to initialize a context with
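For illustration, a sketch using the existing MLContext.opSupportLimits(); the gemm/float16 check is illustrative and the exact dictionary shape is per operator:

// Query per-operator support limits before deciding how to build the graph.
const context = await navigator.ml.createContext();
const limits = context.opSupportLimits();
// Each operator entry lists supported data types per argument, e.g.:
const fp16Gemm = limits.gemm?.a?.dataTypes?.includes('float16');
console.log(fp16Gemm ? 'build the graph in float16' : 'fall back to float32');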
Device selection criteria for usecase-driven scenarios
Anssi: issue #902
Anssi: any device selection feature we design should be motivated by a real-world app scenario / use case
Dwayne: no concrete proposals here
… the question is how to find the right balance between leaving more freedom to the UA and allowing situations where more device control is required
Markus: the problem is made complex because there are not only CPU/GPU/NPU, but several GPUs and NPUs, sometimes from different vendors
… WebNN is a really good target for vendors seeking to deploy on the Web interoperably
… one situation that is challenging is when they need to run multiple models at the same time
… when professional users have multiple powerful GPUs, we wouldn't want the privacy protections to make it impossible to fully take advantage of their hardware
… I wondered if a permission prompt similar to camera/mic could be acceptable, which would then grant access to a full query of devices while avoiding silent fingerprinting
Rafael: WebGL and WebGPU have a way to pick a specific high-performance adapter
… with a restriction on iframes
… wrt prompting, neither WebGL nor WebGPU has a prompt - how do you handle the situation where the user says no because of prompt fatigue?
… fingerprinting is a real issue - based on telemetry, WebGL has been massively used for fingerprinting
… I'm OK with allowing access to high performance GPUs, and maybe consider a permission prompt for super advanced use cases
reillyg: +1 to Rafael
… a solution where by default you get the GPU the browser identifies as best
… I don't think the current WebGPU implementation in Chromium allows using multiple GPUs
… maybe WebNN should allow querying for NPUs
RafaelCintron: as far as Chromium is concerned, high-perf request is only supported on Mac
… not supported on Windows - maybe coming in the future
<DwayneR> Itadakimasu 👋.
Erik: how much do we need to explore the permission prompt? vs an enterprise policy optional API?
Markus: my main point is to make sure we consider scenarios with more complex device selection than just per type
… both because multiple devices of a given type might exist, or because you need to keep a particular job on a device where another job is happening (e.g. decoding)
erik: how much does this need to be driven by the app vs via hints?
Markus: I'd be fine with hints, but I'm skeptical they'll suffice
… Another case might be benchmarking - including done by the app for device selection
dom: on the web platform, we always need to balance use cases vs. privacy, the 80/20 rule; hints vs. direct control needs this consideration
… we might be adding a huge new fingerprinting surface
Markus: we'd put this behind a permission prompt
dom: prompt fatigue and understandability are issues with adding new permission prompts
thomasN: +1 on trade-off with privacy; one successful strategy has been to look at what data has already been exposed
reillyg: the question on benchmark is a good one
… we expect developers already do this to decide what they can run
… if this is something we can provide them instead of getting them to run benchmark workloads that are wasteful
… the question is how to express capabilities as numbers, which is difficult in the same way hints are
… we've seen this as relatively successful in the WebGPU context and might be useful here for NPUs
… but unclear which numbers to provide
MikeW: do we need to expose OpsSupportLimits by processing unit?
… (as I commented on the issue https://
<gb> https://github.com/webmachinelearning/webnn/issues/902
reillyg: this would be very helpful, but it's not information made available by platforms - e.g. CoreML doesn't provide stats on the capabilities of the NPU
… similar situation in other platforms
… this would be great enhancement to the API
Anssi: how does this relate to #759?
<gb> Issue 759 MLOpSupportLimits should be opt-in with base functionality (by mwyrzykowski) [device selection]
MikeW: they're related but different; as Reilly says, the challenge is making that information queryable
reillyg: for #759, we recently updated the WPT to differentiate required and optional tests to represent that idea of things developers can rely on or not
… not sure this has been reflected in the spec
<RafaelCintron> +1 to hints for now.
reillyg: beyond choosing devices, there is also a scheduling aspect to this
… e.g. if there are real-time vs non-real time workloads running in parallel, helping the UA to schedule with hints would be useful
dom: in addition to permission prompt, there's also discussion about integrating permission management with page embedded permission control (PEPC)
… it doesn't change the discussion, but it changes how this is embedded in the UX so we don't have prompts coming from nowhere
… for a more advanced query API we need to consider it in the context of this new proposal for permission management
markus: if we have hints, how do we validate they work?
reilly: a developer can't measure how their app runs if we don't provide the metrics
… Phillis has a proposal to expose which device the model is running on
CPU fallback hint
Anssi: issue #900
<gb> Issue 900 CPU fallback hint (by anssiko) [device selection]
Anssi: the group has explored a "CPU fallback" hint, a flag to expose to web content whether a CPU fallback mechanism is active
… spun off from the "accelerated" hint, a feature discussion that landed
MarkusH: we have use cases where knowing if the workload will be accelerated is critical to deciding whether to run it or not
… we would want to abort if we detect cpu fallback before or after compilation
… before would help save download costs
reillyg: the previous discussion about OS support for GPU/NPU devices is helpful here
… in general, the answer to "is CPU fallback active" before compilation is always "yes"
… it's always supported
… how do we help developers determine whether to use a faster vs. a better model based on GPU availability?
Ningxin: we should distinguish "cpu-only" vs "cpu fallback" - the latter is always available
… what you want here is to avoid accelerated=false
… we can set context.accelerated=false if we detect the GPU/NPU won't work
reillyg: one question is "do you have a GPU/NPU?" - if not, this means we're in a CPU-only situation
… if it's about fallbacks - do we want to provide an option to fail compilation if it will end up running on CPU? but that only works after compilation, which you want to avoid
reillyg: we should clarify that the issue is about detecting whether a GPU/NPU is available - for the pre-compilation situation
<mgifford2> It's not just if a device has the GPU/NPU but if a user wants to have the LLM run on their device. It may be a matter of user preference, but also energy usage. Users may be happy running a GPU/NPU in some locations or times, and not others based on things like local energy costs or reliability. Just battery life as well.
ErikA: similar to the discussions in WebGL/WebGPU
MarkusH: we can always try it and check whether it runs well in real time
ErikA: in WebGL, you can create a context that makes it fail if you hit performance challenges
<ErikAnderson> For context: https://
reillyg: we should start simple "can it run fast at all", and look at more detailed evaluation in a later phase
Tarek: I had similar questions around concurrency: if the existing accelerated hardware is already in use, should that be exposed to the app?
reillyg: a given app might run separate models/graphics rendering in parallel - we should help the app negotiate to figure out which workloads to run where
Tarek: so the orchestration might happen on both sides?
reillyg: right - an app might have more workloads to run than are runnable in parallel on a given system
<gb> Issue 900 CPU fallback hint (by anssiko) [device selection]
MarkusH: another aspect is time-sensitivity: a video frame needs to be processed in real time, while the answer to a chat-bot query to an LLM is much less time-sensitive
MarkusH: a boolean flag on whether it is accelerated is probably a good enough starting point
Get device selection information after graph compilation
<gb> Pull Request 854 define graph.devices (by philloooo) [device selection]
<gb> Issue 836 Get devices used for a graph after graph compilation (by philloooo) [device selection]
Anssi: the group thinks we need the following two to advance:
… 1) strong use cases
… 2) check the API design is privacy preserving
Anssi: MarkusH from Google Meet shared his key use case, adaptation:
… `graph.devices` could help identify:
… a) what resources a misbehaving model is using, and
… b) which models are candidates to stop to help the situation
Anssi: MikeW commented:
… "Another way of achieving the same thing is the web app sorts its workloads in priority, terminating lower priority ones (1). Or some type of metric reporting that the model was stalled K ms waiting to run due to other work on the system and took S ms to complete (2)"
MikeW: the problem is that the information on which device has been selected isn't static
… a workload that has run on a GPU may run on the NPU the next run, or fallback to CPU
… I can see the value in expressing that the graph can run on an accelerated unit, but reporting the last device on which it ran is not very reliable
reillyg: is there still value in reporting on which devices the workload might run? e.g. GPU or NPU; would that be good enough for applications?
MikeW: could we just return that it can run accelerated vs a specific device value?
… the distinctions on specific hardware types are evolving, and it's not obvious it's needed for the app
MarkusH: I think that could work
Erik: an app author might want to know how much of the workload ran on which unit
… I'm not sure the proposal in the PR provides enough reliable context
dom: two aspects: things you want to operate on and course-correct live, and things you want to monitor to know whether to modify the system later; is a separation-of-concerns approach appropriate here?
MarkusH: when we detect that we're not operating in real-time compatible ways, we need to take action
… the proposal in the PR could help; an accelerated flag would probably suffice
phillis: if it's hybrid, what do we report?
dom: would an enum be actionable?
markush: in practice, it would depend on how much it runs on the CPU
reillyg: there is a cost in using multiple units (even GPU + NPU) - so maybe "hybrid" is worth reporting in general
… at some point, some of the performance detection can only be done by the app developers
RESOLUTION: Phillis to refine the proposal to reflect an accelerated status, with discussions on hybrid still TBD
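For illustration, a hypothetical sketch of the refined proposal; graph.accelerated is an invented name pending the updated PR:

// Hypothetical: check after compilation whether the graph ended up accelerated.
const graph = await builder.build({output});
if (!graph.accelerated) {
  // e.g. skip a real-time background-blur effect rather than glitch on CPU
  disableBlurEffect();  // app-defined fallback, hypothetical
}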
<mgifford2> How much of this is inherent to hardware design? Will switching costs matter in 5 years? Probably. What influence might the W3C have in the future of what this technology makes available?
MLOpSupportLimits
Anssi: issue #759
<gb> #759
MikeW: we should define limits that are supported across all devices
reillyg: we have this in the tests; a goal for us implementation-wise is to make sure our implementation can implement all operators, and that those operators that can't be implemented are made optional
Customer feedback & collaborations
Anssi: customer feedback, including from end-users, frameworks, and independent software vendors, is extremely important throughout the process of developing new Web APIs, starting with use case identification, requirements gathering and hands-on feedback from early adopters, all the way to the maintenance phase when large-scale deployment happens
… we have used a dedicated community-maintained repo, WebNN Awesome, to document various signals from customers and developers at large
webmachinelearning/
Anssi: I recognize many customers are not comfortable speaking publicly about their future products' use of the WebNN API at this time, so I ask for sensitivity in this regard
… that said, we have some brave early adopters who have worked with us in public
… kudos to the Google Meet team and Markus in particular for sharing feedback, reviewing our proposals and also submitting new feature requests for consideration
RTC-style workloads with response time requirements
Anssi: issue #898
<gb> Issue 898 Support for workloads with response time requirements (realtime) (by handellm) [Agenda+]
Anssi: Markus provided customer feedback from Google Meet product where RTC-style workloads have strict response time requirements
… the assumption is that the system, while not under load, is able to execute the workload
MarkusH: I see a future where we run more and more concurrent ML workloads on our system
… if the system can't detect what's real-time or not, it may not be able to orchestrate it
… e.g. audio processing needs to be run within certain time requirements at the risk of audio glitches or robot voices
… if we can't rely on these deadlines being respected, this creates an adoption blocker
… the same is true (with a different scale) for video processing
… also, there is prioritization - not all audio processing may be as critical
… we've also documented situations of misbehaving concurrent workloads
markusT: it feels like a hard problem to address in general
… e.g. ONNX runtime doesn't have a sense of real time
… tasks get queued, so if it gets queued behind a slow task (e.g. an LLM request), you can't really accelerate this
… not sure there is a prioritization mechanism on all types of devices
… even getting this orchestrated on native is hard, because the frameworks don't support the infrastructure you would need to execute properly
Tarek: do we really want to do that in WebNN?
… we're starting from this situation of wanting to run concurrent workloads via a background utility
… should this be done by the app or in the backend? does it even make sense to run several things on a GPU?
MarkusT: do you know how much of the available window the task will take?
MarkusH: on CPU, this is a solved problem with OS priorities
… when workloads get interleaved on GPUs, there is an opportunity for prioritization
MarkusT: pre-emption is now available on GPUs, but it is much more expensive than on a CPU
… but overall, this gets us back to my device selection issue
… e.g. audio processing you probably want on CPU, where the data is anyway
… conversely, video processing happens on GPU, and you'd want to use the device used to render the video as well
MarkusH: audio processing might be best run on NPU for power efficiency
MarkusT: that's where it's useful to know which devices are available, and to support benchmarking
RafaelCintron: hints could communicate priorities, that would map to processing queues
dom: a reflection: this need to orchestrate processing across latency and power efficiency comes up around many APIs on the web platform, and each time we create a hint we should align across APIs
… you don't want to switch from one GPU to another GPU, you want continuity, but how to describe that in a declarative way is the question
… these are more general questions; Google Meet is a good use case to look at RTC-application requirements
MarkusH: I was discussing with Youenn the concept of worker priority that was proposed by Intel a couple of years ago
… e.g. an "audio" worker priority would be exposed to WebNN
… and influence how that job would run
Web Worker Quality of Service breakout presentation at TPAC 2023
Implementation plans and trials
Anssi: in this session we'll discuss and share the latest news on the next step for implementations, Origin Trial or equivalent, new requirements or feedback from updates in backends, frameworks
… but first, we kick off with exciting demos to whet your appetite!
[showing running WebNN Stable Diffusion Turbo both on GPU and NPU]
[showing WebNN Segment Anything demo running on GPU and on NPU]
[showing WebNN Whisper Base on GPU and NPU]
[showing Background removal based on MODnet, demo hosted on hugging face, on GPU & NPU]
[showing real-time object detection w/ Yolo12n]
[showing real-time depth estimation w/ Depth Anything v2]
[demo of background blur done with WebNN on a full WebGPU pipeline, with 23% improved performance and 17% lower power consumption]
Anssi: thank you Ningxin for these compelling demos!
Browser vendors' trials
Anssi: we've discussed on our telcons that Origin Trial in Chrome is getting closer, latest discussion:
https://
Anssi: we also discussed that Edge works upstream with only a small 5-10 day delay and will launch an Origin Trial in sync
… more information about Origin Trials will be made available at:
reillyg: I imagine this landing in the 2nd or 3rd release of the new year for the Chrome OT
Horizontals
Anssi: in this session we get to know experts behind horizontal groups
Ethics
Anssi: for ethics, I've proposed to make this group's Ethical Principles for Web Machine Learning a joint deliverable with the Web & AI Interest Group, a group that is being proposed
… by doing this, we can tap into the expertise of that Interest Group to help advance this important deliverable on the W3C Note track
… we currently refer to this doc in the WebNN spec Ethical Considerations
https://
https://
dom: ethics has not received a lot of bandwidth, which is why we propose to make it a joint deliverable with the Web & AI IG; the document was written in 2022-23, a long time ago considering the rate of development in the AI space
dom: also, the Ethical Web Principles has been endorsed as a W3C Statement
Sustainability
Anssi: I've asked Mike Gifford, co-chair of the Sustainable Web IG to talk about the work done in that group
https://
Anssi: per TAG review feedback, we're expected to evaluate sustainability impact of WebNN, see issue #861
<gb> Issue 861 Evaluate sustainability impact (by anssiko) [tag-needs-resolution] [Agenda+]
<mgifford2> https://
<mgifford2> w3c/
<gb> Issue 139 Adding a comment about what is or isn't included with AI (by mgifford) [enhancement] [editorial]
MikeG: we're in an environmental and climate crisis, which we need to integrate into our work
… the goal of the Web Sustainability Guidelines is to create a Web standard that other institutions can use as guidelines to evaluate how sustainable their technologies are
… the size of the average Web page has grown beyond the scale of the information it's providing
… which relates to Web performance, although we have considerations that are completely separate - e.g. water consumption
… we're here because AI is changing a lot for the Web; we're seeing agentic browsers, the rise of AI in everything whether you want it or not, with a huge environmental impact, with data centers' growing impact on electricity, water, sound
… we're interested in evaluating the overlap between our groups: the environmental impact of decentralizing AI inference to devices, given their lower optimization compared to data centers
… we have a few questions:
… - what advice can you give us as we're starting to write up guidance on AI?
Anssi: what is the best way for participants of this group to help with this? github repo?
mikeG: yes
Anssi: are there AI-related issues we can help with? initially AI wasn't really part of the scope as I understand it
MikeG: right - we can't not address this given the impact of AI
… our guidelines are expected to address different contexts and audiences
… we're not sure yet on whether to include AI in a cluster or distribute it across the document
Anssik: for this group to help, having AI-focused content would make it easier
MikeG: we have lots of infrastructure to help navigate the guidelines and issues through a well-defined taxonomy, which will help with this
… there is also the question of data centers on which we could use expertise from people here
dom: sustainability is currently an IG that's working on a Note, and the direction is toward a horizontal group
… the horizontal definition is more of a cultural one; ethical web considerations tell us to consider sustainability
MikeG: how does your group deal with a fast-evolving ecosystem such as AI?
Anssik: we try to find the right level of abstractions that stand the test of time, as Web standards have tried to do
… similar to the discussion about to what extent the NPU/GPU distinction matters
<AlexDawson> W3C already has Societal Impact self-review, there is scope for a potential self-review for sustainability in the future.
reillyg: we also depend on what developers will want to use to provide the best UX
… so we're more reactive, where the sustainability work would be more proactive in pushing in a given direction
MikeG: aligning incentives towards good sustainability is a key challenge we face
… Small vs Large Language Models: the former seems more environmentally friendly; but will that distinction remain relevant over time?
Anssi: the Mobile Web Best Practices document had that very issue
Tarek: re SLM, at Mozilla the definition we used a year ago no longer works today
… we're looking at device tiers: non-capable, device with certain capabilities, high end devices
… we've found that more robust over time
MikeG: any suggestion on how to classify models instead of devices?
Tarek: anything that doesn't spit out a continuous stream of tokens is an SLM
Sushanth: we put the boundary at 7B parameters
Tarek: but it's at risk of changing in a few months
Thomas: I don't think the # of parameters belongs in a guideline context: the guideline should be about "use the smallest possible model"
… with the caveat that an already-available on-device model might be a better option
Anssi: re model selection, my mental model is "the right tool for the right job", but it is a complicated evaluation to make given the diversity of toolboxes available to people
Privacy and Security
Anssi: late last year, the Privacy Working Group was launched, replacing the Privacy Interest Group
… what's new in this transition?
https://
Tara: the transition of the IG to WG hasn't really changed much in terms of the review work
tara: Simone and I are going to run a joint presentation
anssik: I always struggle a bit to delineate between privacy and security
tara: they do have a lot in common
… we have specialized guidance, but this shouldn't be a source of concern on your end
Anssi: Security Interest Group was recently launched to reinvigorate work to advise groups developing standards on how to avoid and mitigate security issues
Slideset: https://
ThomasN: re fingerprinting, one of the perennial questions that keeps popping up is that there is already so much entropy that it's not obvious how much of it can still be mitigated
… is this a tractable problem, and something worth spending time mitigating at the spec level?
Tara: I think we're pushing towards a better space and so we feel it's worth considering the trade-offs that keep that path open
ThomasN: it's hard to evaluate the cost of developing the API and, more importantly, the ability of developers to fulfill their use cases
christianliebel: the APIs we build in the CG/WG are on-device
Wrap up
Anssi: thank you everyone for your active participation and productive discussions
… this day was packed and we managed to finish with gusto!
christianliebel: how different would a trusted execution environment in the cloud be from the security/privacy perspective?
Anssi: special thank you to our guests Mike, Tara, Simone, who joined to share important work happening across horizontals
… also huge thanks to Ningxin & team for the case study and compelling demos that both inform our future direction and demonstrate the exciting web experiences we already enable today with WebNN
… interested folks are welcome to join us for a dinner
… we're quite many, so the plan is to meet in the Portopia Hotel (adjacent to the Kobe International Conference Center) lobby at 18:15 to coordinate transport and restaurants, likely splitting into multiple groups based on preferences