Meeting minutes
<DenisD> Remote + Denis_DIDIER
<gb> https://github.com/webmachinelearning/meetings/issues/35
Repository: webmachinelearning/webnn
Welcome
Anssi: welcome to the W3C Web Machine Learning WG F2F at TPAC 2025, this is our second physical F2F
… I'm Anssi Kostiainen, Intel, the chair of the WG
… with me is Dom, Dominique Hazael-Massieux, W3C Staff, helping run the meeting smoothly
… again great to see so many folks here in person, including new people outside the usual WG participants, and guests representing Japanese W3C members and organizations
… Arigato gozaimasu!
… this WG has continued to grow rapidly since last year, we have all major browser vendors on board and new folks are joining
… the YoY growth is around +30% in both organizations and participants, for both this WG and its sister CG
… a few new members who joined the WG since last F2F:
… Hugging Face
… Qualcomm
… NVIDIA
… ARM
… Shopify
… we are working at the intersection of Web & AI/ML technologies during this time of exponential growth in AI and we're lucky to have such a diverse group of experts onboard:
… all browser vendors, OS vendors, major semiconductor companies invested in AI, major platform providers, ISVs, distinguished researchers from academia, individuals, and more
… if you registered as a WG participant, please join us at the table
… observers are welcome to join the table too subject to available space
Anssi: we use Zoom for a hybrid meeting experience, please join using the link in the meeting invite
Anssi: we use IRC for official meeting minutes and for managing the speaker queue
… please join the #webmachinelearning IRC channel, link in the meeting invite and agenda:
https://
webmachinelearning/
<gb> Issue 35 WebML WG/CG F2F Agenda - TPAC 2025 (Kobe, Japan) (by anssiko)
Anssi: to put yourself on the queue type in IRC "q+"
… during the introductions round, we'll try to record everyone's participation on IRC with:
… Present+ Firstname_Lastname
… please check that your participation is recorded on IRC so we're able to acknowledge your presence in the meeting minutes
Intros
Anssi: since we're many again, we'll do a quick round of introductions, 15 seconds each, full name, affiliation and key interest
Dwayne: WebNN spec editor, with a focus on new operators, at Microsoft
Rafael: also at Microsoft, on Edge browser, working on all things AI and graphic rendering
MikeW: at Apple, involved in WebNN and also WebGPU
Denis: involved in Sustainable Web Guidelines
Phillis: at Google, working with Reilly on WebNN implementation on Chromium
<DenisD> Introduction : Denis DIDIER, from France - Company ITHENKA, Contributor to W3C Sustainable Web Guidelines, and Sustainable AI with french non-profit Institute for Sustainable IT.
Reilly: implementing WebNN in Chromium
Shushan: at Microsoft, built-in AI
ErikA: manager on edge browser team
Ugur: working on AI solutions for the construction industry as chief AI officer
AndrewW: at ARM, involved in our open source and standards strategy team
Tarek: from Mozilla on the Firefox AI team
Markus: from NVidia, devtech supporting ISVs integrating ML in their apps, getting involved in standardization to make their lives easier
Ningxin: co-editor of WebNN spec at Intel
Dom: staff contact for the group and looking at impact of AI on the Web
@@@: at Google looking at integration of WebNN in Chromium
MarkF: at Google, working on Chrome on AI & agentic features, involved on WebMCP
Kenji: chrome, built-in AI
Ben: Chrome, similar to Mark on agentic AI
@@@2: Chrome team, WebMCP
Thomas: devrel at Chrome
Ali: program manager in Google, supervising ML/GPU work
<kbx> @@@2 is Rob Kochman.
DavidEzell: Connexus - excited by this group; we're a standards body hoping to turbocharge retail vendors with our standards
Brian: @@@
YutaHagio: working for NHK, Japanese broadcaster
ChiaraCerretti: @@@
GuidoU: Google, WebRTC APIs in Chrome, exploring application of AI
MarkusH: Google Meet, interested in AI/WebRTC
Diogo: Brazilian W3C Office
SamGoto: Google Chrome, platform APIs
@@@: Meta browser
Masao: @@@
Agenda bashing
Anssi: the F2F agenda was built collaboratively with you, the WG participants and is published on GH:
webmachinelearning/
<gb> Issue 35 WebML WG/CG F2F Agenda - TPAC 2025 (Kobe, Japan) (by anssiko)
Anssi: any last-minute proposals or updates?
<kbx> Meet after the meeting for meat or no meat.
Charter orientation
Anssi: we have two groups, Web Machine Learning Working Group (WG) and Community Group (CG)
… WG standardizes Web APIs for on-device inference using CG incubations as its seeds
… deliverables: WebNN API, Ethical Principles
Anssi: we're looking to make the Ethical Principles as a joint deliverable with the proposed Web & AI Interest Group
… this informative document is a reference from the WebNN API spec
… CG is a group where new ideas are discussed, explored and incubated before formal standardization
… past CG spec incubations include e.g. WebNN, Model Loader
… since last year, we've expanded the scope of the CG to built-in AI APIs and agentic web capabilities
Anssi: current CG deliverables:
… Prompt API
… Writing Assistance APIs
… Translator and Language Detector APIs
… Proofreader API
… WebMCP API
… the CG technical scope is higher-level task-based APIs and the agentic web feature WebMCP
… while the WG technical scope is the lower-level WebNN API, the graph builder abstraction
Anssi: the WG and CG work closely together and coordinate with other W3C groups, for example:
… - WebGPU WG/CG for WebNN-WebGPU interop
… - Wasm CG
… - WebRTC for media processing-related integrations
… - AI Agent Protocol Community Group for agentic protocols
… - And with horizontals: privacy, security, a11y, also emerging sustainability and ethics
dom: we operate under W3C CoC and antitrust guidance for W3C, under W3C Patent Policy
Spec orientation
Anssi: we have scheduled time before our first break to do a triage pass through open issues
… the plan is to collaboratively look at our backlog of issues and PRs to:
https://
Anssi: - focus on breaking changes
… - check priorities
… - set next steps for the issues
… let's use IRC to queue proposals, examples:
<gb> Issue 573 Core operator set (by philloooo) [question] [opset] [Agenda+]
<gb> Issue 883 Support flexible input sizes (by huningxin) [feature request] [operator specific] [Agenda+]
<gb> Issue 861 Evaluate sustainability impact (by anssiko) [tag-needs-resolution] [Agenda+]
Anssi: this is your last-minute opportunity to influence today's agenda
… we'll first record triage results on IRC during the first ~15 mins, then review as a group and continue to discuss and refine on the hallway track with coffee/tea
<gb> Issue 226 Integration with real-time video processing (by dontcallmedom) [use case]
<gb> Issue 6 Custom operations (by dsmilkov) [v2] [device selection]
<gb> Issue 763 Request standards positions from Mozilla and WebKit (by reillyeon) [process]
<gb> Issue 807 Caching mechanism for MLGraph (by anssiko) [question] [feature request]
<Zakim> anssik, you wanted to propose next steps for #573 and to discuss priority of #883 and to bump the priority of #861
<gb> Issue 861 Evaluate sustainability impact (by anssiko) [tag-needs-resolution] [Agenda+]
<gb> Issue 573 Core operator set (by philloooo) [question] [opset] [Agenda+]
<gb> Issue 883 Support flexible input sizes (by huningxin) [feature request] [operator specific] [Agenda+]
<Zakim> reillyg, you wanted to discuss the priority of #226 versus WebGPU interop work and to discuss whether we have actionable next-steps on #6 and to discuss (likely in the context of the OT discussion) next steps on #763 and to discuss #807 ahead of prototyping work that I believe will be starting soon and to discuss defining a process for the WG to accept or modify operators.
<gb> Issue 226 Integration with real-time video processing (by dontcallmedom) [use case]
<gb> Issue 6 Custom operations (by dsmilkov) [v2] [device selection]
reillyg: identified a couple of issues worth discussing, incl introducing future work
… issue #6 on custom operator support
… we're getting close to an origin trial in Chromium, so we should request standards positions from WebKit and Mozilla
… without discussing operator support in deep detail in this meeting, it might be useful to discuss a process for adopting new operators or modifying existing operators
<ningxin> This is what we have for op change process: https://
New features
WebNN Small Language Model (SLM) Performance Optimization Case Study
Slideset: https://
Anssi: I've asked Ningxin to present a WebNN Small Language Model (SLM) Performance Optimization Case Study conducted by a group from our engineering team
… thank you Yuheng, Wei, Wanming, Jonathan, Ningxin for producing this case study to inform WebNN new features discussion with considerations for topics such as:
… - WebNN SLM support and challenges, operator fusion, on-device KV cache, tensor binding, dynamic shapes
… after this case study, we will proceed with discussion
ningxin: looking at how to optimize performance for a small language model, based on a Qwen small model
… case study conducted by a Team at Intel
Ningxin: we reused the model from the native ORT-GenAI project
… in this experiment, we focused on the WebGPU EP
… and ran the same model in the webnn-based stack using the WebGPU EP as well
Ningxin: looking at the contribution to inference time of key operators, starting with the native stack
Ningxin: the ONNX macro ops get decomposed in WebNN operators
… GroupQueryAttention requires 24 WebNN operators
Anssi: thank you Ningxin & team for this insightful case study
… access to optimized macro ops is key for SLM performance
… MatMulNBits, GQA, etc.
… Support in spec or fusion in implementation?
… Support dynamic input shapes?
… Allow same tensor for input and output?
… Decouple tensor binding and graph dispatch?
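For illustration, a minimal sketch of how MLTensor binding and dispatch work today, and where the "same tensor for input and output" question arises; the graph and its pastKv/presentKv names are hypothetical:

// Sketch: bind tensors and dispatch a compiled WebNN graph (MLTensor API).
// Assume `graph` was built and compiled elsewhere with MLGraphBuilder.
const context = await navigator.ml.createContext();
const kvCache = await context.createTensor(
    {dataType: 'float16', shape: [1, 32, 1024, 64], readable: true, writable: true});
// Binding the same tensor as both an input and an output (for in-place
// KV cache updates) is the open question noted above:
context.dispatch(graph, {pastKv: kvCache}, {presentKv: kvCache});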
reillyg: the biggest question on all these API change proposals is doing the research on what's possible on the current WebNN implementation
<Zakim> reillyg, you wanted to discuss next steps based on SLM presentation.
reillyg: we investigated dynamic shape, tensor binding (backends have support for this)
… this would be useful also for real-time application
… same-tensor in input/output: the main question is whether the graph supports it
… maybe a binding step could be used for validation
Dwayne: impressive speed up identified in the case study
Rafael: +1 on the next steps identified
Ningxin: we could look at another backend, like LiteRT
… and compare it to using ONNX
… both would be based on the same PyTorch model
… MatMulNBits allows setting an accuracy level which the underlying implementation can use to accelerate inference
… it's not clear how to encode this when doing fusion
Core operator set
Anssi: #573
<gb> Issue 573 Core operator set (by philloooo) [question] [opset] [Agenda+]
Anssi: in the case study discussion we learned many of the SLM building blocks are key for performance:
… - MatMulNBits
… - GroupQueryAttention (GQA)
… - SkipSimplifiedLayerNorm / SimplifiedLayerNorm
… - RotaryEmbedding
… the question to the group is, should we support them in spec or with fusion in implementation?
Anssi: the NVIDIA team reported they're collecting all the ops that'd benefit from being in the set; one class is various attentions, also gathers, MoE, TopK, and they're looking for other ops that'd benefit from being built in rather than composed
Markus: operator proliferation - some operators can't be decomposed into existing operators
… part of what we need to consider is whether operators we expose in the spec are going to remain useful for long enough
<DwayneR> There's a 3rd possibility too (not just built-in operator vs recognized fusion) - support subgraph composition, so that these complex operators (they are really entire graphs) are not permanently baked into the API, but still can be recognized easily and passed through to the backends.
Markus: can we enhance WebNN to be multi-layered, so that decomposition can happen at the discretion of the browser or the backend, carrying optimizations down to the hardware as necessary
… which operators are complex enough that they can't be decomposed into existing operators?
… how do we expose compound operators in WebNN?
Dwayne: TopK feels like a primitive operator that should be added to WebNN
… MatMulNBits on the other hand is a subgraph
Anssi: Dwayne had done some extensive research on this topic and re-raised the concept of aggregate operators via subgraphs
webmachinelearning/
<gb> Issue 573 Core operator set (by philloooo) [question] [opset] [Agenda+]
Dwayne: another possibility is to support the concept of subgraphs that can be referred to as operator later, as Ningxin described at TPAC last year
… we don't have a specific issue for that atm
… I'll open one
Anssi: Markus, could you capture your feedback on github? maybe on a meta issue around optimization
Markus: we fully agree with what has been described so far
Markus: if we had subgraphs, it would also reduce the number of nodes and speed up computation, as Ningxin was describing
RESOLUTION: open a new issue for aggregate operators via subgraphs
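For illustration, a hypothetical sketch of aggregate operators via subgraphs; builder.subgraph() and builder.invoke() are invented names, not part of the WebNN spec:

// Hypothetical: define a decomposition once, then reference it as a composite op.
// A backend could recognize the subgraph and map it to a fused kernel,
// or fall back to executing the decomposition as-is.
const rmsNorm = builder.subgraph((b, x, scale) => {
  // SimplifiedLayerNorm-style decomposition: x * scale / sqrt(mean(x^2))
  const meanSquare = b.reduceMean(b.mul(x, x), {axes: [3], keepDimensions: true});
  return b.mul(b.div(x, b.sqrt(meanSquare)), scale);  // assumes a 4-D input
});
const normalized = builder.invoke(rmsNorm, [input, weights]);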
Support flexible input sizes
Anssi: issue #883
<gb> Issue 883 Support flexible input sizes (by huningxin) [feature request] [operator specific] [Agenda+]
Anssi: this issue was initiated because many models, e.g. vision models, require flexible input sizes that are determined at inference time
… a few scenarios mentioned:
… - input images with different resolutions for vision models
… a concrete example would be MODNet, a model for real-time portrait matting
https://
Anssi: - transformers with arbitrary input lengths
… when using KV cache, need to increase the KV cache length by 1 per each inference
… - speech recognition
… example Whisper encoder with arbitrary input lengths
… decoder increases the KV cache length by 1 per inference
… - LLMs
… for example Qwen2.5-0.5B-Instruct also increases KV cache length at inference time
… "Lack of the support for flexible input sizes increases the complexity of using WebNN for those models."
… - complexity reduction: currently one needs to modify the model and fix the input size before compiling
… - at inference time: currently one needs to resize or pad the image input
… - native support: flexible input sizes are already supported by native frameworks
Anssi: Dwayne responded with considerations:
webmachinelearning/
<gb> Issue 883 Support flexible input sizes (by huningxin) [feature request] [operator specific] [Agenda+]
Dwayne: [voicing his thoughts written up in webmachinelearning/
… a step between binding and building the graph would help
Anssi: per our latest discussion, we're now exploring the following as a group:
anssik: reillyg raised questions around how this would get implemented on existing backends and the performance implications
Anssi: - how this will be implemented by backends
… - what is the role of WebNN in this decision, the framework could build multiple graphs
… - understand performance bottlenecks (of multiple graphs)
reillyg: all the backends we target support some form of dynamic shapes
… two API parts: where are the dynamic shapes identified in the graph? (either arbitrary, or among a set of well-defined dimensions)
… what API is used to switch to dynamic?
… then figure out how to translate that to the backends while considering the performance impact
Anssi: MarkusT shared there are three different types of dynamic shapes usually used in neural network models:
… - Completely Unknown Sizes
… - Symbolic Sizes
… - Tensor-Derived Sizes
TensorRT: Working with Dynamic Shapes
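For illustration, a hypothetical sketch of "symbolic sizes" in WebNN; the string dimension, the bounds option and the per-dispatch shape map are all invented, not part of the spec:

// Hypothetical symbolic dimensions for a transformer input.
const tokens = builder.input('tokens', {
  dataType: 'int32',
  shape: [1, 'seqLen'],  // 'seqLen' is a symbolic dimension, resolved at dispatch
  // Optional bounds could let backends pre-optimize, as TensorRT profiles do:
  // bounds: {seqLen: {min: 1, max: 4096}}
});
// At dispatch time the concrete value would be supplied, e.g.:
// context.dispatch(graph, {tokens: t}, {logits: out}, {seqLen: 128});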
Markus: [voicing webmachinelearning/
<gb> Issue 883 Support flexible input sizes (by huningxin) [feature request] [operator specific] [Agenda+]
Markus: Symbolic Sizes feels like the most interesting for WebNN
… it might be useful to have a way to define the ranges of sizes to optimize backend preparation
… does anyone feel symbolic sizes wouldn't work for them?
Ningxin: if we define symbolic size, does that include calculation on these symbols?
Markus: that would need more discussion; the more maths we allow, the more complexity
Ningxin: so we should start with the simplest approach and iterate as we identify the needs
Markus: dispatch would be the phase where the dimensions would be updated
Ningxin: this would match Chromium's implementation
reillyg: I wonder if in most cases the need to express complex mathematical functions for shapes goes away
… with the risk that in some intermediate nodes it would not be possible to express the computed bounds of an operator
… there are two separate pieces: API shape and validation
… the WebNN API only has developers provide shapes for inputs
… the API then provides back to developers the shapes of intermediate nodes, computed based on the operators
… with static shapes, this can be computed statically at the time of graph building
… if we move to dynamic, we either have to say "we don't know" or express it with symbols
Markus: just saying "it's dynamic" feels reasonable
reilly: we use the computed shape of a graph as part of the validation internally
… I would have to check whether, with dynamic shapes, this would have to be done by the developers themselves
DwayneR: we should distinguish flexible model size and dynamic shape
Markus: that matches "tensor-derived sizes" in my taxonomy
Ningxin: some prototyping & study would usefully inform this feature
reillyg: looking at the existing graph validation code and how much more complicated it becomes with dynamic shape
RESOLUTION: study more backends and do prototyping before more formally specifying solution
Ningxin: once we have that prototype in the browser, we should look at whether it makes it easier to deploy existing language models without modification
API to Separate Graph Building from Weight Loading to Reduce Peak Memory Usage
Anssi: proposal issue #901 by Markus
<gb> Issue 901 Proposal: API to Separate Graph Building from Weight Loading to Reduce Peak Memory Usage (by mtavenrath)
Anssi: a proposal to reduce peak memory usage
… using Stable Diffusion as a reference, CPU memory used during graph building is more than 3x the actual model size using dGPU, also high on iGPU
… the proposal is to introduce an API that splits graph creation from weight passing, i.e. loading the constant data
… to enable streaming weights directly into the graph during initialization
… do all current WebNN backends support weightless graph creation, where all tensor shapes and data types are known, but the actual weight data is not provided until a later step?
… using dGPU, this would limit the peak CPU memory overhead to 1x-2x the size of the largest single tensor
… using iGPU, no temporary CPU-side storage would be needed for the "upload" as memory is shared, reducing the total peak CPU memory consumption down to roughly Model Size + Max Single Tensor Size
Markus: my experience is that even desktops/notebooks still have limited memory, typically 16GB
… even more so on mobile devices
… right now, models in WebNN can't make use of all available memory due to the memory used for loading them
reillyg: model caching is related to this: how do we get the model weights the developer provides to the underlying framework as efficiently as possible?
… the frameworks want to see the weights to repack them to match memory layout
… we are constrained to having weights available during graph building
… but clearly graph building isn't memory efficient for now
<ningxin> Would constant tensors solve this issue? https://
reillyg: more an implementation issue I believe
… some of the changes we need for caching would help address this performance issue
… maybe from an API perspective, an improvement would be for situations with a very large constant we wouldn't want to load into memory at all
… which could be improved by streaming the constant
Markus: do we have the list of operators that would require the constant to be known at build time?
reillyg: we can get that list of operators
… it's a constraint we get from backends we wish we didn't have
Markus: maybe this is something we can reach out to backend developers to change
ningxin: would the constant tensor help with this? https://
… the issue is that some underlying AI runtimes don't support this; with DirectML we can do this, but with ONNX, AFAIK there is no way, we have to put everything on CPU at session creation time
… in terms of frameworks, ONNX runtime web needs all the weights to be on CPU
… we can have the API shape, but it needs adjustment both in underlying runtimes and frameworks
Markus: right, this is an ecosystem effort in which WebNN is at the center
Rafael: in the browser, this is a multi-process architecture; the model MUST run in a different process for security
… being able to share memory across processes would be good for performance, but challenging for security
… maybe not insurmountable, but there was discomfort
Markus: zero-copy would be the dream, but reducing from 10 copies to much fewer is critical to make this usable in real-world conditions
reillyg: looking at the implementation sketch Markus put together, that almost matches how we're doing it: when a dev gives us a constant, we create a handle to it, we start uploading that constant to the backend, and let the developer continue building the graph
… the problem right now is not that the API doesn't allow you to keep only roughly the largest tensor's worth of memory - it does
… but all the implementations on the browser side and on the JS side aren't handling this very well
… ONNX runtime web requires everything to be in JS memory
… similarly today when we create a constant, we keep it in the memory in the browser - but we don't have to
… what Markus describes is how we intend the current API to work, but that's not how implementations exist today
Anssi: is there any spec change that we should make out of this conversation? any normative change for WebNN to enable this optimization?
reillyg: the only two things we might change: if we have an issue with very large constants, we may want to add a streaming constructor for constants
… and a feedback mechanism to let developers know when they can start loading the next one, with a backpressure mechanism to manage peak memory
Dom: any cooperation we should facilitate with backends/frameworks?
reillyg: I assume the ONNX Runtime Web team is aware of the memory issue
Ningxin: the main issue I think is on the backend side, and whether this would work with the various hardware EPs
… on the JS framework, we can probably have a solution
reillyg: adding a streaming constructor would also open the door to the backpressure feature I was describing
RESOLUTION: explore streaming constructor for constants
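For illustration, a hypothetical sketch of a streaming constant constructor with backpressure as discussed above; streamingConstant(), the ready promise, weightManifest and pumpBytes() are all invented names:

// Hypothetical: stream weights into the graph to cap peak CPU memory.
const response = await fetch('model-weights.bin');
const reader = response.body.getReader();
for (const desc of weightManifest) {  // e.g. [{name, dataType, shape, byteLength}, ...]
  const constant = builder.streamingConstant(desc);
  await pumpBytes(reader, constant, desc.byteLength);  // copy one tensor at a time
  await constant.ready;  // backpressure: wait until the backend consumed the bytes
}
// Peak CPU memory stays near the largest single tensor instead of the whole model.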
RafaelCintron: +1 on the importance of getting this fixed, to be clear
… I know the ONNX runtime is trying to fix very similar issues
… this will also be needed for WebGPU interop
Device selection, state of the union
Anssi: a bag of issues
… we have explored API surface-level enhancements for both "before graph compilation" tied to the MLContext and "after graph compilation" tied to the MLGraph object
… recently we reached consensus to add a simple accelerator selection mechanism
… issue #815 was addressed by PR #895
<gb> #895
<gb> #815
Anssi: the minimal design the group landed on is an `MLContext.accelerated` boolean:
interface MLContext {
    undefined destroy();
+   readonly attribute boolean accelerated;
    readonly attribute Promise<MLContextLostInfo> lost;
};
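For illustration, a minimal usage sketch of the accelerated flag; the model URLs are placeholders:

// Adapt model choice to whether GPU/NPU acceleration is available.
const context = await navigator.ml.createContext({powerPreference: 'high-performance'});
const modelUrl = context.accelerated
    ? 'models/whisper-base-f16.onnx'  // accelerated: larger, faster model
    : 'models/whisper-tiny-q8.onnx';  // CPU only: smaller, quantized model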
Anssi: a corresponding explainer update was made in #884
Anssi: we spun off issues for further discussion:
… #897 to define the "underlying execution device" concept
<gb> Issue 897 Define "underlying execution device" concept (by anssiko) [device selection]
Anssi: #900 for CPU fallback hint
<gb> Issue 900 CPU fallback hint (by anssiko) [device selection]
Anssi: #902 for usecase-driven scenarios
<gb> Issue 902 Device selection criteria for usecase-driven scenarios (by fdwr) [device selection]
Anssi: we also have a spec issue #836, PR #854 and prototype implementation for `MLGraph.devices` API
<gb> Pull Request 854 define graph.devices (by philloooo) [device selection]
<gb> Issue 836 Get devices used for a graph after graph compilation (by philloooo) [device selection]
Anssi: the latest on this is MarkusH and MikeW are exploring use cases with this design
… privacy is the key concern with this proposed API enhancement
Anssi: issue #759
<gb> Issue 759 MLOpSupportLimits should be opt-in with base functionality (by mwyrzykowski) [device selection]
Anssi: this proposal from MikeW for providing an API for listing operator support limits is informed by a similar API in WebGPU:
Anssi: the proposed MLOpSupportLimits API returns all available devices with their operator support limits
… using this information, the web app can choose one of them to initialize a context with
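For illustration, a sketch using the existing MLContext.opSupportLimits(); the gemm/float16 check is illustrative and the exact dictionary shape is per operator:

// Query per-operator support limits before deciding how to build the graph.
const context = await navigator.ml.createContext();
const limits = context.opSupportLimits();
// Each operator entry lists supported data types per argument, e.g.:
const fp16Gemm = limits.gemm?.a?.dataTypes?.includes('float16');
console.log(fp16Gemm ? 'build the graph in float16' : 'fall back to float32');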
Device selection criteria for usecase-driven scenarios
Anssi: issue #902
Anssi: any device selection feature we design should be motivated by a real-world app scenario / use case
Dwayne: no concrete proposals here
… the question is how to find the right balance between leaving more freedom to the UA and allowing situations where more device control is required
Markus: the problem is made complex because there are not only CPU/GPU/NPU, but several GPUs and NPUs, sometimes from different vendors
… WebNN is a really good target for vendors seeking to deploy on the Web interoperably
… one situation that is challenging is when they need to run multiple models at the same time
… when professional users have multiple powerful GPUs, we wouldn't want the privacy protections to make it impossible to fully take advantage of their hardware
… I wondered if a permission prompt similar to camera/mic could be acceptable, which would then grant access to a full query of devices while avoiding silent fingerprinting
Rafael: WebGL and WebGPU have a way to pick a specific high-performance adapter
… with a restriction on iframes
… wrt prompting, neither WebGL nor WebGPU has a prompt - how do you handle the situation where the user says no because of prompt fatigue?
… fingerprinting is a real issue - based on telemetry, WebGL has been massively used for fingerprinting
… I'm OK with allowing access to high performance GPUs, and maybe consider a permission prompt for super advanced use cases
reillyg: +1 to Rafael
… a solution where by default you get the GPU the browser identifies as best
… I don't think the current WebGPU implementation in Chromium allows using multiple GPUs
… maybe WebNN should allow querying for NPUs
RafaelCintron: as far as Chromium is concerned, high-perf request is only supported on Mac
… not supported on Windows - maybe coming in the future
<DwayneR> Itadakimasu 👋.
Erik: how much do we need to explore the permission prompt? vs an enterprise policy optional API?
Markus: my main point is to make sure we consider scenarios with more complex device selection than just per type
… both because multiple devices of a given type might exist, or because you need to keep a particular job on a device where another job is happening (e.g. decoding)
erik: how much does this need to be driven by the app vs via hints?
Markus: I'd be fine with hints, but I'm skeptical they'll suffice
… Another case might be benchmarking - including done by the app for device selection
dom: on the web platform, we always need to balance use cases vs. privacy, the 80/20 rule; hints vs. direct control needs this consideration
… we might be adding a huge new fingerprinting surface
Markus: we'd put this behind a permission prompt
dom: prompt fatigue and understandability are issues with adding new permission prompts
thomasN: +1 on trade-off with privacy; one successful strategy has been to look at what data has already been exposed
reillyg: the question on benchmark is a good one
… we expect developers already do this to decide what they can run
… if this is something we can provide them instead of getting them to run benchmark workloads that are wasteful
… the question is how to express capabilities as numbers, which is difficult in the same way hints are
… we've seen this as relatively successful in the WebGPU context and might be useful here for NPUs
… but unclear which numbers to provide
MikeW: do we need to expose OpsSupportLimits by processing unit?
… (as I commented on the issue https://
<gb> https://github.com/webmachinelearning/webnn/issues/902
reillyg: this would be very helpful, but it's not information made available by platforms - e.g. CoreML doesn't provide stats on the capabilities of the NPU
… similar situation in other platforms
… this would be great enhancement to the API
Anssi: how does this relate to #759?
<gb> Issue 759 MLOpSupportLimits should be opt-in with base functionality (by mwyrzykowski) [device selection]
MikeW: they're related but different; as Reilly says, the challenge is making that information queryable
reillyg: for #759, we recently updated the WPT to differentiate required and optional tests to represent that idea of things developers can rely on or not
… not sure this has been reflected in the spec
<RafaelCintron> +1 to hints for now.
reillyg: beyond choosing devices, there is also a scheduling aspect to this
… e.g. if there are real-time vs non-real time workloads running in parallel, helping the UA to schedule with hints would be useful
dom: in addition to permission prompt, there's also discussion about integrating permission management with page embedded permission control (PEPC)
… it doesn't change the discussion, but it changes how this is embedded in the UX so we don't have prompts coming from nowhere
… for a more advanced query API we need to consider it in the context of this new proposal for permission management
markus: if we have hints, how do we validate they work?
reilly: a developer can't measure how their app runs if we don't provide the metrics
… Phillis has a proposal to expose which device the model is running on
CPU fallback hint
Anssi: issue #900
<gb> Issue 900 CPU fallback hint (by anssiko) [device selection]
Anssi: the group has explored a "CPU fallback" hint, a flag to expose to web content whether a CPU fallback mechanism is active
… spun off from the "accelerated" hint, a feature discussion that landed
MarkusH: we have use cases where knowing if the workload will be accelerated is critical to deciding whether to run it or not
… we would want to abort if we detect cpu fallback before or after compilation
… before would help save download costs
reillyg: the previous discussion about OS support for GPU/NPU devices is helpful here
… in general, the answer to "is CPU fallback active" before compilation is always "yes"
… it's always supported
… how do we help developers determine whether to use a faster vs. a better model based on GPU availability?
Ningxin: we should distinguish "cpu-only" vs "cpu fallback" - the latter is always available
… what you want here is to avoid accelerated=false
… we can set context.accelerated=false if we detect the GPU/NPU won't work
reillyg: one question is "do you have a GPU/NPU?" - if not, this means we're in a CPU-only situation
… if it's about fallbacks - do we want to provide an option to fail compilation if it will end up running on CPU? but that only works after compilation, which you want to avoid
reillyg: we should clarify that the issue is about detecting whether a GPU/NPU is available - for the pre-compilation situation
<mgifford2> It's not just if a device has the GPU/NPU but if a user wants to have the LLM run on their device. It may be a matter of user preference, but also energy usage. Users may be happy running a GPU/NPU in some locations or times, and not others based on things like local energy costs or reliability. Just battery life as well.
ErikA: similar to the discussions in WebGL/WebGPU
MarkusH: we can always try it and check whether it runs well in real time
ErikA: in WebGL, you can create a context that makes it fail if you hit performance challenges
<ErikAnderson> For context: https://
reillyg: we should start simple "can it run fast at all", and look at more detailed evaluation in a later phase
Tarek: I had similar questions around concurrency: if the existing accelerated hardware is already in use, should that be exposed to the app?
reillyg: a given app might run separate models/graphics rendering in parallel - we should help the app negotiate to figure out which workloads to run where
Tarek: so the orchestration might happen on both sides?
reillyg: right - an app might have more workloads to run than are runnable in parallel on a given system
<gb> Issue 900 CPU fallback hint (by anssiko) [device selection]
MarkusH: another aspect is time-sensitivity: a video frame needs to be processed in real time, while the answer to a chat-bot query to an LLM is much less time-sensitive
MarkusH: a boolean flag on whether it is accelerated is probably a good enough starting point
Get device selection information after graph compilation
<gb> Pull Request 854 define graph.devices (by philloooo) [device selection]
<gb> Issue 836 Get devices used for a graph after graph compilation (by philloooo) [device selection]
Anssi: the group thinks we need the following two to advance:
… 1) strong use cases
… 2) check the API design is privacy preserving
Anssi: MarkusH from Google Meet shared his key use case, adaptation:
… `graph.devices` could help identify:
… a) what resources a misbehaving model is using, and
… b) which models are candidates to stop to help the situation
Anssi: MikeW commented:
… "Another way of achieving the same thing is the web app sorts its workloads in priority, terminating lower priority ones (1). Or some type of metric reporting that the model was stalled K ms waiting to run due to other work on the system and took S ms to complete (2)"
MikeW: the problem is that the information on which device has been selected isn't static
… a workload that has run on a GPU may run on the NPU the next run, or fallback to CPU
… I can see the value in expressing that the graph can run on an accelerated unit, but reporting the last device on which it ran is not very reliable
reillyg: is there still value in reporting on which devices the workload might run? e.g. GPU or NPU; would that be good enough for applications?
MikeW: could we just return that it can run accelerated vs a specific device value?
… the distinctions on specific hardware types are evolving, and it's not obvious it's needed for the app
MarkusH: I think that could work
Erik: an app author might want to know how much of the workload ran on which unit
… I'm not sure the proposal in the PR provides enough reliable context
dom: two aspects: things you want to operate on and course-correct live, and things you want to monitor to know whether to modify the system later; is a separation-of-concerns approach appropriate here?
MarkusH: when we detect that we're not operating in real-time compatible ways, we need to take action
… the proposal in the PR could help; an accelerated flag would probably suffice
phillis: if it's hybrid, what do we report?
dom: would an enum be actionable?
markush: in practice, it would depend on how much it runs on the CPU
reillyg: there is a cost in using multiple units (even GPU + NPU) - so maybe "hybrid" is worth reporting in general
… at some point, some of the performance detection can only be done by the app developers
RESOLUTION: Phillis to refine the proposal to reflect an accelerated status, with discussions on hybrid still TBD
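For illustration, a hypothetical sketch of the refined proposal; graph.accelerated is an invented name pending the updated PR:

// Hypothetical: check after compilation whether the graph ended up accelerated.
const graph = await builder.build({output});
if (!graph.accelerated) {
  // e.g. skip a real-time background-blur effect rather than glitch on CPU
  disableBlurEffect();  // app-defined fallback, hypothetical
}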
<mgifford2> How much of this is inherent to hardware design? Will switching costs matter in 5 years? Probably. What influence might the W3C have in the future of what this technology makes available?
MLOpSupportLimits
Anssi: issue #759
<gb> #759
MikeW: we should define limits that are supported across all devices
reillyg: we have this in the tests; a goal for us implementation-wise is to make sure our implementation can implement all operators, and that those operators that can't be implemented are made optional
Customer feedback & collaborations
Anssi: customer feedback, including from end-users, frameworks, and independent software vendors, is extremely important throughout the process of developing new Web APIs, starting with use case identification, requirements gathering and hands-on feedback from early adopters, all the way to the maintenance phase when large-scale deployment happens
… we have used a dedicated community-maintained repo, WebNN Awesome, to document various signals from customers and developers at large
webmachinelearning/
Anssi: I recognize many customers are not comfortable speaking publicly about their future products' use of the WebNN API at this time, so I ask for sensitivity in this regard
… that said, we have some brave early adopters who have worked with us in public
… kudos to the Google Meet team and Markus in particular for sharing feedback, reviewing our proposals and also submitting new feature requests for consideration
RTC-style workloads with response time requirements
Anssi: issue #898
<gb> Issue 898 Support for workloads with response time requirements (realtime) (by handellm) [Agenda+]
Anssi: Markus provided customer feedback from Google Meet product where RTC-style workloads have strict response time requirements
… the assumption is that the system, while not under load, is able to execute the workload
MarkusH: I see a future where we run more and more concurrent ML workloads on our system
… if the system can't detect what's real-time or not, it may not be able to orchestrate it
… e.g. audio processing needs to be run within certain time requirements at the risk of audio glitches or robot voices
… if we can't rely on these deadlines being respected, this creates an adoption blocker
… the same is true (with a different scale) for video processing
… also, there is prioritization - not all audio processing may be as critical
… we've also documented situations of misbehaving concurrent workloads
markusT: it feels like a hard problem to address in general
… e.g. ONNX runtime doesn't have a sense of real time
… tasks get queued, so if it gets queued behind a slow task (e.g. an LLM request), you can't really accelerate this
… not sure there is a prioritization mechanism on all types of devices
… even getting this orchestrated on native is hard, because the frameworks don't support the infrastructure you would need to execute properly
Tarek: do we really want to do that in WebNN?
… we're starting from this situation of wanting to run concurrent workloads via a background utility
… should this be done by the app or in the backend? does it even make sense to run several things on a GPU?
MarkusT: do you know how much of the available window the task will take?
MarkusH: on CPU, this is a solved problem with OS priorities
… when workloads get interleaved on GPUs, there is an opportunity for prioritization
MarkusT: pre-emption is now available on GPUs, but it is much more expensive than on a CPU
… but overall, this gets us back to my device selection issue
… e.g. audio processing you probably want on CPU, where the data is anyway
… conversely, video processing happens on GPU, and you'd want to use the device used to render the video as well
MarkusH: audio processing might be best run on NPU for power efficiency
MarkusT: that's where it's useful to know which devices are available, and to support benchmarking
RafaelCintron: hints could communicate priorities, that would map to processing queues
dom: a reflection: this need to orchestrate processing across latency and power efficiency comes up around many APIs on the web platform, and each time we create a hint we should align across APIs
… you don't want to switch from one GPU to another GPU, you want continuity, but how to describe that in a declarative way is the question
… these are more general questions; Google Meet is a good use case to look at RTC-application requirements
MarkusH: I was discussing with Youenn the concept of worker priority that was proposed by Intel a couple of years ago
… e.g. an "audio" worker priority would be exposed to WebNN
… and influence how that job would run
Web Worker Quality of Service breakout presentation at TPAC 2023
Implementation plans and trials
Anssi: in this session we'll discuss and share the latest news on the next step for implementations, Origin Trial or equivalent, new requirements or feedback from updates in backends, frameworks
… but first, we kick off with exciting demos to whet your appetite!
[showing running WebNN Stable Diffusion Turbo both on GPU and NPU]
[showing WebNN Segment Anything demo running on GPU and on NPU]
[showing WebNN Whisper Base on GPU and NPU]
[showing Background removal based on MODnet, demo hosted on hugging face, on GPU & NPU]
[showing real-time object detection w/ Yolo12n]
[showing real-time depth estimation w/ Depth Anything v2]
[demo of background blur done with WebNN on a full WebGPU pipeline, with 23% improved performance and 17% lower power consumption]
Anssi: thank you Ningxin for these compelling demos!
Browser vendors' trials
Anssi: we've discussed on our telcons that Origin Trial in Chrome is getting closer, latest discussion:
https://
Anssi: we also discussed that Edge works upstream with only a small 5-10 day delay and will launch an Origin Trial in sync
… more information about Origin Trials will be made available at:
reillyg: I imagine this landing in the 2nd or 3rd release of the new year for the Chrome OT
Horizontals
Anssi: in this session we get to know experts behind horizontal groups
Ethics
Anssi: for ethics, I've proposed to make this group's Ethical Principles for Web Machine Learning a joint deliverable with the Web & AI Interest Group, a group that is being proposed
… by doing this, we can tap into the expertise of that Interest Group to help advance this important deliverable on the W3C Note track
… we currently refer to this doc in the WebNN spec Ethical Considerations
https://
https://
dom: ethics has not received a lot of bandwidth, which is why we propose to make it a joint deliverable with the Web & AI IG; the document was written in 2022-23, a long time ago considering the rate of development in the AI space
dom: also, the Ethical Web Principles has been endorsed as a W3C Statement
Sustainability
Anssi: I've asked Mike Gifford, co-chair of the Sustainable Web IG to talk about the work done in that group
https://
Anssi: per TAG review feedback, we're expected to evaluate sustainability impact of WebNN, see issue #861
<gb> Issue 861 Evaluate sustainability impact (by anssiko) [tag-needs-resolution] [Agenda+]
<mgifford2> https://
<mgifford2> w3c/
<gb> Issue 139 Adding a comment about what is or isn't included with AI (by mgifford) [enhancement] [editorial]
MikeG: we're in an environmental and climate crisis, which we need to integrate into our work
… the goal of the Web Sustainability Guidelines is to create a Web standard that other institutions can use as guidelines to evaluate how sustainable their technologies are
… the size of the average Web page has grown beyond the scale of the information it's providing
… which relates to Web performance, although we have considerations that are completely separate - e.g. water consumption
… we're here because AI is changing a lot for the Web; we're seeing agentic browsers, the rise of AI in everything whether you want it or not, with a huge environmental impact, with data centers' growing impact on electricity, water, sound
… we're interested in evaluating the overlap between our groups: the environmental impact of decentralizing AI inference to devices, given their lower optimization compared to data centers
… we have a few questions:
… - what advice can you give us as we're starting to write up guidance on AI?
Anssi: what is the best way for participants of this group to help with this? github repo?
mikeG: yes
Anssi: are there AI-related issues we can help with? initially AI wasn't really part of the scope as I understand it
MikeG: right - we can't not address this given the impact of AI
… our guidelines are expected to address different contexts and audiences
… we're not sure yet on whether to include AI in a cluster or distribute it across the document
Anssik: for this group to help, having AI-focused content would make it easier
MikeG: we have lots of infrastructure to help navigate the guidelines and issues through a well-defined taxonomy, which will help with this
… there is also the question of data centers on which we could use expertise from people here
dom: sustainability is currently an IG that's working on a Note, and the direction is toward a horizontal group
… the horizontal definition is more of a cultural one; ethical web considerations tell us to consider sustainability
MikeG: how does your group deal with a fast-evolving ecosystem such as AI?
Anssik: we try to find the right level of abstractions that stand the test of time, as Web standards have tried to do
… similar to the discussion about to what extent the NPU/GPU distinction matters
<AlexDawson> W3C already has Societal Impact self-review, there is scope for a potential self-review for sustainability in the future.
reillyg: we also depend on what developers will want to use to provide the best UX
… so we're more reactive, where the sustainability work would be more proactive in pushing in a given direction
MikeG: aligning incentives towards good sustainability is a key challenge we face
… Small vs Large Language Models: the former seems more environmentally friendly; but will that distinction remain relevant over time?
Anssi: the Mobile Web Best Practices document had that very issue
Tarek: re SLM, at Mozilla the definition we used a year ago no longer works today
… we're looking at device tiers: non-capable, device with certain capabilities, high end devices
… we've found that more robust over time
MikeG: any suggestion on how to classify models instead of devices?
Tarek: anything that doesn't spit out a continuous stream of tokens is an SLM
Sushanth: we put the boundary at 7B parameters
Tarek: but it's at risk of changing in a few months
Thomas: I don't think the # of parameters belongs in a guideline context: the guideline should be about "use the smallest possible model"
… with the caveat that an already-available on-device model might be a better option
Anssi: re model selection, my mental model is "the right tool for the right job", but it is a complicated evaluation to make given the diversity of toolboxes available to people
Privacy and Security
Anssi: late last year, the Privacy Working Group was launched, replacing the Privacy Interest Group
… what's new in this transition?
https://
Tara: the transition of the IG to WG hasn't really changed much in terms of the review work
tara: Simone and I are going to run a joint presentation
anssik: I always struggle a bit to delineate between privacy and security
tara: they do have a lot in common
… we have specialized guidance, but this shouldn't be a source of concern on your end
Anssi: Security Interest Group was recently launched to reinvigorate work to advise groups developing standards on how to avoid and mitigate security issues
Slideset: https://
ThomasN: re fingerprinting, one of the perennial questions that keeps popping up is that there is already so much entropy that it's not obvious how much of it can still be mitigated
… is this a tractable problem, and something worth spending time mitigating at the spec level?
Tara: I think we're pushing towards a better space and so we feel it's worth considering the trade-offs that keep that path open
ThomasN: it's hard to evaluate the cost of developing the API and, more importantly, the ability of developers to fulfill their use cases
christianliebel: the APIs we build in the CG/WG are on-device
Wrap up
Anssi: thank you everyone for your active participation and productive discussions
… this day was packed and we managed to finish with gusto!
christianliebel: how different would a trusted execution environment in the cloud be from the security/privacy perspective?
Anssi: special thank you to our guests Mike, Tara, Simone, who joined to share important work happening across horizontals
… also huge thanks to Ningxin & team for the case study and compelling demos that both inform our future direction and demonstrate the exciting web experiences we already enable today with WebNN
… interested folks are welcome to join us for a dinner
… we're quite many, so the plan is to meet in the Portopia Hotel (adjacent to the Kobe International Conference Center) lobby at 18:15 to coordinate transport and restaurants, likely splitting into multiple groups based on preferences