Meeting minutes
Repository: webmachinelearning/webnn
Anssi: please welcome the latest new participants who joined the WG:
… Benjamin VanderSloot from Mozilla
… Dominic Farolino from Google
… welcome to the group, Benjamin and Dominic!
Incubations
https://
https://
Anssi: first, the WebML Community Group transitioned the WebMCP proposal from the explainer to the Community Group spec draft stage and published the initial draft:
https://
Anssi: the group will now port over content from the explainer to formalize the proposal
Anssi: second, an initial proposal for W3C WebML CG and Agentic AI Foundation coordination was reviewed and discussed
… third, the group resolved to evolve the declarative proposal to expose tools via HTML in parallel with the imperative WebMCP API
… for Built-in AI APIs, we welcomed new editors on board, with Reilly and Ehsan picking up this responsibility for the Translator and Language Detector APIs
… we also discussed implementers' priorities for 2026 to inform the group
… Mike shared Chrome's biggest focus for Built-in AI APIs is the Prompt API
… lastly, we resolved to shift the WebML CG call forward by one hour
… we will keep this WebML WG call at its current timeslot in appreciation of our PRC, Japan and APAC participants who already join at very late hours
HTTP Archive’s annual state of the web report for Generative AI
https://
Anssi: it is worth noting this report discusses the pros and cons of cloud vs local inference in its technology overview
https://
Anssi: thank you to the team who produced this extensive report
Candidate Recommendation Snapshot published
Repository: webmachinelearning/webnn
WebNN API spec release history
Anssi: on 22 January 2026 we published a new Candidate Recommendation Snapshot (CRS)
… since the previous major WebNN CRS publication (11 April 2024) we have made over 100 significant changes
… this WebNN "v3" milestone release added new ops and data types, improved API abstractions, developer ergonomics and interoperability, added new horizontal considerations, and more
… the group received kudos for its work with horizontal groups and topics: privacy, security, sustainability, ethics
… the GitHub CI/CD is now configured to publish new Candidate Recommendation Drafts
… and the spec editors are welcome to proceed to merge any open PRs that were awaiting, as usual
… huge congratulations to the group for this significant publication milestone!
Christian: per HTTP Archive’s annual state of the web report, interest in client-side AI is growing massively
… WebLLM and Transformers.js use has grown sharply
Accelerated context option implementation feedback
Anssi: issue #911
<gb> Issue 911 accelerated should be prior to powerPreference for device selection (by mingmingtasd) [device selection]
Anssi: we will discuss new implementation feedback and seek consensus on the proposed spec change to add "no-acceleration" to the powerPreference enum
Anssi: for implementation feedback, we have new information from rustnn and Chromium
rustnn/
https://
Anssi: for rustnn, Tarek implemented backend selection based on the current spec using two hints passed to createContext:
… boolean accelerated = true;
… enum MLPowerPreference { "default", "high-performance", "low-power" }
… for Chromium, Mingming prototyped the new proposed "no-acceleration" value for MLPowerPreference enum without accelerated boolean:
… enum MLPowerPreference { "default", "high-performance", "low-power", "no-acceleration" };
Anssi: I believe both implementations expose the MLContext.accelerated boolean and we have an emerging consensus on that part
… the hints provided at createContext time are still being discussed
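For illustration, a minimal JavaScript sketch of the two context-creation shapes under discussion; the option names follow the rustnn and Chromium prototypes above, and the final API may differ:

    // Current spec shape (rustnn): separate accelerated hint plus power preference.
    const ctxA = await navigator.ml.createContext({
      accelerated: true,
      powerPreference: 'high-performance'
    });

    // Proposed shape (Chromium prototype): fold the fallback case into the enum.
    const ctxB = await navigator.ml.createContext({
      powerPreference: 'no-acceleration'
    });

    // Both prototypes expose whether acceleration was actually obtained.
    console.log(ctxA.accelerated, ctxB.accelerated);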
webmachinelearning/
<gb> Issue 911 accelerated should be prior to powerPreference for device selection (by mingmingtasd) [device selection]
Anssi: in issue discussion, Zoltan notes "no-acceleration" would be a good context option but not a power preference per se
… WebGPU/GL use powerPreference and I believe that is what developers expect to see in similar APIs that interface with hardware, such as the WebNN API
… per Priority of Constituencies principle, I would suggest we consider developer ergonomics over theoretical purity in this case and do not rename powerPreference even if we choose to include "no-acceleration" in this enum
… we can explain the naming issue in the spec prose
webmachinelearning/
Anssi: Bryan comments that MLDevicePreference 'high-performance' is too ambiguous for WebGPU interop scenarios
… Bryan's problem statement:
… "In a hybrid system, WebNN might resolve to the NPU while WebGPU resolves to the dGPU. Since the adapters won't match, interop won't work."
… the feedback summary is that a powerPreference enum alone is insufficient
Rafael: I think there's unfortunately diversity in the ecosystem, multiple adapters, hybrid adapters; I think it is important that WebNN and WebGPU agree
… WebGPU "high-performance" and WebNN "high-performance" should pick the same adapter
… in the past we used to have GPUDevice, when you make a context you specify you want to do WebGPU interop with this device, I think we removed that from the Chromium implementation
… I'm personally OK with having a "no-acceleration" value in the powerPreference enum, and can also live with it being its own boolean via fallbackAdapter, in which case powerPreference is ignored
… WebGPU and WebNN selections should agree so they stay consistent
… no way to rationalize what happens if WebGPU and WebNN pick a different adapter
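To make the interop concern concrete, a sketch of the pattern under discussion using today's WebGPU adapter request and a WebNN context created with the same preference; whether the two requests resolve to the same physical adapter is exactly the open question:

    // Matching preferences do not yet guarantee a matching adapter.
    const adapter = await navigator.gpu.requestAdapter({ powerPreference: 'high-performance' });
    const gpuDevice = await adapter.requestDevice();
    const mlContext = await navigator.ml.createContext({ powerPreference: 'high-performance' });
    // On a hybrid system WebNN might still resolve to the NPU while WebGPU picks the dGPU,
    // in which case zero-copy interop between the two will not work.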
Ningxin: Mingming's idea is to explore Rafael's proposal and see how it can simplify the mapping to ONNX Runtime backend, ORT has its device selection policy with its own enum values
Anssi: does the Chromium prototype map directly to ORT policy?
Ningxin: makes it easier to map to it
Anssi: how about other JS frameworks?
MarkusH: I'm a bit late to this issue, I think for audio and video real-time inference, it is important to spec options so that you don't get executed on HW that could be shared with other apps on the system
… there's a case for accelerated video where we accept any device, not just the fastest one
… for the video case we could not get executed on the CPU; the latest comment from Rafael is to use the same preferences over WebGPU and WebNN, maybe you can create a WebGPU context from a WebNN context if the values need to match
MikeW: the option seems reasonable in general, but the name could use some thinking, sometimes the least power-consuming, fastest, or most optimal device is the accelerated device; maybe a name like "fallback" would be good instead
Markus: maybe independent of power options, ML is a huge pipeline, and looking at all the APIs involved, Web Audio, WebGPU, shouldn't we have an interface that says give me the best end-to-end optimized device?
… if you want to do audio, maybe WebGPU is the worst thing to do
Markus: we are thinking whether the developer is able to derive some hints from where the input comes and where the output goes, e.g. with input from WebGPU we never want to do CPU post-processing unless on a UMA system
… mem copies take so much time
Markus: that sounds like our TPAC discussion regarding worker QoS, someone from Apple and Google suggested a way to describe this to the entire pipeline; what are the concrete options for that, a new Web API or massaging this into WebNN?
… create options such as inMediaStream where you retain the context of the entire system
Rafael: Markus from NVIDIA has an interesting suggestion, we had a similar idea when we passed GPUDevice to context, we could go back to that design, but in that case you pass the GPUDevice and powerPreference and they may all disagree among themselves
… web developers perhaps should organize themselves
… when you make the context all tensors are in the same domain, some want to interop with WebGPU, some want to interop with Web Audio, that'd require two contexts with different interop requirements
… would web developers be fine with those tensors attached to different contexts not being able to access each other?
Markus: there's always a mem copy if you swap between contexts; this would give web developers control, but do we want to give that control?
Model Cache API
Anssi: issue #807
<gb> Issue 807 Caching mechanism for MLGraph (by anssiko) [question] [feature request]
https://
Anssi: the group has done initial investigation and drafted a proposal for an explicit API for caching compiled graphs
… this ahead-of-time (AOT) compilation in particular benefits slower end-user devices
… the proposed API is documented in the explainer
… this reduces the overhead of repeated compilation
… I want to understand whether 2026 is the right time to reinvigorate this discussion on the Model Cache API
… also, I want to understand if we have new implementation experience that should be shared with the group and documented in the explainer
… from our past discussion, I recall we discussed how to make this work with Chromium sandboxing and storage partitioning constraints
Ningxin: we have this as a priority for this year, web app developers and vendors are interested in this feature
… AOT compilation will be an alternative way to solve this issue
… from the platform API perspective, we've worked with the Windows ML team, there's an API to register a callback when you compile a model; through that you get a compiled blob, which is useful for implementing this in Chromium, passing the compiled blob from the sandbox to storage management
… we plan to experiment with the API
… for loading, we will investigate if we can let the browser process save the model file, pass a duplicate handle to the GPU process, map that model into memory, and use the Windows ML ORT API to load the compiled model
… these two APIs are available and we plan to experiment with a prototype and report back to the group
… last year we shared initial investigation results from our non-sandboxed implementation, writing to a file directly, to get initial performance data, with the new architecture we expect to see better performance
… the alternative design from Rafael, buildAndSave(), was discussed earlier; to understand the memory footprint impact, we will explore this further
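As a rough illustration only, a sketch of what an explicit graph-caching API could look like from the web developer's side; the loadGraph()/saveGraph() names are hypothetical placeholders, and the explainer also considers the buildAndSave() alternative mentioned above:

    // Hypothetical sketch; the actual API shape is still under discussion.
    const context = await navigator.ml.createContext();
    let graph = await context.loadGraph('my-model-v1');      // cache hit: skip compilation
    if (!graph) {
      const builder = new MLGraphBuilder(context);
      const x = builder.input('x', { dataType: 'float32', shape: [2, 2] });
      const y = builder.relu(x);
      graph = await builder.build({ y });                     // ahead-of-time compilation
      await context.saveGraph('my-model-v1', graph);          // persist the compiled graph
    }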
Floating point accuracy for sin and cos with range constraint
Anssi: WebNN issue #914
<gb> Issue 914 Floating point accuracy for sin and cos with range constraint (by philloooo) [operator specific]
Anssi: I want to discuss the proposal to define accuracy for sin and cos
… Phillis reports the WebNN spec does not define the accuracy of floating point operations
… she identified this while running WPT, which caused failures on the TFLite GPU backend
… she also notes the WebGPU Shading Language defines accuracy for some built-in ops, including sin and cos, but limits it to the range [-PI, PI]
https://
Anssi: Phillis offered two possible options for the group, verbatim:
… a) align with the WGSL spec, so no guarantee for inputs outside of [-PI, PI]. And update the WPTs.
… b) using the periodicity of the cos/sin function, perform range reduction by emulation before calling cos
… Ningxin SGTM'd option A
… also Dwayne seems to be fine with A
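For context, a minimal sketch of the range reduction idea behind option (b), written as scalar JavaScript; in a WebNN backend this would be composed from element-wise ops rather than scalar math:

    // Map x into [-PI, PI] using the 2*PI periodicity of sin/cos, so the
    // backend's limited-range accuracy guarantee applies to the reduced value.
    function reduceToMinusPiPi(x) {
      const twoPi = 2 * Math.PI;
      return x - twoPi * Math.round(x / twoPi);
    }
    // Math.cos(x) and Math.cos(reduceToMinusPiPi(x)) agree up to floating point error.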
Anssi: currently WPT contains tests for tolerances but the WebNN spec does not include any language about it
… Dwayne documented the WPT tolerances in GH issues a few years ago and I believe this is what is currently codified in WPT:
webmachinelearning/
<gb> CLOSED Issue 265 WPT tests tracker (by BruceDai)
webmachinelearning/
<gb> CLOSED Issue 338 WPT tests tracker / 2 for remaining ops (by BruceDai) [testing]
Anssi: Dwayne's suggestion is to first verify the WPT results across all current backends
… and after this, put them into the spec
Dwayne: that is what I recommend
<ningxin> +1 to Dwayne's proposal
Markus: option A is preferred for fast execution
Anssi: I hear the group prefers option A and Dwayne's proposed next steps
The minimum data type set
Anssi: issue #853
<gb> Issue 853 The minimum data type set (by huningxin) [operator specific]
Anssi: the context of this issue is "the minimum data type set" represents the data types implementable across all Chromium backends
… in this investigation, we have identified certain inconsistencies (some of which may be implementation bugs to be fixed later) and I'd like the group to weigh in on our preferred approach
… do we want the opSupportLimits API to:
… a) report things exactly as-is from the underlying backend, even if inconsistent
… b) report predictable results, even if that sometimes requires us to normalize the outliers
Dwayne: if it's cheap, like with reshape, option B is preferred
Anssi: any objections to proceed with option B?
<ningxin> +1 to option b
[no objections heard]
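For reference, a sketch of how a developer reads these limits today via opSupportLimits(); the property names follow the current spec draft and the exact shape may still change:

    const context = await navigator.ml.createContext();
    const limits = context.opSupportLimits();
    // With option (b) the reported data types are normalized to be predictable
    // across backends rather than echoing each backend's quirks as-is.
    console.log(limits.reshape.input.dataTypes);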
Support flexible input sizes
Anssi: issue #883
<gb> Issue 883 Support flexible input sizes (by huningxin) [feature request] [operator specific] [Agenda+]
Anssi: we resolved at the Kobe F2F to do more prototyping before specifying a solution to flexible input sizes
… Tarek built a webnn-graph tool that converts ONNX dynamic graphs into WebNN static graphs, and then back to ONNX for execution via the ML graph builder
Anssi: Tarek reports this worked well for models such as MobileNet and a sentence-transformers model all-MiniLM-L6-v2
… but it did not work for text generation models that rely on dynamic shapes, common in modern LLMs
… the core issue is constant folding that bakes in a fixed input size for LLMs
Anssi: Ningxin notes that with ORT, operators with dynamic input sizes currently fall back to the Wasm EP, which causes suboptimal performance
… a workaround is to prepare static models for prefill and decode stages and set sequence length to max value
… cons: higher mem usage, 2x compile time, requires padding to max length, complex to deploy multiple static models
… Tarek proposed to allow optional symbolic dims in MLOperandDescriptor.shape, to be bound at compute() time, and an updateSlice op for KV cache writes at decoding time
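For illustration, a hypothetical sketch of the symbolic-dimension idea; the string dimension name and its dispatch-time binding shown here are assumptions, not a settled API:

    // Hypothetical sketch only; symbolic dims are not yet in the spec.
    const context = await navigator.ml.createContext();
    const builder = new MLGraphBuilder(context);
    // Declare a symbolic sequence-length dimension at build time.
    const ids = builder.input('input_ids', { dataType: 'int32', shape: [1, 'seq_len'] });
    // ... build the rest of the graph ...
    // At inference time 'seq_len' would be bound to a concrete size, e.g. through
    // the shape of the tensor passed to dispatch():
    const inputTensor = await context.createTensor({ dataType: 'int32', shape: [1, 128] });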
MLDynamicDimension Chromium prototype
Anssi: Tarek had a conflict with this meeting, but I discussed with him and I'm proxying his feedback:
… "Ningxin's change on #883 is really promising, my feedback to the group is that I am planning to implement it on rustnn this week to try it out and come back with the results. If it works well I'd be +1 to do this."
Ningxin: an open question in the comment is whether we should introduce runtime shape-related operators, for example a shape operator that puts the input operand's shape into an output operand; there are also e.g. dynamic-shape ops
… dynamic reshape allows another input operand whose value, the new shape, can only be known at inference time
… for the second model, dynamic slice takes start and end whose specific values are only known at inference time; the ONNX spec has these ops
… for WebNN, introducing dynamic reshape/slice means the shape is unknown at build time, thus we cannot do validation at build time
… an open question to the group is whether we consider having those dynamic* operators or have the native ML runtime do the shape inference
… another option is to explore whether we could run these LLMs without those ops, embedding position ids as constants and slicing the piece of the model to use; without dynamic slice we must use JS to do the slice on the CPU side and feed it into the input operand
… this makes the conversion and tooling more complex, this is a tradeoff
… group's feedback welcome
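To make the trade-off concrete, a hypothetical sketch contrasting today's static reshape/slice with the dynamic variants under discussion; dynamicReshape()/dynamicSlice() are placeholder names, not proposed spellings:

    const builder = new MLGraphBuilder(await navigator.ml.createContext());
    const x = builder.input('x', { dataType: 'float32', shape: [1, 256, 768] });

    // Today: the new shape and slice bounds are literals, validated at build time.
    const r = builder.reshape(x, [256, 768]);
    const s = builder.slice(x, [0, 0, 0], [1, 128, 768]);   // starts, sizes

    // Under discussion (hypothetical): the shape / start values arrive as operands
    // whose contents are only known at inference time, so the output shape cannot
    // be validated at build time.
    // const r2 = builder.dynamicReshape(x, shapeOperand);
    // const s2 = builder.dynamicSlice(x, startsOperand, sizesOperand);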