Meeting minutes
Repository: webmachinelearning/webnn
anssik: please welcome Brad Triebwasser from Google to the WebML WG and CG!
Announcements
WebML Working Group Charter Advisory Committee review open until 2025-03-05/06
anssik: WebML WG Charter AC review open until 2025-03-05/06, please reach out to your AC rep and ask them to vote
Locate your AC rep (W3C Member-only link)
Voting instructions for AC reps (W3C Member-only link)
anssik: I can help establish connections to your AC reps
Authentic Web workshop
anssik: W3C is hosting a mini-workshop series to review proposals to combat misinformation on the web
… this is a follow-up to two TPAC 2024 breakouts that discussed Originator Profile and Content Authenticity proposals, these proposals are linked from the invitation:
-> Invitation to Authentic Web mini-workshop series - session 1, 12 March 2025
https://
anssik: this 1-hour workshop is open to all, including the public, so feel free to pass the link to your friends who may be interested
Incubations summary
anssik: I asked Etienne to share a summary of the CG's 26 Feb telcon, thanks Etienne for taking notes
… since the CG meeting time is not very EU-friendly, many EU-based folks, myself included, will happily catch up with the discussions on this call
… Tarek from Mozilla shared he was on vacation so couldn't join this time
WebML CG Teleconference – 26 February 2025 - 00:00-01:00 UTC
Etienne: a summary of the current API proposals in the CG was discussed, covering the Prompt API, Translator & Language Detector API, and Writing Assistance APIs
… Microsoft has contributed to structured output for Prompt API
… Chrome believes that could make it feasible for the open web
Etienne: feedback and concerns about the ai.* pattern, conflicts with minimizers
… the Origin Trial saw 5x more sign-ups than usual, overall great feedback
… Firefox AI Runtime, model caching garnered interest
… sharing models across origins, Kenji has received some feedback for this feature
… will use GH issues for topic tracking in future meetings
Christian: happy to see the minutes and/or summary
Query mechanism for supported devices
anssik: after our last meeting, we decided to remove MLDeviceType as the first phase of our device selection solution
<gb> MERGED Pull Request 809 Remove MLDeviceType (by zolkis) [device selection]
anssik: now, as the next phase, I'd like to continue discussing the query mechanism
… thanks everyone for your contributions in the issue!
… the approach I'd like to try here is to document real-world use cases, then assess implementability, and only lastly go deep into the solution space
… Markus and Fredrik shared a real-time video processing use case, quoting:
… 1. If the user selects to use functionality like background blur, we want to offer the best quality the device can offer. So the product has a small set of candidate models and technologies (WebNN, WebGPU, WASM) that it has to choose between. Accelerated technologies come with allowance for beefier models.
… 2. The model/tech chooser algorithm needs to be fast, and we need to avoid spending seconds or even hundreds of milliseconds to figure out if a given model should be able to run accelerated. So for example downloading the entirety (could be large things..), compiling & try-running a model seems infeasible.
webmachinelearning/
<gb> Issue 815 Query mechanism for supported devices (by anssiko) [device selection]
anssik: the explainer has been updated with this use case, more use cases welcome!
… I derived the following requirements from this use case:
… - query mechanism should be available before model download
… - query mechanism should signal explicitly if the context is "accelerated" for the given model to allow developer-defined fallback to other technologies (e.g. WebGPU, Wasm)
… - query mechanism must signal explicitly if it downgrades the requested "accelerated" context to "non-accelerated" context
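The fallback requirement above can be sketched in JavaScript. Note this is purely illustrative: the `{ accelerated: true }` option and the pre-download query it implies are assumptions standing in for whatever mechanism the WG lands on, not part of the current WebNN spec.

```javascript
// Sketch of a developer-defined fallback chain driven by a hypothetical
// pre-download query. The { accelerated: true } option is NOT in the spec;
// it is a placeholder for the query mechanism under discussion.
async function pickBackend(ml) {
  // Hypothetical: ask WebNN, before any weights are downloaded, whether
  // it can provide an "accelerated" context for this workload.
  const webnnContext = await ml?.createContext?.({ accelerated: true })
    .catch(() => null);
  if (webnnContext) return { tech: "webnn", context: webnnContext };
  // Accelerated WebNN unavailable: fall back to WebGPU, then Wasm,
  // trading down to smaller candidate models as the use case describes.
  if (globalThis.navigator?.gpu) return { tech: "webgpu" };
  return { tech: "wasm" };
}
```

The key property is that the chain completes before any model download, satisfying the first requirement above.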
anssik: position from Mike/WebKit is clear:
… - query mechanism should be based on capabilities, not on device names such as "gpu", "npu"
anssik: comments from Dwayne for data types:
… - Model data type (like float16) can make a big perf difference, and so knowing before model download is useful.
… -- float32: runs on both GPU and CPU, but not on many NPUs
… -- float16: works on NPU. On GPU it offers a ~2x boost over float32, but it is much slower on CPU because CPUs lack dedicated float16 hardware.
… -- int8: works well on CPU and NPU, but GPUs suffer.
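Dwayne's observations explain why the answer is needed before download: the best weight precision to fetch depends on which compute unit will run the model. A toy chooser follows; the device-name strings are for illustration only (Mike's position, noted above, is that the real query surface should be capabilities, not names).

```javascript
// Toy precision chooser based on Dwayne's data-type observations.
// Device-name strings ("gpu", "npu") are illustrative only; the WG
// leans toward capability-based queries rather than exposing names.
function pickPrecision(unit) {
  switch (unit) {
    case "npu": return "float16"; // float16 works on NPUs; float32 often doesn't
    case "gpu": return "float16"; // ~2x boost over float32 on GPU
    default:    return "int8";    // CPUs lack float16 hardware; int8 runs well
  }
}
```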
anssik: known implementation constraints:
… - Reilly notes frameworks/backends don't expose enough information to determine ahead of time whether a given operation will be optimized or emulated
… - Ningxin notes even if an op is emulated by the browser, the underlying framework may still be able to fuse the decomposed small ops into an optimized one, which we can't tell either.
… - Ningxin surveyed backends:
… -- DirectML is always an "accelerated" context
… -- TFLite can check if the context is "accelerated" (fully delegated in TFLite terms)
… -- Core ML always has a CPU fallback
… -- ONNX Runtime allows disabling fallback to CPU
anssik: Reilly noted that from an interoperability perspective, the CPU fallback is considered positive
Resolving tension between interoperability and implementability
anssik: other considerations:
… what is actually "accelerated"?
… Reilly said "I might call a CPU inference engine using XNNPACK "accelerated" since it is using hand-optimized assembly routines tuned for a handful of CPU architectures rather than a naive implementation in C++"
anssik: I'd like to open the discussion, use cases and implementability consideration first
… solutions only second
Zoltan: the summary was good; I'm wondering, since we have conflicting requirements and some platforms cannot avoid fallback, whether we want to go with introspection
… could we start with a generic introspection mechanism and add capabilities that can be implemented?
jsbell: not sure what the next steps are right now
… an introspection API seems quite hard: how to define "accelerated", and per Ningxin's comments, even if the browser implementation thinks something is accelerated it may not be deeper down, or vice versa
Zoltan: can we try adding an "avoid CPU fallback" option?
jsbell: exploring further hints might work; I have discussed with Reilly that if the web app was using a compute unit, e.g. the GPU, we could add a mechanism to be able to say "please don't use this compute unit"
… this does not satisfy the "do not download the model" requirement
RafaelCintron: an interesting point that was brought up is that we should allow the developer to say "avoid GPU"
… in the spec we can give WebGPU context, people wanted that to be a hint "do use GPU" instead
… as for downloading a model being a requirement or not, I wonder if people are happy if we don't download weights, just the graph topology
RafaelCintron: depending on ORT EP we could possibly download the graph without weights
… Reilly's decision tree in the issue is a nice example
… it is important to tell "accelerated" or "not accelerated"
… agree with Mike that it is dangerous to use e.g. CPU and NPU to make assumptions on compute unit characteristics or performance based on names
RafaelCintron: we should talk with Markus more to understand the exact models that he wants to run and see if we can design an API to the frameworks underneath, Google Meet is a great use case to validate the API design against
ningxin: my comment is about the "avoid CPU" or "CPU only" option; Anssi mentioned a multi-stage query mechanism, probably we can allow an application to know if the context is CPU-only before download
… e.g. TFLite is currently CPU-only without delegates
… in such a case a web app might opt in to use an alternative API such as WebGPU
… in the second phase, the developer may want to specify a hint to avoid CPU fallback; in TFLite, DML or ORT we can disable CPU fallback. With the model topology known, if this causes an error, the developer can know
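Ningxin's two-stage idea could look roughly like the sketch below. The `cpuOnly` field and the `disallowCpuFallback` option are invented names used only to make the shape of the proposal concrete; neither exists in the spec or in any implementation.

```javascript
// Sketch of a two-stage query: (1) a cheap pre-download check for a
// CPU-only context, (2) building the graph with CPU fallback disabled,
// so an error surfaces instead of a silent downgrade. All option/field
// names here (cpuOnly, disallowCpuFallback) are hypothetical.
async function buildAccelerated(ml, buildGraph) {
  const ctx = await ml?.createContext?.();
  // Stage 1: no WebNN, or a CPU-only context (e.g. TFLite without
  // delegates) -> return null so the caller can pick WebGPU/Wasm.
  if (!ctx || ctx.cpuOnly) return null;
  // Stage 2: the model topology is known at this point; a failure
  // means the model would not run accelerated on this device.
  try {
    return await buildGraph(ctx, { disallowCpuFallback: true });
  } catch {
    return null;
  }
}
```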
Dwayne: in ONNX format you can split the weights out, but cannot easily download only the topology
… could download protobuf, not sure how to do that with existing APIs
<Joshua_Lochner> I have been able to load the graph topology without downloading the entire model. Unless you're referring to internally?
Joshua_Lochner: I was investigating this recently, downloading all ONNX weights on the Hub to see what ops are supported, for debugging purposes, using Netron
… is this something internal to the runtime, that it doesn't support external data formats?
Dwayne: you can do that, but using existing libraries such as ORT Web it is not possible to download the model without weights
<Joshua_Lochner> yes, exactly
Operator set Wave 3
anssik: PR #805
anssik: huge thanks to Dwayne for all the updates to the PR!
… and Ningxin and Josh for your review suggestions
… the PR is now marked as "ready for review", everyone PTAL!
… recent updates include:
… - GatherND example updates
… - gatherND and scatterND algorithm steps
… - Update slice algorithm steps for strides
… - Update blockwise broadcasting to return true/false
… Dwayne, feel free to share what folks should look at in particular in the final review
PR #805
<gb> Pull Request 805 Operator set wave 3 (by fdwr)
Dwayne: thanks Ningxin and Joshua for all the feedback!
anssik: I'm planning to initiate TAG review soon and this op set Wave 3 is one important piece of that review scope
… I support the idea of splitting u/int4 into a separate PR, we could initiate the TAG review without blocking on it
Dwayne: could move forward with TAG review without u/int4
<jsbell> Bikeshed is reporting errors on latest changes https://
<ningxin> I'll take another look, thanks so much!
Rounding operators
anssik: issue #817
<gb> Issue 817 Rounding operators (by fdwr) [feature request]
anssik: this issue opened by Dwayne contains detailed research into known gaps in rounding functions
… Dwayne explains:
"I lacked rounding functions in WebNN, particularly the default IEEE standard round-tied-halves-to-nearest-even. There's not a nice way to simulate this missing function, and it's a basic primitive that belongs in a library anyway. So I propose filling in some gaps"
anssik: there's a detailed rounding modes table with WebNN, IEEE, C++ equiv, JS equiv
… library support, examples, emulation
… based on this research a new roundEven() API is suggested
partial interface MLGraphBuilder {
  MLOperand roundEven(MLOperand input, optional MLOperatorOptions options = {});
};

partial dictionary MLOpSupportLimits {
  MLSingleInputSupportLimits roundEven;
};
anssik: I'd like to see others review the proposal and if no concerns this could be turned into a PR
… thank you Dwayne for documenting and educating us on the rounding function details!
Dwayne: TL;DR: add one function that is consistent with the IEEE rounding mode; it is important for expressing the decomposition of the quantizeLinear operator
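For reference, round-ties-to-even differs from JavaScript's Math.round, which rounds ties toward +Infinity (Math.round(2.5) is 3, Math.round(-2.5) is -2). A scalar sketch of the proposed semantics follows; it illustrates the rounding rule only and is not the WebNN implementation, which would operate on tensors.

```javascript
// Scalar sketch of IEEE 754 roundTiesToEven, the behavior proposed for
// the new roundEven() operator: round to the nearest integer, and on an
// exact .5 tie pick the even neighbor (unlike Math.round, which rounds
// ties toward +Infinity).
function roundEvenScalar(x) {
  const lower = Math.floor(x);
  const frac = x - lower;
  if (frac < 0.5) return lower;
  if (frac > 0.5) return lower + 1;
  // Exact tie: choose whichever neighbor is even.
  return lower % 2 === 0 ? lower : lower + 1;
}
```

This unbiased tie-breaking is why the mode matters for quantizeLinear decompositions: always rounding ties up accumulates a systematic bias across many quantized values.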
ningxin: did you survey three major backends for support?
Dwayne: yes, there's a list of backends in the table