Meeting minutes
<gb> Issue 25 WebML WG - TPAC 2024 agenda (by anssiko)
Repository: webmachinelearning/webnn
Welcome
anssik: this is our first F2F as a WG, despite us having existed for a long time
… I'm Anssi Kostiainen from Intel, chair of the WG, supported by Dom as our staff contact; thanks to the TPAC organizers for making this happen
… great to see both long time participants and new faces
… this WG has now all the major browser vendors as participants, with Mozilla joining recently
… a diverse set of experts around AI, with different areas of expertise and different backgrounds
… incl library makers that help us calibrate our work to real world requirements
… people from industry and research backgrounds - let others know they can get involved
[round of intros]
<Rachel> Rachel Present+ Rachel_Yager
Charter orientation
Anssi: we are 2 groups: the Web Machine Learning Working Group, and its eponymous Community Group
… the WG deliverables include the WebNN API (the core focus of the technical work) and ethical guidelines (a topic on which we will get a presentation from OpenAI)
… and work on the Model Loader API, which is blocked on a standardized model format
… the CG is responsible for incubating proposals some of which may graduate later to standardization
Dom: note that the WG charter is expiring next year, so we'll need to start discussions about potential additions in the next few weeks/months
Anssi: the CG allows us to do exploratory work - that's how WebNN itself started
Anssi: We're also chartered to coordinate with other groups: WebGPU WG (with related topics on our agenda, e.g. MLTensor)
… the WebAssembly WG is also an important related group (with Deepti in particular helping coordinate)
… also important integration questions around WebRTC which Ningxin explored a couple of years ago
… Also working closely with the Technical Architecture Group, which helps us make sure our API fits well in the broader platform
Ethics
Democratizing Human-Centered AI with Visual Explanation and Interactive Guidance
Anssi: please welcome Jay, a safety researcher at OpenAI, who will explain his research on making AI more accessible through novel interfaces
… You're also affiliated with Georgia Tech and have published many papers and open source tools
[demo of https://
anssi: thank you for this very comprehensive presentation
… a specific intersection with our work you alluded to is possible integration of some of these tools in browser developer tools
Kenji: how hard is it to get the model to explain its behavior (e.g. in the GAM Changer example around age/risk)?
Jay: this particular model was a simple regression model where it is easier to identify the particular source of the model behavior
Rachel: why are human hands so problematic to AI image generators?
Jay: the geometry of hands has been really hard for models to capture, but they're improving
McCool: re WebNN, is there any gap related to your work?
Jay: my tools are mostly based on TensorFlow.js
… there may need to be different modalities of input, with different ways of embedding the vectors, to cater to the emerging needs from generative AI
Anssi: thanks again Jay, we hope to work more with you
Spec orientation
<gb> Issue 375 Support for transformers (by dontcallmedom) [v2] [opset]
<gb> Issue 559 Control flow operations: if, while (by philloooo) [opset] [feature request]
<gb> Issue 666 Reconsider `MLOperand` methods (by a-sully) [question]
https://
<reillyg> I propose closing webmachinelearning/
<gb> Issue 11 Executing operations (by anssiko) [feature request]
<McCool> two things that I noticed: int64 issues (relates to interop) and use of constants with MLTensor (enable potentially useful mechanism for model management)
<reillyg> I propose closing webmachinelearning/
<gb> Pull Request 754 Add MLTensor explainer (by a-sully) [webgpu interop]
<gb> Pull Request 541 Add MLBuffer exploration doc (by a-sully) [webgpu interop]
<jsbell> Propose closing this group of "simplify" issues unless someone strongly advocates for them soon: #474, #470, #374, #324
<gb> Issue 374 Simplify `MLPool2dOptions` by removing the `outputSizes` option (by huningxin) [operator specific]
<gb> Issue 474 Simplify `resample2d` op (by huningxin) [operator specific]
<gb> Issue 470 Simplify `matmul` op (by huningxin) [operator specific]
<gb> Issue 324 Simplify the operand layout support of conv2d and pooling 2d operations (by huningxin) [feature request] [operator specific] [interop]
<jsbell> (sorry Ningxin!)
<BryanB> Issue 749 MLContextOptions.deviceType seems unnecessary outside of conformance testing
<gb> Issue 749 MLContextOptions.deviceType seems unnecessary outside of conformance testing (by mwyrzykowski) [device selection]
<reillyg> The most recent ~5 issues don't have labels. I don't have permission to add them.
<Dom7> reillyg: it seems that issue #11 was opened a long time ago and seems like it can be closed
<gb> Issue 11 Executing operations (by anssiko) [feature request]
anssik: no one objecting to #11 getting closed, so let's do it
reillyg: there are two open pull requests on the MLTensor space; can we close the generic one and keep only the specific one? can we land the PR for the explainer?
Austin: I'll close #541
<gb> CLOSED Pull Request 541 Add MLBuffer exploration doc (by a-sully) [webgpu interop]
anssik: we should look at merging the explainer after our MLTensor discussion later today
jsbell: propose closing this group of "simplify" issues unless someone strongly advocates for them soon: #474, #470, #374, #324
… we should either do them soon or abandon them
ningxin: I think we can close #324
reillyg: in the coming implementation, I've added automatic transposes
… I think we can do without that particular simplification
anssik: so let's close #324
<gb> Issue 324 Simplify the operand layout support of conv2d and pooling 2d operations (by huningxin) [feature request] [operator specific] [interop]
<Domenic> (Has 2d vs. 2D been discussed? https://
anssik: re #470, do we need to retitle it? open a different issue?
<gb> Issue 470 Simplify `matmul` op (by huningxin) [operator specific]
ningxin: I'll retitle it to reflect its status
dwayner: I'll open a new issue instead, linking back to that one
<BryanB> I think we can close the old MLBuffer PRs in favor of the MLTensor explainer: #542 #543 #544
<gb> Issue 544 [MLBuffer] Support for MLBuffer in graph execution (by bbernhar)
<gb> Issue 543 [MLBuffer] Uploading/downloading tensor data (by bbernhar)
<gb> Issue 542 [MLBuffer] Creation and representing MLBuffer on a XPU devices (by bbernhar) [webgpu interop]
dwayner: re #374, we should probably align pool with conv - I'll propose next step in the issue
<gb> Issue 374 Simplify `MLPool2dOptions` by removing the `outputSizes` option (by huningxin) [operator specific]
ningxin: closing #474 SGTM
<gb> Issue 474 Simplify `resample2d` op (by huningxin) [operator specific]
New features
A refreshed analysis of popular models
<gb> Issue 375 Support for transformers (by dontcallmedom) [v2] [opset]
Slideset: https://
33 Models, 12 Operators, proposed IDL, data types
<gb> Issue 375 Support for transformers (by dontcallmedom) [v2] [opset]
<gdti> Quick note: Memory64 for Wasm is available both on Chrome & Firefox nightly behind a flag for the last year or so - they're really close to being enabled by default and the proposal is stable just pending a phase 4 poll on closing out the last few spec issues
<gdti> We'd love folks to try it out and let us know if something isn't working as expected
ningxin: we already have some implementation experience with these
<Zakim> anssik, you wanted to propose next steps for #375 and to discuss priority of #559 and to bump the priority of #666
<gb> Issue 375 Support for transformers (by dontcallmedom) [v2] [opset]
<gb> Issue 559 Control flow operations: if, while (by philloooo) [opset] [feature request]
<gb> Issue 666 Reconsider `MLOperand` methods (by a-sully) [question]
dwayne: it has informed some of the proposals
anssik: I'm interested in general feedback on this approach to wave 3 operators, given current prototyping efforts
ningxin: scatterND helps with performance, not just functionality
jsbell: I ♥ the wave nomenclature - it may be useful to use it for our issues in the repo
… how far are we along for the implementation? any sense of when impl/spec will be ready to advance to origin trial?
dwayne: for ops completeness in the DML backend, maybe two weeks of implementation work
… I hope to add the ops to the spec in the same timeframe
McCool: how many of these models are actually useful in the Web context?
dwayne: huge models are good to demonstrate viability, but they're so big they're likely not practical to use directly in the browser
McCool: +1
ningxin: they were identified through popularity in transformers.js, so already used in the browser context
NeilT: any consideration of the Arm TOSA operator set?
dwayne: I've looked at it and have the data on it that I'll share
anssik: hearing overall support for the approach; no specific plan for origin trial yet
… in terms of spec, do you expect any specific challenge?
dwayne: it should be pretty straightforward given our experience; a few interesting questions around invalid index conditions
ningxin: int4/uint4 data types will need attention and wide review
Quantization and dequantization
<gb> Issue 93 Add QuantizeLinear and DequantizeLinear for mixed precision (by kpu) [opset] [feature request]
<gb> Issue 128 WebNN should support int8 quantized models (by wchao1115) [v2] [opset] [feature request]
<gb> Issue 623 WebNN should support NPU and QDQ operations (by wchao1115) [v2] [opset] [feature request] [device selection]
reillyg: adding operators to explicitly dequantize int8 and int4 values to float16/float32 makes a lot of sense for the spec
… an open question: do we agree that explicit dequantization is the approach we want to take?
… backends can detect a dequantize/conv2d pattern, so expressing this in the API itself is maybe unnecessary
dwayne: that matches the experience in ONNX
ningxin: quantize/dequantize also makes it easier for the backends to fall back
reillyg: if we expect that backends may strip out q/deQ pairs, how do we want to specify the behavior of the full graph from a precision perspective?
… how useful would the requantize operator be?
dwayne: I have seen hundreds of models using the quantize operator
ningxin: the behavior is up to the implementation, but how do we specify this?
dwayne: does it change the overall precision?
reillyg: in TFlite, there is a top-level flag to control this
jsbell: one other approach we've seen, the scale is attached to the tensor
… I'm assuming we want to be explicit and not pursue this, but want to make sure we know of the options
reillyg: this also leads to a huge explosion of the type system
dwayne: I'll go with explicit
reillyg: if we support quantize/dequantize, what data types do we support? int8/uint8 seem obvious, int4/uint4 are more complicated
… representing int4 on the Web and across backends is challenging
dwayne: from a Web API perspective, they would be exposed as Int8Array
… from an implementation perspective, I'm not sure how to handle int4 as input
reillyg: quantization is most useful for weights; do we know of any need of it for input/output?
McCool: I've seen quantization for activation as well
reillyg: i.e. using it as another kind of an activation function
… I've linked to the 3 related issues - should we triage them into a single issue?
… either #93 or #128 (#623 covers a bunch of unrelated aspects)
<gb> Issue 623 WebNN should support NPU and QDQ operations (by wchao1115) [v2] [opset] [feature request] [device selection]
<gb> Issue 128 WebNN should support int8 quantized models (by wchao1115) [v2] [opset] [feature request]
<gb> Issue 93 Add QuantizeLinear and DequantizeLinear for mixed precision (by kpu) [opset] [feature request]
reillyg: I'll clean them up now that I have the proper repo privs
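A minimal sketch of the explicit-dequantization pattern discussed above, assuming a dequantizeLinear() builder method as proposed in issue #93; the method name, signature and descriptor fields are illustrative, not settled spec text:

  // Hedged sketch only: dequantizeLinear() is the proposed op, not yet in the spec.
  const context = await navigator.ml.createContext();
  const builder = new MLGraphBuilder(context);

  // int8 weights plus scale/zero-point constants, dequantized explicitly so a
  // backend can fuse the dequantize with the conv2d that consumes it.
  const weightData = new Int8Array(32 * 3 * 3 * 3);  // placeholder weights
  const weights = builder.constant(
      {dataType: 'int8', dimensions: [32, 3, 3, 3]}, weightData);
  const scale = builder.constant(
      {dataType: 'float32', dimensions: [1]}, new Float32Array([0.05]));
  const zeroPoint = builder.constant(
      {dataType: 'int8', dimensions: [1]}, new Int8Array([0]));

  const input = builder.input(
      'input', {dataType: 'float32', dimensions: [1, 3, 224, 224]});
  const dequantized = builder.dequantizeLinear(weights, scale, zeroPoint);
  const output = builder.conv2d(input, dequantized);
  const graph = await builder.build({output});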
Platform capability detection
reillyg: there is a bit of overlap between this topic and the next one on future proof device selection
… capabilities detection and device selection go hand in hand
… capabilities depend on the platform you're on, and which device you pick
… I was looking at the examples of how WebGPU handles this; in WebGPU, the first step the dev goes through is requesting an adapter, at which point the system decides which adapter to use
… the question raised in Mike's proposal is whether we can have the developers give us the set of features they want and whether we can fulfill that
Mike: in WebGPU you get a set of limits, with defaults but also maximum limits
… the defaults are guaranteed to run everywhere; if you ask something above the defaults, you can run on the particular device, but no guarantee to run everywhere
… so instead of only describing what the device supports, establishing a baseline of support that can run everywhere
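As a concrete illustration of the WebGPU pattern Mike describes, using the existing WebGPU API (the specific limit chosen here is just an example):

  const adapter = await navigator.gpu.requestAdapter();

  // Default limits are guaranteed on every WebGPU implementation; anything
  // above them must be requested explicitly and is only granted if the
  // adapter supports it.
  const device = await adapter.requestDevice({
    requiredLimits: {
      maxStorageBufferBindingSize: Math.min(
          1024 * 1024 * 1024, adapter.limits.maxStorageBufferBindingSize),
    },
  });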
reillyg: it's hard to have a default operator set that works across platforms, mostly because of datatypes support
… it seems it's not possible to give a baseline for models and datatypes with a guarantee to run everywhere
… any framework built on top of WebNN will have to have code that responds to the capabilities of the platform and tailors the graph as it is being built to match those capabilities
… given the current landscape of hardware support, I'm not sure if it makes sense to create a baseline set
Mike: it would be interesting to see what we're missing from this core set given the support of TensorFlow.js in WebGPU
asully: to get to a common set, one approach would be to relax the requirement that NPU matches with an NPU device
reillyg: with a restriction to GPU, could we identify a baseline operator set?
dwayne: probably
reillyg: if we said "if you're OK using the GPU and do float32, you're OK in the baseline"
ningxin: we have a bunch of ONNX models in our tests; they use int16; we've added support for int16 in opLimits in a PR, with a WASM fallback
asully: we could limit the size of indexes to int32
… to avoid the performance penalty of falling back to WASM
Mike: our goal is to avoid developers inadvertently ending up not running on some platforms
<Zakim> reillyg, you wanted to discuss whether requesting additional operators up-front is practical.
Mike: we want to make sure it happens with a clear intent from the developers and with a sense they will build a fallback
reillyg: given that this is something that will be intermediated by frameworks, I'm wondering if this is something we want to do through an API or through developer tooling
… e.g. a flag to enable compatibility mode
<McCool> (my comment was to suggest a compatibility mode for testing, now covered)
reillyg: how does framework intermediation apply to the WebGPU case?
Mike: the engines tend to handle it for developers
… a developer tool setting sounds like a good idea
reillyg: the frameworks have a backup (e.g. go back to WASM), and it's relatively easy for them to detect when they go off limit
… we don't want frameworks to guess and check
mike: we could still expose maximum limits, but keep the defaults to a baseline
ningxin: we could also make an effort to promote the most Web-compatible datatypes within the transformers tooling community
… this would reduce situations where frameworks have to fall back
reillyg: so is the concern with opLimits more about compatibility than fingerprinting?
dom: the Web platform is compelling enough as a distribution platform that it can drive convergence toward a baseline
reillyg: what we need to figure out is the minimal supported operator set and data types (assuming GPU execution)
… and with an opt-in parameter to request more than the default
anssik: this might help also with fingerprinting
RafaelCintron: e.g. implementors could not provide this "upgrade" path in a privacy mode
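A hedged sketch of the framework-side adaptation and opt-in pattern discussed above; the opSupportLimits() name and result structure follow the Chromium prototyping discussion and are illustrative, not agreed:

  const context = await navigator.ml.createContext();

  // Query what the backend actually supports (structure is illustrative).
  const limits = context.opSupportLimits();
  const conv2dTypes = limits.conv2d?.input?.dataTypes ?? ['float32'];

  // Tailor the graph while building it: prefer float16 weights when
  // supported, otherwise stay on the float32 baseline.
  const weightType = conv2dTypes.includes('float16') ? 'float16' : 'float32';
  console.log(`building graph with ${weightType} convolution weights`);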
Future-proof device selection abstractions
Slideset: https://
reillyg: getting rid of the explicit device type makes sense
… the choice to me is between being very vague (with power preference) or a bit more specific: CPU-only, CPU+GPU, CPU+NPU, CPU+GPU+NPU
… to avoid compatibility issues with the ambiguity of power preference
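For reference, the two shapes being weighed; the first reflects MLContextOptions as specified at the time of this discussion, the second illustrates the vaguer-hints direction and its option values are not agreed:

  // Current shape (per the spec at the time of this discussion):
  const gpuContext = await navigator.ml.createContext(
      {deviceType: 'gpu', powerPreference: 'high-performance'});

  // Direction discussed here: drop deviceType, keep only broad hints, and let
  // the user agent decide device placement (options illustrative).
  const autoContext = await navigator.ml.createContext(
      {powerPreference: 'low-power'});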
Mike: as more WebNN adoption occurs, it would be great to get data on performance improvements actual apps would get if they had more guarantees on which device they run
reillyg: we would like to run further experiments to see if developers could actually target NPUs in a cross-platform compatible fashion given the level of diversity in NPUs on the market today
… so basically agree
Mike: if we had sufficient data to show that device type selection is necessary, we would be more open to it (as CoreML allows)
… our proposal is that initially we remove the concept, with an openness to reconsider it based on data
RafaelCintron: the Windows ecosystem is a lot more heterogeneous, so not having device selection feels even harder there
… it's hard for the browser to make a decision since the model is only known once the data is buffered
… maybe this could be done with a different API shape
… Re privacy, how different is it for WebGPU? it seems you could do the same with increasingly complex shaders
Mike: you can hide capabilities; you can't fully prevent it, but there are protections that have been added to WebGPU to mitigate the trivial privacy attacks, and we would like to see them in WebNN as well
Bryan: MLTensor heavily relies on the device type
… we have scenarios to re-use tensors between input/output
… I'm wondering how far heuristic control can go in allowing the proper allocation of memory resources
… I fear this will lead to unnecessary reallocations/copy
asully: the MLTensor explainer has an open question; an MLTensor is tied to an MLContext, tightly bound to a device
… if we change this, we will have to change the semantics of MLTensor, similarly to an MLGraph
… so there are solutions for that if we rescope the tensor to the graph
RafaelCintron: how do you share input/output tensors in that situation?
asully: that might lead to data copies
ningxin: facial recognition typically uses face detection then recognition through 2 different models
asully: there may need to be a way to declare that graphs share buffers
reillyg: if we switch to only power preference and possibly a prefer-cpu setting
… and a WebGPU device for interop
… ensuring consistent data sharing with the GPU,
… then it's up to the UA to deal with data placement
… not forcing developer decisions on graph placement if they don't have to sounds like an improvement
… I think the UA can make the right choice
asully: if it has the right information, yes
reillyg: if you create two graphs in a single context, this would hint they should run close together
kenji_baheux: there may be cases where you want to avoid using the GPU (e.g. because it's used for higher priority tasks), would there still be a way to indicate that?
anssik: which issue should we continue this discussion in?
asully: #749 is a good candidate
<gb> Issue 749 MLContextOptions.deviceType seems unnecessary outside of conformance testing (by mwyrzykowski) [device selection]
reillyg: #302 also exists, but is more vague
<gb> Issue 302 API simplification: context types, context options, createContext() (by zolkis) [v2] [device selection]
Customer feedback & collaborations
Universal Large-Language Model Deployment with ML Compilation
Slideset: https://
reillyg: are you looking at implementing support for compiling to the WebNN API? Any feedback on the API capabilities and what you might need to support it as a backend?
tianqi: so far we've been focusing on WebGPU backend; interop between WebNN & WebGPU would help ensure one doesn't block the other
… we've been looking at getting the compiler to generate JS using the WebNN API
reillyg: re WebGPU & WebNN, is your goal to implement non-WebNN operators using WebGPU?
tianqi: we want to be able to partition the tasks flexibly across the two
Mike: as you continue your work toward adopting WebNN, it would be great to provide feedback to the group, incl comparison with other backends in terms of performance
tianqi: +1 ; would love to see more contributions; WebGPU has been very useful to us, would be great to see the same with WebNN
anssik: the open source projects can be found under the mlc github org
https://
[Unity][Tutorial] TVM Unity BYOC
Transformers.js WebNN backend
[@@@ pre-recorded video by Joshua Lochner]
ningxin: webnn support is a planned feature of transformers.js v3, with the upcoming origin trial an opportunity to get feedback
mccool: should we prioritize discussion on dynamic shapes based on that input
ningxin: developers can override the dimensions to adapt e.g. to the camera size
… support for static k-v cache is an open issue in the Transformers project afaik
ONNX Runtime Web & WebNN EP
ningxin: one topic covered by MLTensor is the capability to support ONNX models with external data
… weights are kept in an external data file
… we're enabling the WebNN Execution Provider to support that
… we want to reduce peak memory consumption since it's very high and sometimes hits the limits
… there are different approaches under discussion to solve this, up to streaming directly network data to the memory
… ONNX with external data is a significant use case for supporting models with big weights
RafaelCintron: the ONNX backend team is happy for the contributions; they expressed some concerns for the lack of dynamic shapes in WebNN
McCool: the MLTensor prototype has strong typing which may impact the ability to stream data
reillyg: for CoreML / TFLite, the model is essentially streamed to a file
McCool: allowing streaming from disk is useful to limit the impact on system memory
Google Chrome Feedback revisited
<gb> Issue 453 Google Chrome Feedback on WebNN: aiming for broad device coverage and maintainability (by vsekhar) [process] [opset] [use case]
<McCool> Re streaming, see for example G10; another is NanoFlow
reillyg: a year ago, as we started looking at WebNN, we provided a high level set of feedback on the API
… most of this feedback is either already integrated, tracked in other issues, or will be answered during Origin Trial
… so we're OK with closing issue #453; we're happy to see progress of implementations across platforms
… we'll see how well developers can leverage it cross platforms during Origin Trial
asully: still some skepticism around the long term viability of high level operators
reillyg: (this is tracked in a specific issue)
anssik: could you link these specific issues from #453 and then close the issue?
… thanks for providing that feedback and for the clarity it brought in setting goals for the work
reillyg: we've seen the progress we want to see on the spec and our issues, so this encompassing issue no longer feels needed
Other standards positions
<gb> Issue 763 Request standards positions from Mozilla and WebKit (by reillyeon) [process]
Anssi: Mike, can you get a WebKit standards position on WebNN?
Mike: will do
Anssi: the more actionable the feedback, the better
Mike: we're reviewing the specification; the DeviceType was the main thing we had found objectionable
… we still need to do more work of mapping to our data framework
… work in progress
… I'll post a request for a standards position for WebKit
Anssi: the Mozilla rep Tarek isn't at TPAC this year, but we can work with him to file this
Dom: we can also file this as a WG
RafaelCintron: speaking for Edge, we're fully supportive of the work
reillyg: we're supportive of the work; can't commit to shipping yet, but are looking forward to lessons from the origin trial
Interop and cross-group coordination
Interop issues across different backends
<gb> Issue 739 Limited support for pad on CoreML backend (by philloooo) [operator specific] [interop]
<gb> Issue 128 WebNN should support int8 quantized models (by wchao1115) [v2] [opset] [feature request]
<gb> Issue 180 Should dilated pooling be supported (by fujunwei) [operator specific] [interop]
ningxin: one category I want to highlight is handling failure behavior
… e.g. out of bound indices for gather/scatter
… out of bound errors may create memory issues
<gb> Issue 486 Add "implementation consideration" about how out-of-bound indices of Gather/Scatter should be handled (by huningxin) [operator specific] [interop]
ningxin: what should the behavior be in this situation? the underlying platforms may have different approaches (error, clamping, normalized)
… some native APIs mention this behavior as simply undefined
… (beyond memory safety)
<gb> Issue 691 Divide-by-zero outcome should be standardized (by huningxin) [interop]
ningxin: in some cases, this may vary across hardware vendors
… #487
<gb> Issue 487 Should `axes` be required parameter for layerNormalization build method? (by huningxin) [operator specific] [interop]
ningxin: #481
<gb> Issue 481 Should `scale` and `bias` be required inputs for `batchNormalization` op? (by huningxin) [operator specific] [interop]
ningxin: these two issues are about the optional attribute - we set a default value for them, but the actual default may vary across platforms
… for optional operands of #481, e.g. batchNormalization
… when they're not present, the implementation has to provide a default value which increases the complexity of implementation
… the question is whether we should make them required and leave the cost to the framework
… some native platforms support the optional operand concept, e.g. CoreML
… I propose to close #383
<gb> Issue 383 Need to restrict the value of alpha to be positive for elu operation (by lisa0314) [operator specific] [interop]
ningxin: with the input we got, we concluded that there is no such restriction
… CoreML and DirectML docs say so explicitly
… TFLite doesn't support the alpha parameter but can be emulated
… we already proposed to close #324
<gb> Issue 324 Simplify the operand layout support of conv2d and pooling 2d operations (by huningxin) [feature request] [operator specific] [interop]
anssik: any objection to close #383?
… so we can close it
asully: we should try to avoid as much implementation defined behavior in the spec as possible
… in particular for situations that would end up with very different behaviors across platforms, e.g. #486
<gb> Issue 486 Add "implementation consideration" about how out-of-bound indices of Gather/Scatter should be handled (by huningxin) [operator specific] [interop]
<Zakim> reillyg, you wanted to discuss defaults.
asully: we should define a behavior implementable across platforms, even if it comes with some wrapper cost in some platforms
reillyg: the baseline is preventing any out-of-bounds memory access
… the spec should define a common behavior
reillyg: +1 to avoiding implementation defined behaviors
… defaults in the graph builder API should be based on developer ergonomics
… there are cases where explicitly choosing NOT to have defaults provides a better developer experience
… this should be looked at on an op-by-op basis
… we shouldn't simply inherit the backend default
RafaelCintron: +1 to both
<RafaelCintron> https://
<RafaelCintron> https://
RafaelCintron: if we can do it performantly, we should do it
… both WebGL and WebGPU have very similar things
… WebGL leaves implementation flexibility in how to deal with out of bound situations
… WebGPU also has a list of potential behaviors
… for us, if we can clamp performantly, that's the best
reillyg: the priority of constituencies should be security, conformance and performance-if-it-really-matters
McCool: looking at a couple of cases: scattering out of bounds doesn't matter; but gathering out of bounds is problematic - we should discuss how it is handled
asully: re throwing runtime exceptions, there is no cross-platform way to throw that kind of exception, e.g. from a GPU
… I would be more supportive of using default values
Dwayne: is there any precedent for this we could use?
reillyg: for divide by zero, there was a suggestion to look at what different hardware does
asully: we probably want to ensure we always avoid runtime exceptions
reillyg: we need to measure the perf impact of clamping
dwayne: for out of bound, the two options are clamping or returning 0/NaN
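A plain-JavaScript illustration of the two candidate behaviors dwayne lists for out-of-bounds gather indices, purely to make the options concrete (not spec text):

  function gather1d(input, indices, mode = 'clamp') {
    return indices.map((i) => {
      if (i >= 0 && i < input.length) return input[i];
      if (mode === 'clamp') {
        // clamp out-of-bounds indices to the valid range
        return input[Math.min(Math.max(i, 0), input.length - 1)];
      }
      return 0;  // 'zero' mode: out-of-bounds reads produce 0 (or NaN for floats)
    });
  }

  console.log(gather1d([10, 20, 30], [0, 2, 5], 'clamp'));  // [10, 20, 30]
  console.log(gather1d([10, 20, 30], [0, 2, 5], 'zero'));   // [10, 20, 0]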
McCool: we should clarify what platforms mean by undefined behavior for scatter (is it undefined order?)
… non-deterministic atomic order
ningxin: for non-deterministic hardware, should we simply ensure safety and not try to define anything beyond that?
reillyg: we should see if that introduces a fingerprinting surface
mccool: which would be really expensive to fix performantly
Core operator set
<gb> Issue 573 Core operator set (by philloooo) [question] [opset]
reillyg: it seems like there is a minimum core operator set we can define (with data types) with some additional research
… there is an underlying question between high-level/low-level operators (the former being decomposable into the latter)
… but that can be dealt with as we compose that core operator set, following our discussions on opLimits
… what we consider core will evolve based on what we see as needed by modern models
… we should have a good definition of what it takes to add an operator to the spec, including to the minimum supported set
dwayne: a primitive is one that cannot be further decomposed
reillyg: but do we want to add all those that cannot be decomposed?
… criteria for inclusion should be "can be implemented on at least two platforms across several data types"
dwayne: I want to support being proactive in adding operators that are available cross-platform
reillyg: we should look also at TOSA, LINALG
ningxin: these two are good targets since they're used by hardware vendors as target
NeilT: we should look at MLIR
asully: the other side of the question is what to do with the non-core operators
jsbell: how do we reflect that in practice in the spec? do we categorize operators?
reillyg: @@@
dom: re beyond core, I think this will be driven by implementation pressure to limit operators to those that actually provide a performance boost for enough platforms/frameworks
reillyg: getting implementation feedback across platforms and engines will provide useful push-and-pull
… if it can be implemented by two different engines, it can be in the spec; if it can be implemented everywhere, it can be in the core set
… it's likely that any op can be implemented everywhere, but not for all data types
… that's where the constraints come in
anssik: implementability across backends may be more critical than across engines
reillyg: re high-level vs decomposed low-level operators, this will likely be based on collecting performance data in practice
ningxin: the high level ops come from optimized support in some of the native platforms
… but this could be replaced by the backend compiler detecting and applying the optimized path
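To make "decomposable into primitives" concrete, a hedged sketch of a higher-level op, softplus(x) = ln(1 + exp(x)), expressed with lower-level builder ops; descriptor field names follow the spec at the time and the example is illustrative only:

  const context = await navigator.ml.createContext();
  const builder = new MLGraphBuilder(context);

  const x = builder.input('x', {dataType: 'float32', dimensions: [1, 1000]});
  const one = builder.constant(
      {dataType: 'float32', dimensions: [1]}, new Float32Array([1]));

  // softplus decomposed into exp, add (broadcasting the scalar one) and log
  const softplus = builder.log(builder.add(builder.exp(x), one));
  const graph = await builder.build({softplus});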
dom: for some operations, there may also be an engine-specific stance on implementability for other reasons (e.g. fingerprinting)
MLTensor
PR #754
<gb> Pull Request 754 Add MLTensor explainer (by a-sully) [webgpu interop]
asully: open to merging this explainer soon, although the recent discussion on MLDeviceType will impact it
… since it affects the buffer allocation
… the big open questions are about WebGPU interop
… how to share a buffer between WebNN & WebGPU (e.g. for video feed processing)
… how efficient can this be made? on systems where this can be expressed as GPU commands, this is simple
… when it's not possible, we might have to use the CPU for synchronization (with a performance hit)
… the goal should be the UA has enough information to appropriately allocate the buffer
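For orientation, the usage shape proposed in the MLTensor explainer (PR #754): tensors created on the context, a compiled graph dispatched against them, results read back. Method names and descriptor flags follow the explainer and may still change; `graph` is assumed to come from an earlier MLGraphBuilder.build() call.

  const inputTensor = await context.createTensor(
      {dataType: 'float32', dimensions: [1, 3, 224, 224], writable: true});
  const outputTensor = await context.createTensor(
      {dataType: 'float32', dimensions: [1, 1000], readable: true});

  context.writeTensor(inputTensor, new Float32Array(1 * 3 * 224 * 224));
  context.dispatch(graph, {input: inputTensor}, {output: outputTensor});
  const result = await context.readTensor(outputTensor);  // ArrayBuffer of results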
dom: I don't think we should block on the device type discussion before merging the explainer
RafaelCintron: GPUExternalBuffer as a type is still an open question (in comparison to GPUBuffer)
asully: it would have more restrictions than a simple GPUBuffer
… I'd be happy to simply use GPUBuffer
McCool: rather than talking about device types, we should talk about which graphs they're communicating to
RafaelCintron: so long as we don't regress scenarios where connecting several graphs end up creating copies
… being able to ensure graphs use the same resource domains
reillyg: with the device type discussion we had earlier, my proposal is that MLContext becomes that resource domain
… if you create an MLContext and create a bunch of graphs, they should share that resource domain, and the buffers should be accessible to these graphs
… are there really cases where you would have buffers before having your models ready to execute?
RafaelCintron: there is at least the constant buffer case as an exception; but for input/output you're probably right
reillyg: yeah, I would keep the MLConstant operand upload as a separate topic
ningxin: there are complex scenarios where a model output can be used by two models that are running in different devices
reillyg: an MLTensor can only be used by graphs created by the same context today
… we can move them together from one unit to another, but they're not expected to be split
… an MLContext represents resources that can be cheaply shared with each other
<gb> Issue 760 Support building graphs from `MLTensor` containing constants (by bbernhar) [feature request]
Anssi: so let's proceed with merging the explainer and iterate from there
reillyg: I'd like to understand the relationship between this and the work that Austin has been doing in the Chromium prototype to eagerly upload constants to the GPU process
… if the goal is to allow to stream constants into the graph builder before you call build
… then that's an implementation detail that can be added to the existing constant function
… what that doesn't cover is a case of a constant being reused by multiple graphs
… is that your specific use case?
BryanB: correct
reillyg: if you're compiling the graph with the constant values, that compilation may optimize the constant values and change them - can they still be shared then?
… each graph might optimize operators differently and use different constant values
BryanB: the unoptimized copy might be re-used for another build
RafaelCintron: the reason we want to do this specifically is because we found scenarios where buffers need to be shared across graphs
… this came up in a discussion with an MLBuilder that would be used to create two graphs
… (which is the alternative approach to defining a new MLTensor type)
McCool: constant operands would also be useful to cache a model across different contexts
… if I have a big model that I want to run in a WebGPU execution *and* in a WebNN execution
… I wouldn't want to have to re-download the model
reillyg: there are two parts: sharing constant data between a graph you built on WebNN on one web site and on WebGPU on another site goes beyond SOP limitations; and we already have an explicit assumption that graph building can create an optimized copy, destroying the original
McCool: how important is it to get interop between WebNN & WebGPU implementations?
asully: different platforms have different handling of constants (e.g. they're handled as a file on CoreML)
… MLTensor would be used for input/output; sharing that data between WebGPU and WebNN is out of scope
reillyg: right - they would have to handle as input, which may come with a performance cost
ningxin: this emerged when we restricted MLBuilder to use a single graph
… the reason for this was to ensure the timely release of resources
… doesn't destroy help with resource lifecycle?
reillyg: the ability to constrain how resources get consumed by frameworks is another optimization this allowed
reillyg: DirectML will copy any constant you give it
… if we started eagerly copying constants to the GPU process and having them in the processing pipeline ASAP, would it help DirectML?
RafaelCintron: if you make this "owned by DirectML"…
reillyg: in the coreml backend, we would stream constants in the weights file
… we could reuse the same file for multiple graphs
asully: so I think we can confirm there is a use case to allow re-use of constants across graphs
reillyg: e.g. an optional parameter to the creation of constants
MLConstantOperand
<gb> Issue 668 Do we need an `MLConstantOperand`? (by a-sully) [question] [interop]
<gb> Pull Request 747 Introduce MLConstantOperand (by a-sully)
<jsbell> I haven't been to the Anaheim Packing District but I've heard good things about it. I was hoping to check it out.
reillyg: a couple of pieces to this: whether we want to encode the constant parameter requirement in the API (a note on the bias parameter for convolution saying it must be a constant, either through a dedicated type or with a property on the parameter)
… we could specify that implementations do constant folding - if you take a constant and pass it to the add node with another constant, the implementation would take care of making the result a constant for the backend framework
… having an encouragement to provide a constant for these parameters and an ability for the implementation to compute the constantness of a parameter would provide both interop and performance benefits
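A small sketch of the distinction being drawn: constant() operands are known when build() runs, so under the folding proposal an implementation could evaluate constant-only expressions and hand backends a true constant, while input() operands are only known at dispatch. builder.constant()/input()/conv2d() are in the spec; the folding behavior itself is the proposal under discussion.

  const context = await navigator.ml.createContext();
  const builder = new MLGraphBuilder(context);

  const weights = builder.constant(
      {dataType: 'float32', dimensions: [16, 3, 3, 3]},
      new Float32Array(16 * 3 * 3 * 3));
  const bias = builder.constant(
      {dataType: 'float32', dimensions: [16]}, new Float32Array(16));
  const half = builder.constant(
      {dataType: 'float32', dimensions: [1]}, new Float32Array([0.5]));

  // constant * constant: foldable at build time into a true constant bias
  // for backends (e.g. CoreML) that require one.
  const scaledBias = builder.mul(bias, half);

  const x = builder.input('x', {dataType: 'float32', dimensions: [1, 3, 224, 224]});
  const y = builder.conv2d(x, weights, {bias: scaledBias});
  const graph = await builder.build({y});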
dwayner: what does it mean to be a constant to CoreML? is it limited to CPU or also GPU?
reillyg: it needs to be present at the time the graph is created
dwayner: could it be created as an MLTensor?
asully: the content of the constant needs to be known at compile time
dwayner: I see the value of being able to query a constantness property, whether it's required or for perf improvement
… I'm not sure it needs to be exposed to the API
… I'm not entirely confident that constant folding would solve all the cases I saw in my research
… the emulation in most cases would be adding one op
… I would like to minimize the number of cases where we require constantness
reillyg: this ties in with the question of sharing constants
… my intuition would be to assume constantness until we find a reason not to
… it's easier to change a requirement from const to non-const than the reverse
dwayner: but again, except for lstm and gru, it's only one op to emulate
reillyg: if it's only limited to a small number of CoreML ops, I agree we could just decompose
asully: all the CoreML ops that require constantness are high level ops that need decomposition in the spec in any case
Implementation plans and trials
anssi: we're iterating at CR stage, which is a call for implementation; we have implementations across 3 backends, one of which implements the full API, with the other backends a bit behind
… multiple OSes, with the API behind a flag in Chrome and Edge
Mike: we would be interested in data on which models run on the NPU across a wide range of devices, vs falling back to CPU/GPU
reillyg: is there a way to detect on CoreML whether something ran on the NPU?
Mike: I think so but will double check
anssik: figuring out the right metrics for the origin trial is important
rafael: perf comparison, compilation time, top 40 operators and spread of usage, context loss, memory usage
… we need to have an API that can remain stable for a few months to collect useful data
… the WebGPU OT lasted several months, with multiple breaking changes which wasn't ideal
reillyg: the expectation is that we're not going to ship at the conclusion of the OT; it's a data gathering experiment and the API would be turned off at the end of the period
… we expect that most developers will be using frameworks which will get back to using CPU or GPU delegates after the period
RobKochman: we have to think through what the developers would actually do and what success would look like for them
McCool: do we want to collect data on which models?
rafael: we wouldn't know it from telemetry but through surveys
jsbell: we want developers to do A/B testing across the different execution providers, since I'm not sure we could tease that out on our end
mccool: maybe the frameworks could help with the A/B testing
Advancement on the W3C Rec Track
Wide review status
<gb> … [Topic: Machine Learning] [Focus: Web architecture (pending)] [Focus: Security (pending)] [Focus: Privacy (pending)]
Anssik: I suggest we integrate the responses Rafael and reillyg gave as non-normative text in the spec
… the TAG is also asking about future-proofing the API against hardware evolution
reillyg: clearly we have thought about it, we still need implementation feedback to determine whether our solution works
anssik: the TAG is also asking about multi-platform/multi-engine implementations; we have multiple backend implementations, all major browser vendors in the group, 3-ish OS support
rafael: +1
W3C “living standards” expectations
dom: there's no "living standard" stamp at W3C, you can remain at CR stage as long as the WG is operating
… either you stay at CR and every two years you publish a CRS and go through wide review
… iterate on 2-year cycle
… the more traditional path is to go to Recommendation status, which requires going from CR to Recommendation by demonstrating interop experience across all features in 2 or more impls
… rarely everything is perfect, but you need to demonstrate the standard is interoperable at this stage
… my personal perspective is that going that final step, as painful as it is, helps ensure convergence
… and reflects what end users need from the technology
… WebNN now has a significant number of ops; the risk if we stay iterating at CR is we never take a knife to carve out those ops that don't get enough implementation experience
… two engines willing to ship across OSes, backends
… without going to REC, we could iterate on CR
anssik: WebRTC Recommendation experience?
dom: adding post corrections is cheap from process perspective
… I'm driving this in WebRTC WG where we made sufficient progress that can say it works well for that group and could work here too
jsbell: thanks dom this is what I wanted to learn
… we don't need to make this decision now
Incubations
Custom ops
<gb> Issue 559 Control flow operations: if, while (by philloooo) [opset] [feature request]
<Zakim> reillyg, you wanted to ask whether any current backends support building custom ops using a subgraph.
reillyg: providing more structure to the underlying compiler is likely to produce good results
… however, I'm not sure the backends we've been prototyping with currently provide this support
ningxin: I think CoreML does
dwayne: this maybe could be made to work with DirectML
reillyg: but this would require pattern matching
dwayne: but a very simple pattern matching
reillyg: but it risks creating performance cliffs; a smarter compiler would make me feel more confident
asully: from a Web platform perspective, the question is whether we need to be able to provide hints e.g. reusable subgraphs, that can be passed to the lower level
… where that is implemented should remain an implementation detail
jsbell: very cool proposal
<jsbell> https://
jsbell: the StableHLO approach is very similar and may be a way to annotate the subgraphs
asully: there are aspects of the StableHLO compositor we wouldn't want to expose to the Web
… we wouldn't want to have magic values everywhere - we would have to consider the decomposition, not doing string matching
gdti: cool proposal, two points: one minor: the PoC compares the WASM built-in function to standard WASM, so the perf isn't representative of the fallback path given the cost of going from WebNN to WASM
… from a Web platform perspective, exposing this fully to the Web might be too challenging to maintain
asully: with this custom op proposal, would this open the way to remove from the spec many of the WebNN high level operators which can be expressed in terms of lower level WebNN ops?
… I'm supportive of making the API more focused on lower level APIs à la MLIR
Built-in APIs for translation and prompting
Slideset: https://
<kenji_baheux> Minor correction, the 13 partners [...] numbers are for what preceded the early preview program (a more tight early exploration with a few partners). The early preview program has orders of magnitude more participants.
<Domenic> https://
ningxin: can a developer expect to get the same level of performance from these tasks API and WebNN?
domenic: the task API would probably hit the hardware more directly so might have a bit of a perf advantage, but I assume the goal for WebNN is to make this imperceptible
Model management
(Alternative 5)
<kenji_baheux> also, if it's rare enough, the likelihood of benefitting from sharing it across origins should be low (for most users).
McCool: I will be presenting a breakout session on this on Wednesday.
Wrap up
anssik: thank you for your active participation and great discussions
… interested folks are welcome to join us for a dinner at Anaheim Packing District 2.5 miles from the meeting venue