Meeting minutes
<gb> Issue 25 WebML WG - TPAC 2024 agenda (by anssiko)
Repository: webmachinelearning/webnn
Welcome
anssik: this is our first F2F as a WG, despite us having existed for a long time
… I'm Anssi Kostiainen from Intel, chair of the WG, supported by Dom as our staff contact; thanks to the TPAC organizers for making this happen
… great to see both long time participants and new faces
… this WG has now all the major browser vendors as participants, with Mozilla joining recently
… a diverse set of experts around AI, with different areas of expertise and different backgrounds
… incl library makers that help us calibrate our work to real world requirements
… people from industry and research backgrounds - let others know they can get involved
[round of intros]
<Rachel> Rachel Present+ Rachel_Yager
Charter orientation
Anssi: we are 2 groups: the Web Machine Learning Working Group, and its eponymous Community Group
… the WG deliverables include the WebNN API (the core focus of the technical work) and ethical guidelines (a topic on which we will get a presentation from OpenAI)
… and work on the Model Loader API, which is blocked on a standardized model format
… the CG is responsible for incubating proposals some of which may graduate later to standardization
Dom: note that the WG charter is expiring next year, so we'll need to start discussions about potential additions in the next few weeks/months
Anssi: the CG allows us to do exploratory work - that's how WebNN itself started
Anssi: We're also chartered to coordinate with other groups: WebGPU WG (with related topics on our agenda, e.g. MLTensor)
… the WebAssembly WG is also an important related group (with Deepti in particular helping coordinate)
… also important integration questions around WebRTC which Ningxin explored a couple of years ago
… Also working closely with the Technical Architecture Group, which helps us make sure our API fits well in the broader platform
Ethics
Democratizing Human-Centered AI with Visual Explanation and Interactive Guidance
Anssi: please welcome Jay, a safety researcher at OpenAI, who will explain his research on making AI more accessible through novel interfaces
… You're also affiliated with Georgia Tech and have published many papers and open source tools
[demo of https://
anssi: thank you for this very comprehensive presentation
… a specific intersection with our work you alluded to is possible integration of some of these tools in browser developer tools
Kenji: how hard is it to get the model to explain its behavior (e.g. in the GAM Changer example around age/risk)?
Jay: this particular model was a simple regression model where it is easier to identify the particular source of the model behavior
Rachel: why are human hands so problematic to AI image generators?
Jay: the geometry of hands has been really hard for models to capture, but they're improving
McCool: re WebNN, is there any gap related to your work?
Jay: my tools are mostly based on TensorFlow.js
… there may need to be different modalities of input, with different ways of embedding the vectors, to cater to the emerging needs from generative AI
Anssi: thanks again Jay, we hope to work more with you
Spec orientation
<gb> Issue 375 Support for transformers (by dontcallmedom) [v2] [opset]
<gb> Issue 559 Control flow operations: if, while (by philloooo) [opset] [feature request]
<gb> Issue 666 Reconsider `MLOperand` methods (by a-sully) [question]
https://
<reillyg> I propose closing webmachinelearning/
<gb> Issue 11 Executing operations (by anssiko) [feature request]
<McCool> two things that I noticed: int64 issues (relates to interop) and use of constants with MLTensor (enable potentially useful mechanism for model management)
<reillyg> I propose closing webmachinelearning/
<gb> Pull Request 754 Add MLTensor explainer (by a-sully) [webgpu interop]
<gb> Pull Request 541 Add MLBuffer exploration doc (by a-sully) [webgpu interop]
<jsbell> Propose closing this group of "simplify" issues unless someone strongly advocates for them soon: #474, #470, #374, #324
<gb> Issue 374 Simplify `MLPool2dOptions` by removing the `outputSizes` option (by huningxin) [operator specific]
<gb> Issue 474 Simplify `resample2d` op (by huningxin) [operator specific]
<gb> Issue 470 Simplify `matmul` op (by huningxin) [operator specific]
<gb> Issue 324 Simplify the operand layout support of conv2d and pooling 2d operations (by huningxin) [feature request] [operator specific] [interop]
<jsbell> (sorry Ningxin!)
<BryanB> Issue 749 MLContextOptions.deviceType seems unnecessary outside of conformance testing
<gb> Issue 749 MLContextOptions.deviceType seems unnecessary outside of conformance testing (by mwyrzykowski) [device selection]
<reillyg> The most recent ~5 issues don't have labels. I don't have permission to add them.
<Dom7> reillyg: it seems that issue #11 was opened a long time ago and seems like it can be closed
<gb> Issue 11 Executing operations (by anssiko) [feature request]
anssik: no one objecting to #11 getting closed, so let's do it
reillyg: there are two open pull requests on the MLTensor space; can we close the generic one and keep only the specific one? can we land the PR for the explainer?
Austin: I'll close #541
<gb> CLOSED Pull Request 541 Add MLBuffer exploration doc (by a-sully) [webgpu interop]
anssik: we should look at merging the explainer after our MLTensor discussion later today
jsbell: propose closing this group of "simplify" issues unless someone strongly advocates for them soon: #474, #470, #374, #324
… we should either do them soon or abandon them
ningxin: I think we can close #324
reillyg: in the coming implementation, I've added automatic transposes
… I think we can do without that particular simplification
anssik: so let's close #324
<gb> Issue 324 Simplify the operand layout support of conv2d and pooling 2d operations (by huningxin) [feature request] [operator specific] [interop]
<Domenic> (Has 2d vs. 2D been discussed? https://
anssik: re #470, do we need to retitle it? open a different issue?
<gb> Issue 470 Simplify `matmul` op (by huningxin) [operator specific]
ningxin: I'll retitle it to reflect its status
dwayner: I'll open a new issue instead, linking back to that one
<BryanB> I think we can close the old MLBuffer PRs in favor of the MLTensor explainer: #542 #543 #544
<gb> Issue 544 [MLBuffer] Support for MLBuffer in graph execution (by bbernhar)
<gb> Issue 543 [MLBuffer] Uploading/downloading tensor data (by bbernhar)
<gb> Issue 542 [MLBuffer] Creation and representing MLBuffer on a XPU devices (by bbernhar) [webgpu interop]
dwayner: re #374, we should probably align pool with conv - I'll propose next step in the issue
<gb> Issue 374 Simplify `MLPool2dOptions` by removing the `outputSizes` option (by huningxin) [operator specific]
ningxin: closing #474 SGTM
<gb> Issue 474 Simplify `resample2d` op (by huningxin) [operator specific]
New features
A refreshed analysis of popular models
<gb> Issue 375 Support for transformers (by dontcallmedom) [v2] [opset]
Slideset: https://
33 Models, 12 Operators, proposed IDL, data types
<gb> Issue 375 Support for transformers (by dontcallmedom) [v2] [opset]
<gdti> Quick note: Memory64 for Wasm is available both on Chrome & Firefox nightly behind a flag for the last year or so - they're really close to being enabled by default and the proposal is stable just pending a phase 4 poll on closing out the last few spec issues
<gdti> We'd love folks to try it out and let us know if something isn't working as expected
ningxin: we already have some implementation experience with these
<Zakim> anssik, you wanted to propose next steps for #375 and to discuss priority of #559 and to bump the priority of #666
<gb> Issue 375 Support for transformers (by dontcallmedom) [v2] [opset]
<gb> Issue 559 Control flow operations: if, while (by philloooo) [opset] [feature request]
<gb> Issue 666 Reconsider `MLOperand` methods (by a-sully) [question]
dwayne: it has informed some of the proposals
anssik: I'm interested in general feedback on this approach to wave 3 operators, given current prototyping efforts
ningxin: scatterND helps with performance, not just functionality
jsbell: I ♥ the wave nomenclature - it may be useful to use it for our issues in the repo
… how far are we along for the implementation? any sense of when impl/spec will be ready to advance to origin trial?
dwayne: for ops completeness in the DML backend, maybe two weeks of implementation work
… I hope to add the ops to the spec in the same timeframe
McCool: how many of these models are actually useful in the Web context?
dwayne: huge models are good to demonstrate viability, but they're so big they're likely not practical to use directly in the browser
McCool: +1
ningxin: they were identified through popularity in transformers.js, so already used in the browser context
NeilT: any consideration of the Arm TOSA operator set?
dwayne: I've looked at it and have the data on it that I'll share
anssik: hearing overall support for the approach; no specific plan for origin trial yet
… in terms of spec, do you expect any specific challenge?
dwayne: it should be pretty straightforward given our experience; a few interesting questions around invalid index conditions
ningxin: int4/uint4 data types will need attention and wide review
Quantization and dequantization
<gb> Issue 93 Add QuantizeLinear and DequantizeLinear for mixed precision (by kpu) [opset] [feature request]
<gb> Issue 128 WebNN should support int8 quantized models (by wchao1115) [v2] [opset] [feature request]
<gb> Issue 623 WebNN should support NPU and QDQ operations (by wchao1115) [v2] [opset] [feature request] [device selection]
reillyg: adding operators to explicitly dequantize int8 and int4 values to float16/float32 makes a lot of sense for the spec
… an open question: do we agree that explicit dequantization is the approach we want to take?
… backends can detect a dequantize/conv2d pattern, so expressing this in the API itself is maybe unnecessary
dwayne: that matches the experience in ONNX
ningxin: quantize/dequantize also makes it easier for the backends to fall back
reillyg: if we expect that backends may strip out q/deQ pairs, how do we want to specify the behavior of the full graph from a precision perspective?
… how useful would the requantize operator be?
dwayne: I have seen hundreds of models using the quantize operator
ningxin: the behavior is up to the implementation, but how do we specify this?
dwayne: does it change the overall precision?
reillyg: in TFlite, there is a top-level flag to control this
jsbell: one other approach we've seen, the scale is attached to the tensor
… I'm assuming we want to be explicit and not pursue this, but want to make sure we know of the options
reillyg: this also leads to a huge explosion of the type system
dwayne: I'll go with explicit
reillyg: if we support quantize/dequantize, what data types do we support? int8/uint8 seem obvious, int4/uint4 are more complicated
… representing int4 on the Web and across backends is challenging
dwayne: from a Web API perspective, they would be exposed as Int8Array
… from an implementation perspective, I'm not sure how to handle int4 as input
reillyg: quantization is most useful for weights; do we know of any need of it for input/output?
McCool: I've seen quantization for activation as well
reillyg: i.e. using it as another kind of an activation function
… I've linked to the 3 related issues - should we triage them into a single issue?
… either #93 or #128 (#623 covers a bunch of unrelated aspects)
<gb> Issue 623 WebNN should support NPU and QDQ operations (by wchao1115) [v2] [opset] [feature request] [device selection]
<gb> Issue 128 WebNN should support int8 quantized models (by wchao1115) [v2] [opset] [feature request]
<gb> Issue 93 Add QuantizeLinear and DequantizeLinear for mixed precision (by kpu) [opset] [feature request]
reillyg: I'll clean them up now that I have the proper repo privs
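A minimal sketch of the explicit-dequantization pattern discussed above, assuming a dequantizeLinear() builder method as proposed in issue #93; the method name, signature and descriptor fields are illustrative, not settled spec text:

  // Hedged sketch only: dequantizeLinear() is the proposed op, not yet in the spec.
  const context = await navigator.ml.createContext();
  const builder = new MLGraphBuilder(context);

  // int8 weights plus scale/zero-point constants, dequantized explicitly so a
  // backend can fuse the dequantize with the conv2d that consumes it.
  const weightData = new Int8Array(32 * 3 * 3 * 3);  // placeholder weights
  const weights = builder.constant(
      {dataType: 'int8', dimensions: [32, 3, 3, 3]}, weightData);
  const scale = builder.constant(
      {dataType: 'float32', dimensions: [1]}, new Float32Array([0.05]));
  const zeroPoint = builder.constant(
      {dataType: 'int8', dimensions: [1]}, new Int8Array([0]));

  const input = builder.input(
      'input', {dataType: 'float32', dimensions: [1, 3, 224, 224]});
  const dequantized = builder.dequantizeLinear(weights, scale, zeroPoint);
  const output = builder.conv2d(input, dequantized);
  const graph = await builder.build({output});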
Platform capability detection
reillyg: there is a bit of overlap between this topic and the next one on future proof device selection
… capabilities detection and device selection go hand in hand
… capabilities depend on the platform you're on, and which device you pick
… I was looking at the examples of how WebGPU handles this; in WebGPU, the first step the dev goes through is requesting an adapter, at which point the system decides which adapter to use
… the question raised in Mike's proposal is whether we can have the developers give us the set of features they want and whether we can fulfill that
Mike: in WebGPU you get a set of limits, with defaults but also maximum limits
… the defaults are guaranteed to run everywhere; if you ask something above the defaults, you can run on the particular device, but no guarantee to run everywhere
… so instead of only describing what the device supports, establishing a baseline of support that can run everywhere
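As a concrete illustration of the WebGPU pattern Mike describes, using the existing WebGPU API (the specific limit chosen here is just an example):

  const adapter = await navigator.gpu.requestAdapter();

  // Default limits are guaranteed on every WebGPU implementation; anything
  // above them must be requested explicitly and is only granted if the
  // adapter supports it.
  const device = await adapter.requestDevice({
    requiredLimits: {
      maxStorageBufferBindingSize: Math.min(
          1024 * 1024 * 1024, adapter.limits.maxStorageBufferBindingSize),
    },
  });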
reillyg: it's hard to have a default operator set that works across platforms, mostly because of datatypes support
… it seems it's not possible to give a baseline for models and datatypes with a guarantee to run everywhere
… any framework built on top of WebNN will have to have code that responds to the capabilities of the platform and tailors the graph as it is being built to match those capabilities
… given the current landscape of hardware support, I'm not sure if it makes sense to create a baseline set
Mike: it would be interesting to see what we're missing from this core set given the support of TensorFlow.js in WebGPU
asully: to get to a common set, one approach would be to relax the requirement that NPU matches with an NPU device
reillyg: with a restriction to GPU, could we identify a baseline operator set?
dwayne: probably
reillyg: if we said "if you're OK using the GPU and do float32, you're OK in the baseline"
ningxin: we have a bunch of ONNX models in our tests; they use int16; we've added support for int16 in opLimits in a PR, with a WASM fallback
asully: we could limit the size of indexes to int32
… to avoid the performance penalty of falling back to WASM
Mike: our goal is to avoid developers inadvertently ending up not running on some platforms
<Zakim> reillyg, you wanted to discuss whether requesting additional operators up-front is practical.
Mike: we want to make sure it happens with a clear intent from the developers and with a sense they will build a fallback
reillyg: given that this is something that will be intermediated by frameworks, I'm wondering if this is something we want to do through an API or through developer tooling
… e.g. a flag to enable compatibility mode
<McCool> (my comment was to suggest a compatibility mode for testing, now covered)
reillyg: how does framework intermediation apply to the WebGPU case?
Mike: the engines tend to handle it for developers
… a developer tool setting sounds like a good idea
reillyg: the frameworks have a backup (e.g. go back to WASM), and it's relatively easy for them to detect when they go off limit
… we don't want frameworks to guess and check
mike: we could still expose maximum limits, but keep the defaults to a baseline
ningxin: we could also make an effort to promote the most Web-compatible datatypes within the transformers tooling community
… this would reduce situations where frameworks have to fall back
reillyg: so is the concern with opLimits more about compatibility than fingerprinting?
dom: the Web platform is compelling enough as a distribution platform that it can drive convergence toward a baseline
reillyg: what we need to figure out is the minimal supported operator set and data types (assuming GPU execution)
… and with an opt-in parameter to request more than the default
anssik: this might help also with fingerprinting
RafaelCintron: e.g. implementors could not provide this "upgrade" path in a privacy mode
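A hedged sketch of the framework-side adaptation and opt-in pattern discussed above; the opSupportLimits() name and result structure follow the Chromium prototyping discussion and are illustrative, not agreed:

  const context = await navigator.ml.createContext();

  // Query what the backend actually supports (structure is illustrative).
  const limits = context.opSupportLimits();
  const conv2dTypes = limits.conv2d?.input?.dataTypes ?? ['float32'];

  // Tailor the graph while building it: prefer float16 weights when
  // supported, otherwise stay on the float32 baseline.
  const weightType = conv2dTypes.includes('float16') ? 'float16' : 'float32';
  console.log(`building graph with ${weightType} convolution weights`);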
Future-proof device selection abstractions
Slideset: https://
reillyg: getting rid of the explicit device type makes sense
… the choice to me is between being very vague (with power preference) or a bit more specific: CPU-only, CPU+GPU, CPU+NPU, CPU+GPU+NPU
… to avoid compatibility issues with the ambiguity of power preference
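For reference, the two shapes being weighed; the first reflects MLContextOptions as specified at the time of this discussion, the second illustrates the vaguer-hints direction and its option values are not agreed:

  // Current shape (per the spec at the time of this discussion):
  const gpuContext = await navigator.ml.createContext(
      {deviceType: 'gpu', powerPreference: 'high-performance'});

  // Direction discussed here: drop deviceType, keep only broad hints, and let
  // the user agent decide device placement (options illustrative).
  const autoContext = await navigator.ml.createContext(
      {powerPreference: 'low-power'});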
Mike: as more WebNN adoption occurs, it would be great to get data on performance improvements actual apps would get if they had more guarantees on which device they run
reillyg: we would like to run further experiments to see if developers could actually target NPUs in a cross-platform compatible fashion given the level of diversity in NPUs on the market today
… so basically agree
Mike: if we had sufficient data to show that device type selection is necessary, we would be more open to it (as CoreML allows)
… our proposal is that initially we remove the concept, with an openness to reconsider it based on data
RafaelCintron: the Windows ecosystem is a lot more heterogeneous, so not having device selection feels even harder there
… it's hard for the browser to make a decision since the model is only known once the data is buffered
… maybe this could be done with a different API shape
… Re privacy, how different is it for WebGPU? it seems you could do the same with increasingly complex shaders
Mike: you can hide capabilities; you can't fully prevent it, but there are protections that have been added to WebGPU to mitigate the trivial privacy attacks, and we would like to see them in WebNN as well
Bryan: MLTensor heavily relies on the device type
… we have scenarios to re-use tensors between input/output
… I'm wondering how far heuristic control can go in allowing the proper allocation of memory resources
… I fear this will lead to unnecessary reallocations/copy
asully: the MLTensor explainer has an open question; an MLTensor is tied to an MLContext, tightly bound to a device
… if we change this, we will have to change the semantics of MLTensor, similarly to an MLGraph
… so there are solutions for that if we rescope the tensor to the graph
RafaelCintron: how do you share input/output tensors in that situation?
asully: that might lead to data copies
ningxin: facial recognition typically uses face detection then recognition through 2 different models
asully: there may need to be a way to declare that graphs share buffers
reillyg: if we switch to only power preference and possibly a prefer-cpu setting
… and a WebGPU device for interop
… ensuring consistent data sharing with the GPU,
… then it's up to the UA to deal with data placement
… not forcing developer decisions on graph placement if they don't have to sounds like an improvement
… I think the UA can make the right choice
asully: if it has the right information, yes
reillyg: if you create two graphs in a single context, this would hint they should run close together
kenji_baheux: there may be cases where you want to avoid using the GPU (e.g. because it's used for higher priority tasks), would there still be a way to indicate that?
anssik: which issue should we continue this discussion in?
asully: #749 is a good candidate
<gb> Issue 749 MLContextOptions.deviceType seems unnecessary outside of conformance testing (by mwyrzykowski) [device selection]
reillyg: #302 also exists, but is more vague
<gb> Issue 302 API simplification: context types, context options, createContext() (by zolkis) [v2] [device selection]
Customer feedback & collaborations
Universal Large-Language Model Deployment with ML Compilation
Slideset: https://
reillyg: are you looking at implementing support for compiling to the WebNN API? Any feedback on the API capabilities and what you might need to support it as a backend?
tianqi: so far we've been focusing on WebGPU backend; interop between WebNN & WebGPU would help ensure one doesn't block the other
… we've been looking at getting the compiler to generate JS using the WebNN API
reillyg: re WebGPU & WebNN, is your goal to implement non-WebNN operators using WebGPU?
tianqi: we want to be able to partition the tasks flexibly across the two
Mike: as you continue your work toward adopting WebNN, it would be great to provide feedback to the group, incl comparison with other backends in terms of performance
tianqi: +1 ; would love to see more contributions; WebGPU has been very useful to us, would be great to see the same with WebNN
anssik: the open source projects can be found under the mlc github org
https://
[Unity][Tutorial] TVM Unity BYOC
Transformers.js WebNN backend
[@@@ pre-recorded video by Joshua Lochner]
ningxin: webnn support is a planned feature of transformers.js v3, with the upcoming origin trial an opportunity to get feedback
mccool: should we prioritize discussion on dynamic shapes based on that input
ningxin: developers can override the dimensions to adapt e.g. to the camera size
… support for static k-v cache is an open issue in the Transformers project afaik
ONNX Runtime Web & WebNN EP
ningxin: one topic covered by MLTensor is the capability to support ONNX models with external data
… weights are kept in an external data file
… we're enabling the WebNN Execution Provider to support that
… we want to reduce peak memory consumption since it's very high and sometimes hits the limits
… there are different approaches under discussion to solve this, up to streaming directly network data to the memory
… ONNX with external data is a significant use case for supporting models with big weights
RafaelCintron: the ONNX backend team is happy for the contributions; they expressed some concerns for the lack of dynamic shapes in WebNN
McCool: the MLTensor prototype has strong typing which may impact the ability to stream data
reillyg: for CoreML / TFLite, the model is essentially streamed to a file
McCool: allowing streaming from disk is useful to limit the impact on system memory
Google Chrome Feedback revisited
<gb> Issue 453 Google Chrome Feedback on WebNN: aiming for broad device coverage and maintainability (by vsekhar) [process] [opset] [use case]
<McCool> Re streaming, see for example G10; another is NanoFlow
reillyg: a year ago, as we started looking at WebNN, we provided a high level set of feedback on the API
… most of this feedback is either already integrated, tracked in other issues, or will be answered during Origin Trial
… so we're OK with closing issue #453; we're happy to see progress of implementations across platforms
… we'll see how well developers can leverage it cross platforms during Origin Trial
asully: still some skepticism around the long term viability of high level operators
reillyg: (this is tracked in a specific issue)
anssik: could you link these specific issues from #453 and then close the issue?
… thanks for providing that feedback and for the clarity it brought in setting goals for the work
reillyg: we've seen the progress we want to see on the spec and our issues, so this encompassing issue no longer feels needed
Other standards positions
<gb> Issue 763 Request standards positions from Mozilla and WebKit (by reillyeon) [process]
Anssi: Mike, can you get a WebKit standards position on WebNN?
Mike: will do
Anssi: the more actionable the feedback, the better
Mike: we're reviewing the specification; the DeviceType was the main thing we had found objectionable
… we still need to do more work of mapping to our data framework
… work in progress
… I'll post a request for a standards position for WebKit
Anssi: the Mozilla rep Tarek isn't at TPAC this year, but we can work with him to file this
Dom: we can also file this as a WG
RafaelCintron: speaking for Edge, we're fully supportive of the work
reillyg: we're supportive of the work; can't commit to shipping yet, but are looking forward to lessons from the origin trial
Interop and cross-group coordination
Interop issues across different backends
<gb> Issue 739 Limited support for pad on CoreML backend (by philloooo) [operator specific] [interop]
<gb> Issue 128 WebNN should support int8 quantized models (by wchao1115) [v2] [opset] [feature request]
<gb> Issue 180 Should dilated pooling be supported (by fujunwei) [operator specific] [interop]
ningxin: one category I want to highlight is handling failure behavior
… e.g. out of bound indices for gather/scatter
… out of bound errors may create memory issues
<gb> Issue 486 Add "implementation consideration" about how out-of-bound indices of Gather/Scatter should be handled (by huningxin) [operator specific] [interop]
ningxin: what should the behavior be in this situation? the underlying platforms may have different approaches (error, clamping, normalized)
… some native APIs mention this behavior as simply undefined
… (beyond memory safety)
<gb> Issue 691 Divide-by-zero outcome should be standardized (by huningxin) [interop]
ningxin: in some cases, this may vary across hardware vendors
… #487
<gb> Issue 487 Should `axes` be required parameter for layerNormalization build method? (by huningxin) [operator specific] [interop]
ningxin: #481
<gb> Issue 481 Should `scale` and `bias` be required inputs for `batchNormalization` op? (by huningxin) [operator specific] [interop]
ningxin: these two issues are about the optional attribute - we set a default value for them, but the actual default may vary across platforms
… for optional operands of #481, e.g. batchNormalization
… when they're not present, the implementation has to provide a default value which increases the complexity of implementation
… the question is whether we should make them required and leave the cost to the framework
… some native platforms support the optional operand concept, e.g. CoreML
… I propose to close #383
<gb> Issue 383 Need to restrict the value of alpha to be positive for elu operation (by lisa0314) [operator specific] [interop]
ningxin: with the input we got, we concluded that there is no such restriction
… CoreML and DirectML docs say so explicitly
… TFLite doesn't support the alpha parameter but can be emulated
… we already proposed to close #324
<gb> Issue 324 Simplify the operand layout support of conv2d and pooling 2d operations (by huningxin) [feature request] [operator specific] [interop]
anssik: any objection to close #383?
… so we can close it
asully: we should try to avoid as much implementation defined behavior in the spec as possible
… in particular for situations that would end up with very different behaviors across platforms, e.g. #486
<gb> Issue 486 Add "implementation consideration" about how out-of-bound indices of Gather/Scatter should be handled (by huningxin) [operator specific] [interop]
<Zakim> reillyg, you wanted to discuss defaults.
asully: we should define a behavior implementable across platforms, even if it comes with some wrapper cost in some platforms
reillyg: the baseline is preventing any out-of-bounds memory access
… the spec should define a common behavior
reillyg: +1 to avoiding implementation defined behaviors
… defaults in the graph builder API should be based on developer ergonomics
… there are cases where explicitly choosing NOT to have defaults provides a better developer experience
… this should be looked at on an op-by-op basis
… we shouldn't simply inherit the backend default
RafaelCintron: +1 to both
<RafaelCintron> https://
<RafaelCintron> https://
RafaelCintron: if we can do it performantly, we should do it
… both WebGL and WebGPU have very similar things
… WebGL leaves implementation flexibility in how to deal with out of bound situations
… WebGPU also has a list of potential behaviors
… for us, if we can clamp performantly, that's the best
reillyg: the priority of constituencies should be security, conformance and performance-if-it-really-matters
McCool: looking at a couple of cases: scattering out of bounds doesn't matter; but gathering out of bounds is problematic - we should discuss how it is handled
asully: re throwing runtime exceptions, there is no cross-platform way to throw that kind of exception, e.g. from a GPU
… I would be more supportive of using default values
Dwayne: is there any precedent for this we could use?
reillyg: for divide by zero, there was a suggestion to look at what different hardware does
asully: we probably want to ensure we always avoid runtime exceptions
reillyg: we need to measure the perf impact of clamping
dwayne: for out of bound, the two options are clamping or returning 0/NaN
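A plain-JavaScript illustration of the two candidate behaviors dwayne lists for out-of-bounds gather indices, purely to make the options concrete (not spec text):

  function gather1d(input, indices, mode = 'clamp') {
    return indices.map((i) => {
      if (i >= 0 && i < input.length) return input[i];
      if (mode === 'clamp') {
        // clamp out-of-bounds indices to the valid range
        return input[Math.min(Math.max(i, 0), input.length - 1)];
      }
      return 0;  // 'zero' mode: out-of-bounds reads produce 0 (or NaN for floats)
    });
  }

  console.log(gather1d([10, 20, 30], [0, 2, 5], 'clamp'));  // [10, 20, 30]
  console.log(gather1d([10, 20, 30], [0, 2, 5], 'zero'));   // [10, 20, 0]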
McCool: we should clarify what platforms mean by undefined behavior for scatter (is it undefined order?)
… non-deterministic atomic order
ningxin: for non-deterministic hardware, should we simply ensure safety and not try to define anything beyond that?
reillyg: we should see if that introduces a fingerprinting surface
mccool: which would be really expensive to fix performantly
Core operator set
<gb> Issue 573 Core operator set (by philloooo) [question] [opset]
reillyg: it seems like there is a minimum core operator set we can define (with data types) with some additional research
… there is an underlying question between high-level/low-level operators (the former being decomposable into the latter)
… but that can be dealt with as we compose that core operator set, following our discussions on opLimits
… what we consider core will evolve based on what we see as needed by modern models
… we should have a good definition of what it takes to add an operator to the spec, including to the minimum supported set
dwayne: a primitive is one that cannot be further decomposed
reillyg: but do we want to add all those that cannot be decomposed?
… criteria for inclusion should be "can be implemented on at least two platforms across several data types"
dwayne: I want to support being proactive in adding operators that are available cross-platform
reillyg: we should look also at TOSA, LINALG
ningxin: these two are good targets since they're used by hardware vendors as target
NeilT: we should look at MLIR
asully: the other side of the question is what to do with the non-core operators
jsbell: how do we reflect that in practice in the spec? do we categorize operators?
reillyg: @@@
dom: re beyond core, I think this will be driven by implementation pressure to limit operators to those that actually provide a performance boost for enough platforms/frameworks
reillyg: getting implementation feedback across platforms and engines will provide useful push-and-pull
… if it can be implemented by two different engines, it can be in the spec; if it can be implemented everywhere, it can be in the core set
… it's likely that any op can be implemented everywhere, but not for all data types
… that's where the constraints come in
anssik: implementability across backends may be more critical than across engines
reillyg: re high-level vs decomposed low-level operators, this will likely be based on collecting performance data in practice
ningxin: the high level ops come from optimized support in some of the native platforms
… but this could be replaced by the backend compiler detecting and applying the optimized path
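To make "decomposable into primitives" concrete, a hedged sketch of a higher-level op, softplus(x) = ln(1 + exp(x)), expressed with lower-level builder ops; descriptor field names follow the spec at the time and the example is illustrative only:

  const context = await navigator.ml.createContext();
  const builder = new MLGraphBuilder(context);

  const x = builder.input('x', {dataType: 'float32', dimensions: [1, 1000]});
  const one = builder.constant(
      {dataType: 'float32', dimensions: [1]}, new Float32Array([1]));

  // softplus decomposed into exp, add (broadcasting the scalar one) and log
  const softplus = builder.log(builder.add(builder.exp(x), one));
  const graph = await builder.build({softplus});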
dom: for some operations, there may also be an engine-specific stance on implementability for other reasons (e.g. fingerprinting)
MLTensor
PR #754
<gb> Pull Request 754 Add MLTensor explainer (by a-sully) [webgpu interop]
asully: open to merging this explainer soon, although the recent discussion on MLDeviceType will impact it
… since it affects the buffer allocation
… the big open questions are about WebGPU interop
… how to share a buffer between WebNN & WebGPU (e.g. for video feed processing)
… how efficient can this be made? on systems where this can be expressed as GPU commands, this is simple
… when it's not possible, we might have to use the CPU for synchronization (with a performance hit)
… the goal should be the UA has enough information to appropriately allocate the buffer
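For orientation, the usage shape proposed in the MLTensor explainer (PR #754): tensors created on the context, a compiled graph dispatched against them, results read back. Method names and descriptor flags follow the explainer and may still change; `graph` is assumed to come from an earlier MLGraphBuilder.build() call.

  const inputTensor = await context.createTensor(
      {dataType: 'float32', dimensions: [1, 3, 224, 224], writable: true});
  const outputTensor = await context.createTensor(
      {dataType: 'float32', dimensions: [1, 1000], readable: true});

  context.writeTensor(inputTensor, new Float32Array(1 * 3 * 224 * 224));
  context.dispatch(graph, {input: inputTensor}, {output: outputTensor});
  const result = await context.readTensor(outputTensor);  // ArrayBuffer of results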
dom: I don't think we should block on the device type discussion before merging the explainer
RafaelCintron: GPUExternalBuffer as a type is still an open question (in comparison to GPUBuffer)
asully: it would have more restrictions than a simple GPUBuffer
… I'd be happy to simply use GPUBuffer
McCool: rather than talking about device types, we should talk about which graphs they're communicating to
RafaelCintron: so long as we don't regress scenarios where connecting several graphs end up creating copies
… being able to ensure graphs use the same resource domains
reillyg: with the device type discussion we had earlier, my proposal is that MLContext becomes that resource domain
… if you create an MLContext and create a bunch of graphs, they should share that resource domain, and the buffers should be accessible to these graphs
… are there really cases where you would have buffers before having your models ready to execute?
RafaelCintron: there is at least the constant buffer case as an exception; but for input/output you're probably right
reillyg: yeah, I would keep the MLConstant operand upload as a separate topic
ningxin: there are complex scenarios where a model output can be used by two models that are running in different devices
reillyg: an MLTensor can only be used by graphs created by the same context today
… we can move them together from one unit to another, but they're not expected to be split
… an MLContext represents resources that can be cheaply shared with each other
<gb> Issue 760 Support building graphs from `MLTensor` containing constants (by bbernhar) [feature request]
Anssi: so let's proceed with merging the explainer and iterate from there
reillyg: I'd like to understand the relationship between this and the work that Austin has been doing in the Chromium prototype to eagerly upload constants to the GPU process
… if the goal is to allow to stream constants into the graph builder before you call build
… then that's an implementation detail that can be added to the existing constant function
… what that doesn't cover is a case of a constant being reused by multiple graphs
… is that your specific use case?
BryanB: correct
reillyg: if you're compiling the graph with the constant values, that compilation may optimize the constant values and change them - can they still be shared then?
… each graph might optimize operators differently and use different constant values
BryanB: the unoptimized copy might be re-used for another build
RafaelCintron: the reason we want to do this specifically is because we found scenarios where buffers need to be shared across graphs
… this came up in a discussion with an MLBuilder that would be used to create two graphs
… (which is the alternative approach to defining a new MLTensor type)
McCool: constant operands would also be useful to cache a model across different contexts
… if I have a big model that I want to run in a WebGPU execution *and* in a WebNN execution
… I wouldn't want to have to re-download the model
reillyg: there are two parts: sharing constant data between a graph you built on WebNN on one web site and on WebGPU on another site goes beyond SOP limitations; and we already have an explicit assumption that graph building can create an optimized copy, destroying the original
McCool: how important is it to get interop between WebNN & WebGPU implementations?
asully: different platforms have different handling of constants (e.g. they're handled as a file on CoreML)
… MLTensor would be used for input/output; sharing that data between WebGPU and WebNN is out of scope
reillyg: right - they would have to handle as input, which may come with a performance cost
ningxin: this emerged when we restricted MLBuilder to use a single graph
… the reason for this was to ensure the timely release of resources
… doesn't destroy help with resource lifecycle?
reillyg: the ability to constrain how resources get consumed by frameworks is another optimization this allowed
reillyg: DirectML will copy any constant you give it
… if we started eagerly copying constants to the GPU process and having them in the processing pipeline ASAP, would it help DirectML?
RafaelCintron: if you make this "owned by DirectML"…
reillyg: in the coreml backend, we would stream constants in the weights file
… we could reuse the same file for multiple graphs
asully: so I think we can confirm there is a use case to allow re-use of constants across graphs
reillyg: e.g. an optional parameter to the creation of constants
MLConstantOperand
<gb> Issue 668 Do we need an `MLConstantOperand`? (by a-sully) [question] [interop]
<gb> Pull Request 747 Introduce MLConstantOperand (by a-sully)
<jsbell> I haven't been to the Anaheim Packing District but I've heard good things about it. I was hoping to check it out.
reillyg: a couple of pieces to this: whether we want to encode the constant parameter requirement in the API (a note on the bias parameter for convolution saying it must be a constant, either through a dedicated type or with a property on the parameter)
… we could specify that implementations do constant folding - if you take a constant and pass it to the add node with another constant, the implementation would take care of making the result a constant for the backend framework
… having an encouragement to provide a constant for these parameters and an ability for the implementation to compute the constantness of a parameter would provide both interop and performance benefits
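A small sketch of the distinction being drawn: constant() operands are known when build() runs, so under the folding proposal an implementation could evaluate constant-only expressions and hand backends a true constant, while input() operands are only known at dispatch. builder.constant()/input()/conv2d() are in the spec; the folding behavior itself is the proposal under discussion.

  const context = await navigator.ml.createContext();
  const builder = new MLGraphBuilder(context);

  const weights = builder.constant(
      {dataType: 'float32', dimensions: [16, 3, 3, 3]},
      new Float32Array(16 * 3 * 3 * 3));
  const bias = builder.constant(
      {dataType: 'float32', dimensions: [16]}, new Float32Array(16));
  const half = builder.constant(
      {dataType: 'float32', dimensions: [1]}, new Float32Array([0.5]));

  // constant * constant: foldable at build time into a true constant bias
  // for backends (e.g. CoreML) that require one.
  const scaledBias = builder.mul(bias, half);

  const x = builder.input('x', {dataType: 'float32', dimensions: [1, 3, 224, 224]});
  const y = builder.conv2d(x, weights, {bias: scaledBias});
  const graph = await builder.build({y});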
dwayner: what does it mean to be a constant to CoreML? is it limited to CPU or also GPU?
reillyg: it needs to be present at the time the graph is created
dwayner: could it be created as an MLTensor?
asully: the content of the constant needs to be known at compile time
dwayner: I see the value of being able to query a constantness property, whether it's required or for perf improvement
… I'm not sure it needs to be exposed to the API
… I'm not entirely confident that constant folding would solve all the cases I saw in my research
… the emulation in most cases would be adding one op
… I would like to minimize the number of cases where we require constantness
reillyg: this ties in with the question of sharing constants
… my intuition would be to assume constantness until we find a reason not to
… it's easier to change a requirement from const to non-const than the reverse
dwayner: but again, except for lstm and gru, it's only one op to emulate
reillyg: if it's only limited to a small number of CoreML ops, I agree we could just decompose
asully: all the CoreML ops that require constantness are high level ops that need decomposition in the spec in any case
Implementation plans and trials
anssi: we're iterating at CR stage, which is a call for implementation; we have implementations across 3 backends, one of which implements the full API, with the other backends a bit behind
… multiple OSes, with the API behind a flag in Chrome and Edge
Mike: we would be interested in data on which models run on the NPU across a wide range of devices, vs falling back to CPU/GPU
reillyg: is there a way to detect on CoreML whether something ran on the NPU?
Mike: I think so but will double check
anssik: figuring out the right metrics for the origin trial is important
rafael: perf comparison, compilation time, top 40 operators and spread of usage, context loss, memory usage
… we need to have an API that can remain stable for a few months to collect useful data
… the WebGPU OT lasted several months, with multiple breaking changes which wasn't ideal
reillyg: the expectation is that we're not going to ship at the conclusion of the OT; it's a data gathering experiment and the API would be turned off at the end of the period
… we expect that most developers will be using frameworks which will get back to using CPU or GPU delegates after the period
RobKochman: we have to think through what the developers would actually do and what success would look like for them
McCool: do we want to collect data on which models?
rafael: we wouldn't know it from telemetry but through surveys
jsbell: we want developers to do A/B testing across the different execution providers, since I'm not sure we could tease that out on our end
mccool: maybe the frameworks could help with the A/B testing
Advancement on the W3C Rec Track
Wide review status
<gb> … [Topic: Machine Learning] [Focus: Web architecture (pending)] [Focus: Security (pending)] [Focus: Privacy (pending)]
Anssik: I suggest we integrate the responses Rafael and reillyg gave as non-normative text in the spec
… the TAG is also asking about future-proofing the API against hardware evolution
reillyg: clearly we have thought about it, we still need implementation feedback to determine whether our solution works
anssik: the TAG is also asking about multi-platform/multi-engine implementations; we have multiple backend implementations, all major browser vendors in the group, 3-ish OS support
rafael: +1
W3C “living standards” expectations
dom: there's no "living standard" stamp at W3C, you can remain at CR stage as long as the WG is operating
… either you stay at CR and every two years you publish a CRS and go through wide review
… iterate on 2-year cycle
… the more traditional path is to go to Recommendation status, which requires going from CR to Recommendation by demonstrating interop experience across all features in 2 or more impls
… rarely everything is perfect, but you need to demonstrate the standard is interoperable at this stage
… my personal perspective is that going that final step, as painful as it is, helps ensure convergence
… and reflects what end users need from the technology
… WebNN now has a significant number of ops; the risk if we stay iterating at CR is we never take a knife to carve out those ops that don't get enough implementation experience
… two engines willing to ship across OSes, backends
… without going to REC, we could iterate on CR
anssik: WebRTC Recommendation experience?
dom: adding post corrections is cheap from process perspective
… I'm driving this in WebRTC WG where we made sufficient progress that can say it works well for that group and could work here too
jsbell: thanks dom this is what I wanted to learn
… we don't need to make this decision now
Incubations
Custom ops
<gb> Issue 559 Control flow operations: if, while (by philloooo) [opset] [feature request]
<Zakim> reillyg, you wanted to ask whether any current backends support building custom ops using a subgraph.
reillyg: providing more structure to the underlying compiler is likely to produce good results
… however, I'm not sure the backends we've been prototyping with currently provide this support
ningxin: I think CoreML does
dwayne: this maybe could be made to work with DirectML
reillyg: but this would require pattern matching
dwayne: but a very simple pattern matching
reillyg: but it risks creating performance cliffs; a smarter compiler would make me feel more confident
asully: from a Web platform perspective, the question is whether we need to be able to provide hints e.g. reusable subgraphs, that can be passed to the lower level
… where that is implemented should remain an implementation detail
jsbell: very cool proposal
<jsbell> https://
jsbell: the StableHLO approach is very similar and may be a way to annotate the subgraphs
asully: there are aspects of the StableHLO compositor we wouldn't want to expose to the Web
… we wouldn't want to have magic values everywhere - we would have to consider the decomposition, not doing string matching
gdti: cool proposal, two points: one minor: the PoC compares the WASM built-in function to standard WASM, so the perf isn't representative of the fallback path given the cost of going from WebNN to WASM
… from a Web platform perspective, exposing this fully to the Web might be too challenging to maintain
asully: with this custom op proposal, would this open the way to remove from the spec many of the WebNN high level operators which can be expressed in terms of lower level WebNN ops?
… I'm supportive of making the API more focused on lower level APIs à la MLIR
Built-in APIs for translation and prompting
Slideset: https://
<kenji_baheux> Minor correction, the 13 partners [...] numbers are for what preceded the early preview program (a more tight early exploration with a few partners). The early preview program has orders of magnitude more participants.
<Domenic> https://
ningxin: can a developer expect to get the same level of performance from these tasks API and WebNN?
domenic: the task API would probably hit the hardware more directly so might have a bit of a perf advantage, but I assume the goal for WebNN is to make this imperceptible
Model management
(Alternative 5)
<kenji_baheux> also, if it's rare enough, the likelihood of benefitting from sharing it across origins should be low (for most users).
McCool: I will be presenting a breakout session on this on Wednesday.
Wrap up
anssik: thank you for your active participation and great discussions
… interested folks are welcome to join us for a dinner at Anaheim Packing District 2.5 miles from the meeting venue