Meeting minutes
Repository: webmachinelearning/webnn
anssik: I hope you had a great time at TPAC!
… today, we'll resume our bi-weeklies with a refresher on selected resolutions and proposals from F2F, and discuss our next steps
… but first, I'd like to welcome our most recent new participants:
… Talha Gorsi and Robert Simpson from Qualcomm
… Sohum Chatterjee from Microsoft
… welcome to the WebML WG!
WebNN Operator Update Wave 3
RESOLUTION: Update spec with Wave 3 operators, initiate int4/uint4 wide review.
anssik: Dwayne's Wave 3 plan as documented in his slides received the group's support
… the editors can now start formulating the Wave 3 changes as a spec PR
… either an all-in-one PR or multiple smaller PRs, as long as they are self-contained
Dwayne: prefer one PR initially
<ningxin> sgtm
Dwayne: everyone on the call has seen the slides; one interesting update is that all Wave 3 ops are in Chromium now
… WebNN EP to validate this Chromium work
… 8 remaining ops out of 12 there
… no new ops in mind beyond Wave 3 for now
anssik: I think the group should check in with the TAG for the int4/uint4 type
… the plan is to expose these 4-bit types through Int8Array using a packing approach since there's no Int4Array in JS
https://
anssik: we could land our int4/uint4 proposal into the spec and ask TAG to review it, no need to block on the TAG review to land the spec PR given we're doing a CR Draft here
McCool: bitwise ops?
Dwayne: need to consider CoreML compat
… MikeW shared that it might be something they could add in the future, but it could take a while; meanwhile we need an emulation path
… via BNNS or MPS
McCool: in theory could pack them on the CPU
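A minimal sketch of the packing approach mentioned above, assuming two int4 values are stored per Int8Array byte, low nibble first; the packing order and helper names are illustrative assumptions, not proposed spec behavior:

```js
// Illustrative only: pack int4 values (-8..7) into an Int8Array,
// two values per byte, low nibble first (packing order is an assumption).
function packInt4(values) {
  const packed = new Int8Array(Math.ceil(values.length / 2));
  for (let i = 0; i < values.length; i++) {
    const nibble = values[i] & 0x0f;                 // two's-complement 4-bit value
    packed[i >> 1] |= (i % 2 === 0) ? nibble : (nibble << 4);
  }
  return packed;
}

// Unpack back to numbers, sign-extending each 4-bit value.
function unpackInt4(packed, count) {
  const out = new Array(count);
  for (let i = 0; i < count; i++) {
    const byte = packed[i >> 1];
    const nibble = (i % 2 === 0) ? (byte & 0x0f) : ((byte >> 4) & 0x0f);
    out[i] = nibble >= 8 ? nibble - 16 : nibble;     // sign-extend
  }
  return out;
}
```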
Quantization and dequantization (QDQ)
RESOLUTION: Add QDQ operators for int8 and int4 and consolidate #93 #128 #623.
<gb> Issue 623 WebNN should support NPU and QDQ operations (by wchao1115) [v2] [opset] [feature request] [device selection]
<gb> Issue 93 Add QuantizeLinear and DequantizeLinear for mixed precision (by kpu) [opset] [feature request]
<gb> Issue 128 WebNN should support int8 quantized models (by wchao1115) [v2] [opset] [feature request]
anssik: based on our F2F discussion, the challenge is how to represent int4 on the web and across backends as discussed in context of Wave 3
anssik: it seems int4 and uint4 support for quantizeLinear and dequantizeLinear landed very recently in Chromium, any learnings to share?
https://
Ningxin: we learned, when testing some models, that we need to support int32 for the dequantize operator
… we exercised some models with bias in int32
Dwayne: the other aspect is the block size consideration: when dequantizing, if the higher level has the concept of block size, the tensor has to be expanded to full size before passing it to WebNN, which uses a lot of GPU memory
<ningxin> One example model using int32 dequantize operator: https://
anssik: roll these changes into the Wave 3 PR?
Dwayne: yes, fold into the Wave 3 PR
Austin: the CoreML backend has some limitations, int4 needs to be emulated
… three variants of quantization are available
Dwayne: the spec as defined may need some updates after TF and CoreML backend implementation experience
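A minimal sketch of the dequantization arithmetic and the block-size expansion Dwayne described, assuming per-block scales and zero points; this helper is illustrative only and does not reflect the eventual WebNN dequantizeLinear signature:

```js
// Illustrative only: per-element dequantization,
//   output[i] = (input[i] - zeroPoint) * scale,
// with a scale/zero point defined per block of `blockSize` elements, which a
// backend without a block-size concept would need expanded to full size.
function dequantizeBlockwise(quantized, blockScales, blockZeroPoints, blockSize) {
  const output = new Float32Array(quantized.length);
  for (let i = 0; i < quantized.length; i++) {
    const block = Math.floor(i / blockSize);
    output[i] = (quantized[i] - blockZeroPoints[block]) * blockScales[block];
  }
  return output;
}
```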
Platform capability detection
anssik: issue #463 and PR #755
<gb> MERGED Pull Request 755 Define opSupportLimits() (by philloooo)
<gb> CLOSED Issue 463 Allow checking whether operators/types are supported for a backend before creating a graph (by huningxin) [feature request]
anssik: Proposal: Collect real-world feedback from opSupportLimits() usage to inform baseline and max limits.
Dwayne: data type support is already in, max limits are remaining work
McCool: max dimension limits for tensors?
Dwayne: some hardware does have such limits
McCool: scatter and gather, how big an index is warranted?
<asully> https://
Austin: the spec defines a valid dimension
… CoreML and TF indices are int32 primarily
… trying a model with int64 indices would not work on TF or CoreML
… thus int64 indices should not be necessary
… having a common set that works everywhere would mean indices have to be int32
Dwayne: max dimensions, should they be exposed via opSupportLimits? Or via data types?
Austin: max rank should be exposed via opSupportLimits()
anssik: does ORT exercise opSupportLimits() API?
<dwayner> "Use opSupportLimits to dynamically check data type support" microsoft/
Ningxin: we have the WebNN EP implemented for data types; some ONNX models require int64, in which case the WebNN EP will fall back argMax/Min to CPU
… re maximum rank, would like to prioritize that, because the SegmentAnything model has high-rank tensors (>5)
… this model will fail without high-rank tensor support
<dwayner> Yes, it's a 6D transpose (pattern found in a few models).
Austin: we discussed a minimal set that should be supported everywhere and use opSupportLimits() as an optional feature on top
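A minimal sketch of the kind of capability check discussed above, along the lines of the ORT guidance Dwayne linked; the dictionary member names follow the merged PR #755, and the fallback strategy shown is only an assumption:

```js
// Illustrative only: query opSupportLimits() before building a graph and
// decide whether a fallback is needed (fallback strategy is an assumption).
const context = await navigator.ml.createContext();
const limits = context.opSupportLimits();

// For example: does argMax support int64 outputs on this backend?
const argMaxOutputTypes = limits.argMax?.output?.dataTypes ?? [];
if (!argMaxOutputTypes.includes('int64')) {
  // Fall back to CPU or cast indices to int32, as the WebNN EP does today.
}
```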
Device selection abstractions
anssik: Proposal: Draft a spec PR to remove MLDeviceType, keep MLPowerPreference. Gauge prototyping interest and impact on frameworks.
… there was support from Reilly and Mike for removing the explicit device type
… Rafael notes the Windows ecosystem is more heterogeneous, which makes not having a device selection mechanism harder
ningxin: my open question regarding removal is, can we still give the developer an opportunity to select a particular device type, e.g. CPU, if we leave in the power preference
zkis: the question is, do we have other use cases where the application wants certain models to run on low power?
… we need more use cases where application control is relevant
Dwayne: Apple's concern was implementability as a hard requirement
… for anybody who wants to remove this, I'd like them to consider scenarios like desktops with two GPUs
… or NPUs being faster than GPU, does power preference express that?
… or scenario where GPU is already busy, thus NPU usage preferred for better user experience
Austin: from my perspective, I'd like to see this be a requirement, but we would like to go to OT in Chromium in the next couple of milestones, and I don't see us changing the semantics of this option before that OT
… from Chrome team's perspective, eager to get this in hands of real users to understand what they feel about this option
Google Chrome feedback revisited
RESOLUTION: Close Chrome feedback #453 as completed.
<gb> Issue 453 Google Chrome Feedback on WebNN: aiming for broad device coverage and maintainability (by vsekhar) [process] [opset] [use case]
Austin: I can close soon.
Interop issues across different backends
anssik: Proposal: Revisit interop issues e.g. remove pooling's rounding direction from MLPool2dOptions to close #324, decide between clamp or "ignore the index" approach to close #486.
<gb> Issue 324 Simplify the operand layout support of conv2d and pooling 2d operations (by huningxin) [feature request] [operator specific] [interop]
<gb> Issue 486 Add "implementation consideration" about how out-of-bound indices of Gather/Scatter should be handled (by huningxin) [operator specific] [interop]
anssik: there seem to be those few issues where we have an agreement on the solution
… Ningxin, any other interop issues where the group feedback is required to make progress?
Ningxin: there are interop issues in Wave 3; Jiewei reported gather elements has no direct mapping on TFLite, not an interop issue per se, but it needs an emulation path for those backends
… I'll check with Jiewei, currently documented in the Transformers issue #375
<gb> Issue 375 Support for transformers (by dontcallmedom) [v2] [opset]
Austin: prefer a separate issue
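A minimal sketch of the "clamp" option under discussion for issue #486, assuming out-of-bound indices are clamped to the valid range of the gathered axis; the alternative considered is to ignore such indices:

```js
// Illustrative only: the "clamp" option for out-of-bound gather/scatter
// indices, keeping each index inside the valid range of the gathered axis.
function clampIndex(index, axisSize) {
  return Math.min(Math.max(index, 0), axisSize - 1);
}

// e.g. along an axis of size 4, index 7 reads element 3 instead of reading
// out of bounds; the alternative discussed is to ignore such indices.
```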
Core operator set
anssik: Proposal: Document requirements for adding "new core op" and "non-core op", (consider e.g. TOSA, MLIR linalg), categorize ops in spec.
anssik: issue #573
<gb> Issue 573 Core operator set (by philloooo) [question] [opset]
anssik: per F2F discussion, it looks like the group had consensus to document what it takes to add an operator to the spec
… and see if the guidelines for adding "high-level" (decomposable) vs "low-level" ops should be different
Dwayne: we wanted to define an op as low-level if it cannot be decomposed further
ningxin: I agree, as discussed in Wave 3 we probably want to compare TOSA and linalg to find gaps and useful primitives the WebNN spec is missing, and also address high-level ops, to see if some can be removed when expressible with primitives
… optimized implementation via fusion
… custom ops using primitives
anssik: are the existing guidelines still valid?
Dwayne: seem useful to me
Austin: hoping one day we can remove the guidelines because we have all we need :-)
MLTensor
anssik: Proposal: Merge explainer PR with an agreement on how MLDeviceType changes impact buffer allocation.
… it's great to see Corentin from the WebGPU group continue reviewing the MLTensor explainer and provide insights
… Austin, what are the remaining open questions from Corentin?
Austin: the remaining thing to address: from the WebNN perspective allocating a tensor is opaque, which is an easy guarantee to give for reading and writing, but if you hand out the buffer to WebGPU you cannot give the same guarantee
… agreed with Corentin we need to expose the layout of the buffer to the developer, will work on incorporating that feedback into the PR
Dwayne: strides between dimensions can have gaps?
Austin: potentially; naively, what does exposing a layout mean? If it's one block, then expose it just as a big array
Dwayne: subtiling or blocks
Austin: something WebGPU would need to know
… IIUC on Windows side it's always a big block
Dwayne: it's always linear on Windows
Austin: I'm hoping the same assumption holds on CoreML on Mac, if not, we need to do some further design work
… the challenge is what is the buffer WebGPU is given and how to read and write to it
anssik: there's a path forward, that's great
Austin: hope is we can merge the PR soon
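A minimal sketch of what a "linear" layout means here, assuming a contiguous row-major buffer; a backend-specific layout with padding would use the same offset formula but with strides that no longer follow purely from the shape:

```js
// Illustrative only: contiguous row-major strides and the element offset
// formula. A "linear" buffer is one whose strides follow directly from the
// shape; a backend layout with padding between dimensions has larger strides.
function rowMajorStrides(shape) {
  const strides = new Array(shape.length).fill(1);
  for (let d = shape.length - 2; d >= 0; d--) {
    strides[d] = strides[d + 1] * shape[d + 1];
  }
  return strides;
}

function elementOffset(indices, strides) {
  return indices.reduce((offset, index, d) => offset + index * strides[d], 0);
}
```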
Wide review: TAG
RESOLUTION: Add resource contention considerations to the spec to address TAG review feedback.
anssik: I pushed PR #765 to add resource contention considerations to the spec
<gb> Pull Request 765 Add resource contention considerations (by anssiko)
anssik: thanks Reilly for your review
Tensor primitives
anssik: Proposal: Continue explore authoring high-level ops with tensor primitives.
… the goal is to demonstrate we can compose custom ops using unified tensor-level primitives
… Ningxin, would you like to get group's input on the proof-of-concept direction?
ningxin: this topic is synergistic with the core op set discussion
zkis: the questions are about composability and how we can optimize the memory structure
<ningxin> will do
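A minimal sketch of the kind of composition being explored, assuming the existing MLGraphBuilder element-wise and reduction ops; whether such a decomposition can be fused into an optimized implementation is what the proof of concept aims to show:

```js
// Illustrative only: composing a numerically-stable softmax over one axis
// from existing WebNN primitives rather than using a built-in op.
function composedSoftmax(builder, x, axis) {
  const options = { axes: [axis], keepDimensions: true };
  const max = builder.reduceMax(x, options);     // per-slice maximum
  const shifted = builder.sub(x, max);           // x - max(x) for stability
  const exp = builder.exp(shifted);              // exponentiate
  const sum = builder.reduceSum(exp, options);   // normalizer
  return builder.div(exp, sum);                  // exp / sum(exp)
}
```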
Translation and Prompt APIs
anssik: Proposal: Solicit input on whether to adopt these APIs as new deliverables in the WG's Charter 2025->
… would like to hear early signals for adopting the Translation and Prompt APIs into the WG
… official W3C-wide review would happen in Q1'25 when we recharter
<zkis> in the WG or the CG? what is the preference?
anssik: editor preference was the WG
<etienne> I'll be leading this area at Google so would love for this area to be part of this group.
Etienne: initially proposing Translation and Prompt APIs for the WG adoption
Natasha: want to understand customer requirements; in terms of standardization, this forum is important for getting feedback from other vendors
… another feature would be Storage, how origins could share or not share access to these large models
https://
anssik: hearing initial support from Google and Microsoft for adopting Translation and Prompt APIs in the WebML WG, we will solicit more input from other WG participants before we initiate official rechartering and W3C-wide review in Q1'25