W3C

– DRAFT –
WebML WG Teleconference – 17 October 2024

17 October 2024

Attendees

Present
Anssi_Kostiainen, Austin_Sullivan, Dwayne_Robinson, Etienne_Noel, Michael_McCool, Natasha_Gaitonde, Ningxin_Hu, Zoltan_Kis
Regrets
-
Chair
Anssi
Scribe
Anssi, anssik

Meeting minutes

Repository: webmachinelearning/webnn

anssik: I hope you had a great time at TPAC!
… today, we'll resume our bi-weeklies with a refresher on selected resolutions and proposals from the F2F, and discuss our next steps
… but first, I'd like to welcome our most recent new participants:
… Talha Gorsi and Robert Simpson from Qualcomm
… Sohum Chatterjee from Microsoft
… welcome to the WebML WG!

WebNN Operator Update Wave 3

Slides

F2F minutes

RESOLUTION: Update spec with Wave 3 operators, initiate int4/uint4 wide review.

anssik: Dwayne's Wave 3 plan as documented in his slides received the group's support
… the editors can now start formulating the Wave 3 changes as a spec PR
… all-in-one PR or multiple smaller PRs, as long as each is self-contained

Dwayne: prefer one PR initially

<ningxin> sgtm

Dwayne: everyone on the call has seen the slides; one interesting update: all Wave 3 ops are in Chromium now
… WebNN EP to validate this Chromium work
… 8 remaining ops out of 12 there
… no new ops in mind beyond Wave 3 for now

anssik: I think the group should check in with the TAG on the int4/uint4 types
… the plan is to expose these 4-bit types through Int8Array using a packing approach, since there's no Int4Array in JS

https://tc39.es/ecma262/multipage/indexed-collections.html#table-49
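
For illustration, a minimal sketch of such packing (an assumption, not spec text; the low-nibble-first order and the packUint4 helper name are illustrative, and Uint8Array stands in here for the proposal's Int8Array):

    // Pack uint4 values two-per-byte, low nibble first; a hypothetical
    // helper, not part of the WebNN proposal itself.
    function packUint4(values: number[]): Uint8Array {
      const packed = new Uint8Array(Math.ceil(values.length / 2));
      values.forEach((v, i) => {
        const nibble = v & 0xf; // keep the low 4 bits
        packed[i >> 1] |= i % 2 === 0 ? nibble : nibble << 4;
      });
      return packed;
    }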

anssik: we could land our int4/uint4 proposal in the spec and ask the TAG to review it; no need to block landing the spec PR on the TAG review, given we're doing a CR Draft here

McCool: bitwise ops?

Dwayne: need to consider CoreML compat
… MikeW shared that it might be something they could add in the future, but it could take a while; meanwhile we need an emulation path
… via BNNS or MPS

McCool: in theory could pack them on the CPU

Quantization and dequantization (QDQ)

F2F minutes

RESOLUTION: Add QDQ operators for int8 and int4 and consolidate #93 #128 #623.

<gb> Issue 623 WebNN should support NPU and QDQ operations (by wchao1115) [v2] [opset] [feature request] [device selection]

<gb> Issue 93 Add QuantizeLinear and DequantizeLinear for mixed precision (by kpu) [opset] [feature request]

<gb> Issue 128 WebNN should support int8 quantized models (by wchao1115) [v2] [opset] [feature request]

anssik: based on our F2F discussion, the challenge is how to represent int4 on the web and across backends as discussed in context of Wave 3

anssik: it seems int4 and uint4 support for quantizeLinear and dequantizeLinear landed very recently in Chromium, any learnings to share?

https://chromium-review.googlesource.com/c/chromium/src/+/5922495

Ningxin: we learned, when testing some models, that we need to support int32 for the dequantize operator
… we exercised some models with bias in int32

Dwayne: the other aspect is the block size consideration: when dequantizing, if the higher level has a concept of block size, the tensor has to be expanded to full size before passing to WebNN, which uses a lot of GPU memory

<ningxin> One example model using int32 dequantize operator: https://huggingface.co/webml/models/resolve/main/int8/resnet50-v1-12-qdq.onnx
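
To make the semantics under discussion concrete, a sketch of the usual affine dequantization y = (x - zeroPoint) * scale, here with an int32 input as in the bias case Ningxin mentions (illustrative, not spec text; per Dwayne's point, per-block scales would first be expanded to full size before this elementwise form applies):

    // Elementwise affine dequantization: y = (x - zeroPoint) * scale.
    // int32 inputs (e.g. biases) motivate supporting more than int8/int4.
    function dequantizeLinear(
        x: Int32Array, scale: number, zeroPoint: number): Float32Array {
      const y = new Float32Array(x.length);
      for (let i = 0; i < x.length; i++) {
        y[i] = (x[i] - zeroPoint) * scale;
      }
      return y;
    }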

anssik: roll these changes into the Wave 3 PR?

Dwayne: yes, fold into Wave 3 PR

Austin: the CoreML backend has some limitations, int4 needs to be emulated
… three variants of quantization are available

Dwayne: the spec as defined may need some updates after TF and CoreML backend implementation experience

Platform capability detection

F2F minutes

anssik: issue #463 and PR #755

<gb> MERGED Pull Request 755 Define opSupportLimits() (by philloooo)

<gb> CLOSED Issue 463 Allow checking whether operators/types are supported for a backend before creating a graph (by huningxin) [feature request]

anssik: Proposal: Collect real-world feedback from opSupportLimits() usage to inform baseline and max limits.

Dwayne: data type support is already in, max limits are the remaining work

McCool: max dimension limits for tensors?

Dwayne: some hardware does have such limits

McCool: for scatter and gather, how big an index is warranted?

<asully> https://www.w3.org/TR/webnn/#valid-dimension

Austin: the spec defines a valid dimension
… CoreML and TF indices are primarily int32
… trying a model with int64 indices would not work on TF or CoreML
… thus int64 indices should not be necessary
… having a common set that works everywhere would mean indices have to be int32

Dwayne: max dimensions, should they be exposed via opSupportLimits? Or via data types?

Austin: max rank should be exposed via opSupportLimits()

anssik: does ORT exercise the opSupportLimits() API?

<dwayner> "Use opSupportLimits to dynamically check data type support" microsoft/onnxruntime#22025

<gb> MERGED Pull Request 22025 [WebNN EP] Use opSupportLimits to dynamically check data type support (by Honry) [ep:WebNN]
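
As a rough sketch of the pattern that PR describes (the dictionary shape follows the spec's opSupportLimits() definition; treat the exact field names as assumptions):

    // Query per-op data type limits before attempting to build a graph;
    // field names assume the MLOpSupportLimits shape in the spec.
    const context = await navigator.ml.createContext();
    const limits = context.opSupportLimits();
    const argMaxOutputTypes = limits.argMax?.output?.dataTypes ?? [];
    if (!argMaxOutputTypes.includes('int64')) {
      // e.g. fall back argMax to CPU, as the WebNN EP does.
    }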

Ningxin: we have the WebNN EP implemented for data types; some ONNX models require int64, in which case the WebNN EP will fall back argMax/Min to CPU
… re maximum rank, I would like to prioritize that, because the SegmentAnything model has high-rank tensors (>5)
… this model will fail without high-rank tensor support

<dwayner> Yes, it's a 6D transpose (pattern found in a few models).

Austin: we discussed a minimal set that should be supported everywhere, with opSupportLimits() as an optional feature on top

Device selection abstractions

Slides

F2F minutes

anssik: Proposal: Draft a spec PR to remove MLDeviceType, keep MLPowerPreference. Gauge prototyping interest and impact on frameworks.
… there was support from Reilly and Mike for removing the explicit device type
… Rafael notes the Windows ecosystem is more heterogeneous, so not having a device selection mechanism makes things harder

ningxin: my open question regarding removal is, can the developer still select a particular device type, e.g. CPU, if we leave in only the power preference

zkis: the question is, do we have other use cases where the application wants to control that certain models run on low power?
… we need more use cases where application control is relevant

Dwayne: Apple's concern was implementability as a hard requirement
… for anybody wanting to remove this, I would like to consider scenarios like desktops with two GPUs
… or NPUs being faster than the GPU; does power preference express that?
… or a scenario where the GPU is already busy, thus NPU usage is preferred for a better user experience

Austin: from my perspective, I'd like to see this be a requirement, but we would like to go to OT in Chromium in the next couple of milestones, and I don't see us changing the semantics of this option before that OT
… from the Chrome team's perspective, we're eager to get this into the hands of real users to understand how they feel about this option
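
For reference, a hedged sketch of context creation before and after the proposed change (removal is still a proposal; the option names are from the current spec):

    // Today: an explicit device type can be requested.
    const gpuContext = await navigator.ml.createContext({ deviceType: 'gpu' });

    // Under the proposal: only a hint remains; the implementation
    // picks CPU, GPU, or NPU.
    const context = await navigator.ml.createContext({ powerPreference: 'low-power' });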

Google Chrome feedback revisited

F2F minutes

RESOLUTION: Close Chrome feedback #453 as completed.

<gb> Issue 453 Google Chrome Feedback on WebNN: aiming for broad device coverage and maintainability (by vsekhar) [process] [opset] [use case]

Austin: I can close soon.

Interop issues across different backends

F2F minutes

"interop" issues

anssik: Proposal: Revisit interop issues, e.g. remove pooling's rounding direction from MLPool2dOptions to close #324, and decide between the clamp or "ignore the index" approach to close #486.

<gb> Issue 324 Simplify the operand layout support of conv2d and pooling 2d operations (by huningxin) [feature request] [operator specific] [interop]

<gb> Issue 486 Add "implementation consideration" about how out-of-bound indices of Gather/Scatter should be handled (by huningxin) [operator specific] [interop]

anssik: there seem to be a few issues where we have agreement on the solution
… Ningxin, any other interop issues where group feedback is required to make progress?

Ningxin: regarding interop issues in Wave 3, Jiewei reported that gather elements has no direct mapping on TFLite; not an interop issue per se, but it needs an emulation path for those backends
… I'll check with Jiewei; currently documented in the Transformers issue #375

<gb> Issue 375 Support for transformers (by dontcallmedom) [v2] [opset]

Austin: prefer a separate issue
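
For context on #486, the "clamp" alternative would bound each out-of-range index into the valid range; a minimal sketch (illustrative, not the spec's wording):

    // Clamp a gather/scatter index into [0, axisSize - 1], the "clamp"
    // option discussed for issue #486.
    function clampIndex(index: number, axisSize: number): number {
      return Math.min(Math.max(index, 0), axisSize - 1);
    }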

Core operator set

F2F minutes

anssik: Proposal: Document requirements for adding a "new core op" and a "non-core op" (consider e.g. TOSA, MLIR linalg), and categorize ops in the spec.

anssik: issue #573

<gb> Issue 573 Core operator set (by philloooo) [question] [opset]

anssik: per F2F discussion, it looks like the group had consensus to document what it takes to add an operator to the spec
… and see if the guidelines for adding "high-level" (decomposable) vs "low-level" ops should be different

https://github.com/webmachinelearning/webnn/blob/main/CONTRIBUTING.md#proposing-and-adding-a-new-operation

Dwayne: we wanted to define an op as low-level if it cannot be decomposed further

ningxin: I agree; as discussed in Wave 3, we probably want to compare TOSA and linalg to find gaps and useful primitives the WebNN spec is missing, and also address high-level ops, to see if some can be removed when expressible with primitives
… optimized implementation via fusion
… custom ops using primitives
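
As an illustration of expressing a high-level op with primitives, a sketch of softmax decomposed via the MLGraphBuilder API (the decomposition itself is illustrative, not the spec's normative steps; type names follow the WebNN IDL):

    // softmax(x) = exp(x - max(x)) / sum(exp(x - max(x))) along axis 1,
    // built entirely from lower-level WebNN ops.
    function softmaxDecomposed(builder: MLGraphBuilder, x: MLOperand): MLOperand {
      const max = builder.reduceMax(x, { axes: [1], keepDimensions: true });
      const exp = builder.exp(builder.sub(x, max));
      const sum = builder.reduceSum(exp, { axes: [1], keepDimensions: true });
      return builder.div(exp, sum);
    }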

anssik: are the existing guidelines still valid?

Dwayne: seem useful to me

Austin: hoping one day we can remove the guidelines because we have all we need :-)

MLTensor

F2F minutes

anssik: Proposal: Merge the explainer PR with an agreement on how MLDeviceType changes impact buffer allocation.
… it's great to see Corentin from the WebGPU group continue to review the MLTensor explainer and provide insights
… Austin, what are the remaining open questions from Corentin?

Austin: the remaining thing to address: from the WebNN perspective allocating a tensor is opaque, an easy guarantee to give for reading and writing, but if you hand the buffer out to WebGPU you cannot give the same guarantee
… agreed with Corentin that we need to expose the layout of the buffer to the developer; will work on incorporating that feedback into the PR

Dwayne: strides between dimensions can have gaps?

Austin: potentially; thinking naively, what does exposing a layout mean? If it's one block, then expose it just as a big array

Dwayne: subtiling or blocks

Austin: something WebGPU would need to know
… IIUC on the Windows side it's always a big block

Dwayne: it's always linear on Windows

Austin: I'm hoping the same assumption holds for CoreML on Mac; if not, we need to do some further design work
… the challenge is what buffer WebGPU is given and how to read and write to it

anssik: there's a path forward, that's great

Austin: the hope is we can merge the PR soon
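
To illustrate the strides question above: with an explicit per-dimension stride list, a consumer such as WebGPU could address elements even with gaps between dimensions (a sketch under that assumption):

    // Linear offset of a multi-dimensional index given explicit strides;
    // gaps between dimensions show up as strides larger than packed size.
    function linearOffset(index: number[], strides: number[]): number {
      return index.reduce((offset, idx, dim) => offset + idx * strides[dim], 0);
    }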

Wide review: TAG

F2F minutes

RESOLUTION: Add resource contention considerations to the spec to address TAG review feedback.

anssik: I pushed PR #765 to add resource contention considerations to the spec

<gb> Pull Request 765 Add resource contention considerations (by anssiko)

anssik: thanks Reilly for your review

Tensor primitives

Slides

F2F minutes

anssik: Proposal: Continue exploring authoring high-level ops with tensor primitives.
… the goal is to demonstrate we can compose custom ops using unified tensor-level primitives
… Ningxin, would you like to get the group's input on the proof-of-concept direction?

ningxin: this topic is synergistic with the core op set discussion

zkis: the question is about composability: how can we optimize memory structure?

<ningxin> will do

Translation and Prompt APIs

Slides

F2F minutes

anssik: Proposal: Solicit input on whether to adopt these APIs as new deliverables in the WG's Charter 2025->
… would like to hear early signals for adopting the Translation and Prompt APIs into the WG
… official W3C-wide review would happen in Q1'25 when we recharter

<zkis> in the WG or the CG? what is the preference?

anssik: editor preference was the WG

<etienne> I'll be leading this area at Google so would love for this area to be part of this group.

Etienne: initially proposing the Translation and Prompt APIs for WG adoption

Natasha: want to understand customer requirements; in terms of standardization this forum is important for getting feedback from other vendors
… another feature would be storage: how origins could share, or not share, access to these large models

https://developers.google.com/privacy-sandbox/cookies/related-website-sets

anssik: hearing initial support from Google and Microsoft for adopting Translation and Prompt APIs in the WebML WG, we will solicit more input from other WG participants before we initiate official rechartering and W3C-wide review in Q1'25

Summary of resolutions

  1. Update spec with Wave 3 operators, initiate int4/uint4 wide review.
  2. Add QDQ operators for int8 and int4 and consolidate #93 #128 #623.
  3. Close Chrome feedback #453 as completed.
  4. Add resource contention considerations to the spec to address TAG review feedback.
Minutes manually created (not a transcript), formatted by scribe.perl version 237 (Fri Oct 4 02:31:45 2024 UTC).
