W3C

– DRAFT –
WebML WG Teleconference – 4 April 2024

04 April 2024

Attendees

Present
Anssi_Kostiainen, Austin_Sullivan, Bryan_Bernhart, Christo_Bacharakis, Dwayne_Robinson, Geoff_Gustafson, Ilya_Rezvov, Joshua_Bell, Joshua_Lochner, Michael_McCool, Ningxin_Hu, Phillis_Tang, Rafael_Cintron, Zoltan_Kis
Regrets
Dominique_Hazael-Massieux, Reilly_Grant
Chair
Anssi
Scribe
Anssi, anssik

Meeting minutes

Repository: webmachinelearning/webnn

anssik: please join me in welcoming our new WG participants:
… - Vasil Dedejski from Netcetera
… - Juan Deng from Alibaba Group
… - also Michael McCool from Intel joins in an official WG participant capacity

NPU support discussion

anssik: issue #623

<gb> Issue 623 WebNN should support NPU and QDQ operations (by wchao1115) [v2] [opset] [feature request]

anssik: The goal of this discussion is to revisit the problem space for NPU support in WebNN, review the high-level proposal for the key elements of the design, and gather feedback and signals from the group regarding interest, scope, and priority.
… an NPU device type and support for quantized models have been explored in the group previously and have been awaiting implementation experience.
… earlier discussion in issues #128 and #302

<gb> Issue 128 WebNN should support int8 quantized models (by wchao1115) [v2] [opset] [feature request]

<gb> Issue 302 API simplification: context types, context options, createContext() (by zolkis) [v2]

anssik: thanks to Chai for the proposal and Yajing for comments

Dwayne: NPUs are probably familiar to most; not necessarily faster than GPUs, but more efficient
… in the spec we have GPU and CPU device types and power preferences; data types are also in the spec
… we're missing an NPU device type and the bare minimum operations, quantization and dequantization

<McCool> (when Dwayne is done)

Dwayne: linear 8-bit is typical; more aggressive schemes down to even 1-bit exist, but we think we should start with the most common ones
… device type and quantization ops are the key elements; moving forward we need to enumerate what the options would look like
… bikeshedding areas include how to express this in the context: CPU, GPU, NPU, how to express that? a primary device type and a fallback device type? a priority order?
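
For illustration, the bikeshed options above might surface in createContext() roughly as follows. This is a sketch only; nothing here is agreed, only "cpu" and "gpu" exist in the spec today, and fallbackDeviceType and deviceTypes are hypothetical names:

    // Sketch of the options under discussion; "npu" and both fallback
    // shapes below are hypothetical, not part of the current spec.
    const ctx1 = await navigator.ml.createContext({ deviceType: 'npu' });
    // a primary plus a fallback device type:
    const ctx2 = await navigator.ml.createContext(
        { deviceType: 'npu', fallbackDeviceType: 'gpu' });
    // or a priority-ordered list:
    const ctx3 = await navigator.ml.createContext(
        { deviceTypes: ['npu', 'gpu', 'cpu'] });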

Michael: I wanted to point out that quantization is also important for CPU and GPU, due to memory bandwidth issues
… the quantization representation should perhaps be separated from the target device
… we're looking at impacts on caching, depending on whether a model is quantized or not
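
For context, the linear 8-bit scheme Dwayne mentions maps floats to integers through a scale and zero point; a minimal sketch (function names are illustrative, not proposed op names):

    // Linear (affine) uint8 quantization and its inverse.
    function quantize(x, scale, zeroPoint) {
      return Math.min(255, Math.max(0, Math.round(x / scale) + zeroPoint));
    }
    function dequantize(q, scale, zeroPoint) {
      return (q - zeroPoint) * scale;
    }
    // e.g. with scale = 0.02 and zeroPoint = 128:
    // quantize(0.5, 0.02, 128) === 153, dequantize(153, 0.02, 128) === 0.5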

Phillis: I provided feedback in the comments; the current proposal is to add NPU and fallbacks, and it assumes NPU has a smaller set of ops while CPU and GPU have full coverage
… based on implementation experience from the TFLite and CoreML backends, op set coverage is smaller compared to CPU there too
… so if we want to signal fallback, we need to signal it for other device types too
… CoreML is opaque in terms of which compute unit the workload executes on
… sometimes the workload is executed on the CPU if the tensor is small enough, a blackbox style design

Dwayne: I know Mingming submitted a patch to Chromium for an NPU device type to experiment with that
… we don't have a concept of a fallback yet

Zoltan: Phillis, can we say CPU is always the fallback device? Should we separate fallback from the hints?

Phillis: I think hints are more accurate and match the underlying behavior better; a soft signal to the underlying backend that the caller prefers NPU

Zoltan: CPU as a fallback device? can we spec it such that it always works?

Dwayne: we could spec this as a preference, and get a signal back to the developer if it doesn't work

Dwayne: CPU as a fallback has a challenge: it needs to do graph partitioning

Phillis: CoreML itself does graph partitioning

Dwayne: similarly for DML too in the future; it happens transparently

Phillis: no need to worry about subgraph partitioning then

Ningxin_Hu: I can speak for Mingming and the implementation

Chromium implementation

Ningxin_Hu: we've added an NPU device type to the DML path for testing
… there is no fallback in the current implementation; we are collaborating with the DML team and the Intel driver team on a fallback experiment
… to introduce the NPU device type we can run a small set of models, using that as a starting point
… later with fallback to CPU and GPU

Hybrid AI exploration update

anssik: I've asked the project team to give a brief update on the proof-of-concept informed by the group's feedback. Thanks everyone for your insightful feedback shared to date!

webmachinelearning/proposals#5

<gb> Issue 5 Hybrid AI Exploration (by grgustaf)

Michael: presenting a slide with a summary of feedback received from the group
… Models/Use Cases:​
… 1) MMS – ASR​
… 2) SeamlessM4T - general speech tasks
… 3) Others? LLMs? e.g. mistral-7b
… Comments
… Two forms of hybrid AI
… - Server or Client
… - Split models
… Meaning of "hybrid"? Can also mean "heterogeneous hardware"
… Pain points, priority
… Generally saving space/download latency
… 1. Sharing/reusing large models across sites
… 2. Same resources at different URLs
… 3. Same resources in different formats
… 4. Want to expose "built-in" models
… 5. Generalizing models
… 6. Need solution that can handle adapters

jsbell: wanted to ask if you're engaged with privacy groups?

Michael: have thought about shared caches and privacy considerations, ways to mitigate that

jsbell: wanted to make sure privacy in implementations is considered; if only a small number of sites use a specific model, then that one bit of information is significant

anssik: privacy considerations are important, proposing those are documented along with the proposal

Open issues and PRs

Debrief on PRs merged recently

anssik: This topic is for the editors and PR authors to debrief the group on substantive PRs that got merged in the last few weeks and answer questions from the group.

Recently merged PRs
… PRs merged by issue type since last meeting:
… conventions: #618 #621 #627
… bug fix: #616 #619 #620
… question: #617 #632

<gb> MERGED Pull Request 618 Conventions: Add and apply a few more spec coding conventions (by inexorabletash) [conventions]

<gb> MERGED Pull Request 627 Conventions: Use "rank" for variable names, when appropriate (by inexorabletash) [conventions]

<gb> MERGED Pull Request 621 Conventions: Ensure all dict members have definitions (by inexorabletash)

<gb> MERGED Pull Request 619 Bugfix: Drop "re-throw" in MLActivation creation steps (by inexorabletash)

<gb> MERGED Pull Request 616 Bugfix: Unbalanced autolink brackets (by inexorabletash)

<gb> MERGED Pull Request 620 Bug fix: Drop "... have been checked..." notes (by inexorabletash)

<gb> MERGED Pull Request 632 add a note for empty input (by philloooo)

<gb> MERGED Pull Request 617 Update NSNet2 reference (by anssiko)

anssik: anything the editors or PR authors feel important to highlight for the broader group who may not follow day-to-day GH activity?

jsbell: not specific to any of these PRs, FYI, I'm working on a local tool written in Node.js to enforce conventions, will share it with the group when it's better baked; it runs querySelectorAll against the DOM and greps against the source

Dwayne: in CI or local?

jsbell: to be determined

anssik: much thanks Josh for your work on this tool
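
For illustration, a check of the kind Josh describes might look like this; a sketch only, not Josh's actual tool, and the specific checks are made-up examples:

    // Load the built spec with jsdom, query the DOM, and grep the source.
    const { JSDOM } = require('jsdom');          // npm install jsdom
    const { readFileSync } = require('node:fs');

    const doc = new JSDOM(readFileSync('index.html', 'utf8')).window.document;
    // e.g. flag <dfn> elements that lack a definition type
    for (const dfn of doc.querySelectorAll('dfn:not([data-dfn-type])'))
      console.warn('untyped <dfn>:', dfn.textContent);

    // and a source-level grep, e.g. for doubled words
    readFileSync('index.bs', 'utf8').split('\n').forEach((line, i) => {
      if (/\bthe the\b/i.test(line))
        console.warn(`index.bs:${i + 1}: ${line.trim()}`);
    });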

[bug] Need clarify scale factor for resample2d

anssik: issue #610 is about overflow handling convention preferences

<gb> Issue 610 Need clarify scale factor for resample2d (by BruceDai) [bug]

anssik: Josh reports part 1 of this issue was addressed by https://github.com/webmachinelearning/webnn/commit/24edf7f775ccb5105b3503449c714e2b7af563e6
… part 2 is not yet addressed in the spec IIUC:
… "clarify the limitations for the product of scale factor and spatial dimension's size"
… it looks like the validation steps of Resample2d were fixed in the implementation
… are the remaining spec changes clear? anything to discuss?

jsbell: part 2 is around the validation or clamping that may be required

Dwayne: it would be a lot of noise if we added that to all ops; can we spec that centrally?

<ningxin> +1 to handle overflow centrally
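
To illustrate the kind of central check being discussed (hypothetical validation logic, not spec text; MAX_DIMENSION is an assumed implementation limit):

    // Reject a resample2d whose scaled spatial dimension would overflow
    // or round to an invalid size.
    const MAX_DIMENSION = 0x7FFFFFFF;
    function scaledDimension(size, scale) {
      const out = Math.floor(size * scale);
      if (!Number.isFinite(out) || out < 1 || out > MAX_DIMENSION)
        throw new TypeError(`invalid scaled size: ${size} * ${scale}`);
      return out;
    }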

[bug] Synchronously validate input operands/activations

anssik: issue #572

<gb> Issue 572 Synchronously validate input operands/activations (by inexorabletash) [bug] [question]

anssik: merged PRs #591 and #605 addressed parts of this issue

<gb> MERGED Pull Request 591 Content: Define operand concept, simplify graph connection steps (by inexorabletash)

<gb> MERGED Pull Request 605 Synchronously validate input operands/validations (by inexorabletash)

anssik: Josh identified the following as remaining work:

- Standard phrasing for "be an operator for ..."

- Introducing "Validate arguments" section for each method

- Introducing "Calculate output shape" section for each method (maybe "output descriptor" is better?)

jsbell: just any feedback on those is welcome
… the first is asking how we should phrase that, an easy PR for someone who wants to say "let's do it this way!"
… others have come up in this issue, feedback wanted on those too, don't want to add too many PRs that just add text

[question] Allow no-op graphs?

anssik: issue #614

<gb> Issue 614 Allow no-op graphs? (by inexorabletash) [question]

anssik: Josh reports that in PR #603 a step was added to build() to match the Chromium implementation, which errors out if an input operand or a constant operand is specified as an output.

<gb> MERGED Pull Request 603 Content: Define build() steps more rigorously (by inexorabletash)

anssik: Dwayne notes this would nullify the possibility of a no-op graph, but also that a caller can always insert a dummy identity node to satisfy this constraint
… Ningxin notes constant-only graph seems useful, especially for GPU or NPU device
… also good discussion for "constant MLBuffer"
… proposed by Bryan as "after builder.constant(mlBuffer), it becomes read-only (ex. no writeBuffer() or dispatch() allowed)."

Dwayne: I don't think this is a big deal

ningxin: I'd like to make it clear that my comment is about constant-only graph as the use case, a model with two decoder graphs
… later in discussion with Austin and Bryan my use case is satisfied by "constant MLBuffer", so wanted to clarify this
… another comment, I want to understand if there's a use case for a no-op graph?

Dwayne: chaining models together to get flexibility along that chain; if there's a concept that just takes an input and passes it along, a dummy identity would still satisfy that (see the sketch below)

ningxin: WebNN is supposed to be a backend API, so I'm not sure if that is in scope

Dwayne: no reservations about resolving this with an empty graph

ningxin: because Bryan mentioned this was added to the TODO, we can track this as part of the MLBuffer proposal
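
For reference, the dummy-identity workaround Dwayne mentions could look like this (a sketch; shapes are illustrative and an existing MLGraphBuilder builder is assumed):

    // A pass-through graph: the output is identity(input) rather than the
    // input operand itself, satisfying the build() validation step.
    const x = builder.input('x', { dataType: 'float32', dimensions: [2, 2] });
    const graph = await builder.build({ y: builder.identity(x) });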

[question] Graph with no input

anssik: issue #615 and PR #632

<gb> MERGED Pull Request 632 add a note for empty input (by philloooo)

<gb> CLOSED Issue 615 Graph with no input (by philloooo) [question]

anssik: this one was fixed, thanks!

Phillis: I added a note that says WebNN does support that; if the backend does not, implementations can work around it.

[question] Can an MLGraphBuilder be reused?

anssik: issue #567

<gb> Issue 567 Can an MLGraphBuilder be reused? (by reillyeon) [question]

anssik: Reilly asks "Are there any known use cases for MLGraphBuilder reuse?"
… Ningxin mentions one use case:
… "Whisper, because WebNN doesn't support If operator, there will be two WebNN sub-graphs being built. One with past Key Value (KV) cache ("with_past") and another one without past KV cache ("no_past"). The inference code will run the "no_past" sub-graph for the first iteration and run the "with_past" sub-graph for the following iterations. The two sub-graphs actually share some common weights. It would be useful if the same

weights being built by builder.constant() can be taken by the operators of two sub-graphs."

ningxin: this is related to the previous topic; a merged model, in ONNX terms, has two subgraphs under an If operator; because WebNN does not support the If op, in the backend the If op will fall back to Wasm; one subgraph is with past KV cache, and another without past KV cache
… after the first iteration a value can be reused in the KV cache; later on the cache is reused until the sequence length is reached
… used in transformer models, including Whisper
… for ONNX RT Web, weights are shared between the subgraphs; we create a constant for the weights in the same GraphBuilder and reuse it, so in this use case we can reuse the MLGraphBuilder
… later, if we talk about MLBuffer: if we can make MLBuffer hold the weights, persisted in device memory, that'd be even better; two graphs could use the constants via MLBuffer w/o any duplication of data upload or memory copy
… this also avoids duplicate copies in device memory
… I think the two things, constant and MLGraphBuilder reuse, are related

<jsbell> From Reilly Grant: I'm satisfied that there are good use cases for constructing multiple MLGraphs from an MLGraphBuilder but we need example code and implementation experience before we can decide specifically how it will work. In particular I'm still concerned by the constraints it puts on implementations if build() can be called an arbitrary

<jsbell> number of times and so I'd like the group to consider a multibuild() variant which allows compiling multiple graphs simultaneously which is likely to be more efficient for implementations while hopefully giving the same power to developers.

<ningxin> +1 to explore more
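
Since Reilly asks for example code, here is a sketch of the sharing pattern Ningxin describes; shapes and names are made up, and whether build() may be called twice per builder is exactly the open question in this issue:

    // Two graphs built from one builder, sharing one constant() weights operand.
    const context = await navigator.ml.createContext();
    const builder = new MLGraphBuilder(context);
    const desc = { dataType: 'float32', dimensions: [768, 768] };
    const weightData = new Float32Array(768 * 768);     // placeholder weights
    const weights = builder.constant(desc, weightData); // shared by both graphs
    const x1 = builder.input('x_no_past', desc);
    const noPast = await builder.build({ y: builder.matmul(x1, weights) });
    const x2 = builder.input('x_with_past', desc);
    const withPast = await builder.build({ y: builder.matmul(x2, weights) });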

[question] Consider adopting new broadcasting rules

anssik: issue #590

<gb> Issue 590 Consider adopting new broadcasting rules (by a-sully) [question]

anssik: discussed on our 21 March telcon
… to recap, Austin sees three options:
… Option 1: Adopt NumPy's broadcasting rules
… Option 2: Adopt XLA's broadcasting rules
… Option 3: Keep the status quo
… Dwayne offered one more option:
… Option 4: Dwayne's proposal
… - keep unidirectional for rare cases (expand and GEMM)
… - research more backends to potentially add restrictions that inputs must have the same rank.
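
For readers less familiar with the distinction, a shape-level illustration (not taken from the issue):

    // Bidirectional (NumPy-style): shapes align from the trailing dimension
    // and a size-1 dimension in either operand stretches:
    //   add([4, 1], [1, 3]) -> [4, 3]
    // Unidirectional (the status quo for expand() and gemm()): only one
    // operand may stretch, toward the other's shape:
    //   [4, 1] broadcasts to [4, 3], but [4, 3] does not broadcast to [4, 1]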

anssik: I'm hearing we let this issue sit

[question] Is "validate graph resources" backwards?

anssik: issue #602 and PR #622 (thanks Josh!)

<gb> Pull Request 622 Bug fix: Make "validate graph resources" test reflexive (by inexorabletash)

<gb> Issue 602 Is "validate graph resources" backwards? (by inexorabletash) [question]

jsbell: put up the PR for the reflexive test
… came up with a new behaviour after the telcon, reflected in the PR
… validation of input and output is asymmetric, for better developer ergonomics

[question] Need clarify the usage of axes=[0,1] for resample2d

anssik: issue #624

<gb> Issue 624 Need clarify the usage of axes=[0,1] for resample2d (by BruceDai) [question] [operator specific]

anssik: this implementation-experience-informed question has details in the issue

Dwayne: this came up in code review; resample supports a number of different axes
… does anyone know the history of this design?

[feature-request] Gaussian error linear unit (GELU) activation

anssik: issue #626

<gb> Issue 626 Need Gelu operation (by mingmingtasd) [feature request] [operator specific]

anssik: a proposed new op passes our initial tests:
… sample models: Whisper base, Stable Diffusion U-Net, Segment Everything decoder
… cross-framework support: ONNX, TF, PyTorch
… cross-platform implementability: CoreML, DirectML, OpenVINO
… this seems like a reasonable addition, any comments?

<jsbell> Seems reasonable

ningxin: Mingming mentioned this gives a perf gain if supported natively
… in the experimental NPU implementation we observed a perf benefit with Gelu over the emulation path
… with this activation we can fuse with matmul for even better performance
… PR #628 is out already

<gb> Pull Request 628 Define Gelu operation (by mingmingtasd)
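
For reference, gelu(x) = 0.5 * x * (1 + erf(x / sqrt(2))); the emulation path mentioned above could be expressed with existing element-wise ops roughly as follows (a sketch, assuming the scalar constant() overload and an erf() op are available):

    // Emulated gelu built from element-wise ops; a native gelu lets
    // backends fuse this, e.g. with a preceding matmul.
    function geluEmulated(builder, x) {
      const mul = (a, b) => builder.mul(a, b);
      return mul(mul(builder.constant(0.5), x),
                 builder.add(builder.constant(1),
                             builder.erf(mul(x, builder.constant(1 / Math.SQRT2)))));
    }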

Minutes manually created (not a transcript), formatted by scribe.perl version 221 (Fri Jul 21 14:01:30 2023 UTC).

Diagnostics


Maybe present: anssik, Dwayne, jsbell, Michael, ningxin, Phillis, Zoltan

All speakers: anssik, Dwayne, jsbell, Michael, ningxin, Ningxin_Hu, Phillis, Zoltan

Active on IRC: anssik, jsbell, McCool, ningxin, Ningxin_Hu, zkis