Meeting minutes
Welcome to our new participant Kaushik Satpathy from Yahoo!
Announcements
anssik: implementation status of WebNN has been updated, thank you all for the great progress!
Implementation Status of WebNN
webmachinelearning/
<gb> MERGED Pull Request 84 November update for Impl Status (by ibelem)
anssik: also Awesome WebNN, a curated list of awesome things related to the WebNN API, has received updates
webmachinelearning/
<gb> MERGED Pull Request 11 November 2024 Update (by ibelem)
anssik: please share this reference to people interested in this topic for the latest articles, demos, presentations, samples, tutorials, videos and more about WebNN and the ecosystem around it
Call for review: WebML Community Group Charter update
Repository: webmachinelearning/charter
anssik: on 2024-11-01 we initiated a call for review of the Web Machine Learning Community Group Charter update, open until 2024-12-02
… folks who are also WebML Community Group participants are encouraged to review the Charter proposal
… to be eligible to vote, you must be a CG participant
WebML CG Charter update, vote by 2024-12-02
How to join the CG:
… Summary of changes: refresh Goals, add Task-specific APIs and Prompt API to Deliverables, note WebNN has graduated to the WG
… for more information about the proposed task-specific APIs, please refer to the TPAC 2024 presentation:
TPAC 2024 slides for task-specific APIs
anssik: and the GH repos for the proposals:
explainers-by-googlers/
anssik: any questions?
Device selection abstractions
Repository: webmachinelearning/webnn
anssik: Zoltan synthesized the group's current thinking and discussions into a device selection explainer, thanks!
… issue #749 and PR #784
<gb> Issue 749 MLContextOptions.deviceType seems unnecessary outside of conformance testing (by mwyrzykowski) [device selection]
<gb> Pull Request 784 Add device selection explainer (WiP) (by zolkis)
anssik: it discusses intro, history, key use cases and requirements, considered alternatives, examples and design
… and open questions
… the doc is written so that we can hand this to folks outside this group for review, e.g. privacy, TAG etc.
zolkis: this is a collection of thoughts from GH issues, it is WIP still
… Ningxin and Chai contributed in the early phase, as did Rafael, Joshua and MikeW
zolkis: MikeW's proposal is in considered alternatives, fingerprinting story the remaining concern
… we can possibly go with Mike's proposal if we give more examples
anssik: considered alternatives:
… 1. Keep the current MLDeviceType as a context option, but improve the device type names
… 2. Follow this proposal MLOpSupportLimits should be opt-in per #759
<gb> Issue 759 MLOpSupportLimits should be opt-in with base functionality (by mwyrzykowski) [device selection]
MikeW: great document, thanks for writing this! Option 1 is simpler, could flesh that out right now
zolkis: listing of opLimits is inside the context, if we move it out then we know what the underlying platform is capable of, can match with the model to run
… we need more concrete examples for Option 2, whether go with full WebGPU adapter approach
… Option 2 could come after Option 1
Dwayne: would Option 1 be relaxing the device type to be a hint?
zolkis: PowerPerformance would provide the hint that the implementation could use to map to underlying processing unit(s), or could rename the MLDeviceType
zolkis: let's solicit more use cases to have a complete view
anssik: for feedback, please use the PR #784
<gb> Pull Request 784 Add device selection explainer (WiP) (by zolkis)
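The "hint, not mandate" direction discussed for Option 1 can be sketched as follows. This is purely illustrative of how an implementation might map a power-preference hint to an ordered list of candidate devices; the device names and orderings are assumptions, not taken from the explainer or the spec:

```python
def candidate_devices(power_preference):
    """Hypothetical mapping from a power-preference hint to an ordered
    list of compute devices an implementation might try, illustrating a
    hint the implementation is free to interpret (names are assumptions,
    not from the WebNN explainer)."""
    order = {
        "low-power":        ["npu", "gpu", "cpu"],
        "high-performance": ["gpu", "npu", "cpu"],
        "default":          ["cpu", "gpu", "npu"],
    }
    # An unknown or absent hint falls back to the default ordering.
    return order.get(power_preference, order["default"])

print(candidate_devices("low-power"))  # ['npu', 'gpu', 'cpu']
```

Because the result is an ordered preference rather than a mandate, the implementation retains freedom to fall back, which is what distinguishes a hint from the current MLDeviceType.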
MLTensor
anssik: The group is gathering implementation experience on the MLTensor design to inform an upcoming specification update.
… The explainer is considered the source of truth in this prototyping phase.
… I'd like to discuss the open questions, currently 5
https://
Austin: not blocking forward progress, but the first bullet has come up in the Chromium implementation; we deprecated compute() in favor of dispatch(), and if there's an error you don't find out about it
How will errors be surfaced?
anssik: issue #477
<gb> CLOSED Issue 477 API lacks handling for async ML device errors on the context (by bbernhar) [question]
Bryan: want to understand how backends can surface errors midway?
Austin: for many backends peak memory usage can exceed what's available, you could OOM while inferencing
… seeing failures on some backends because we have implementation gaps, a few classes of errors
… e.g. trying to allocate too much memory, the model file is deleted from disk, or you compile the model and think everything is good but things change underneath
… or if you do scatter or gather with OOB indices
<jsbell> webmachinelearning/
<gb> Issue 778 Proposal: Report non-fatal errors from the WebNN timeline (by a-sully) [feature request]
Austin: CoreML backend may have higher peak memory usage during inferencing than after compile
Bryan: writeTensor() has the same issues as dispatch()
(this is expanded in issue #778)
<gb> Issue 778 Proposal: Report non-fatal errors from the WebNN timeline (by a-sully) [feature request]
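One of the error classes mentioned above, gather with out-of-bounds indices, can be shown with a minimal sketch. The clamping fallback below is one possible well-defined behavior, chosen for illustration; it is not what the spec mandates:

```python
def gather(data, indices, clamp=False):
    """1-D gather sketch. With clamp=False an out-of-bounds index raises
    at dispatch time, the kind of mid-execution failure discussed above;
    with clamp=True indices are clamped into range, one well-defined
    alternative (illustrative only, not spec behavior)."""
    out = []
    for i in indices:
        if not 0 <= i < len(data):
            if not clamp:
                raise IndexError(f"gather index {i} out of bounds")
            i = min(max(i, 0), len(data) - 1)
        out.append(data[i])
    return out

print(gather([10, 20, 30], [0, 2]))           # [10, 30]
print(gather([10, 20, 30], [5], clamp=True))  # [30]
```

The open question in issue #778 is how such failures, raised on the WebNN timeline rather than at call time, get surfaced back to script.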
Core op set & MLIR Linalg mapping
anssik: issue #573
<gb> Issue 573 Core operator set (by philloooo) [question] [opset]
anssik: this topic is to discuss core op set, primitive ops informed by MLIR Linalg, PyTorch Prims IR, TOSA, StableHLO others.
… I propose we look at the MLIR Linalg mapping today
… Dwayne contributed a preliminary analysis of op correspondence (thanks!):
Machine Learning Operator Mapping
anssik: Dwayne notes WebNN demonstrates viability of popular models, but it lacks breadth
… implementing all the 800+ ops is untenable due to interop requirements (multiple browsers, multiple underlying platforms and backends), which is why we are investigating what makes for an appropriate set of primitive ops to allow composition
Dwayne: no firm recommendations, but some categories that are absent
… WebNN backend support 1D to 3D
… modular div, rounding, bitwise
… composite ops, lego blocks for decomposition of other ops, e.g. sumPooling
… sheet legend: yellow = absent; red = not interesting, it's a named variant
Dwayne: yellow = worth adding to WebNN
Joshua: have you done analysis what backends in Chromium support these?
Dwayne: it is future work
Joshua: do we look at the current backends, or look at primitive core ops on top of which everything can be constructed on
anssik: how could the group help?
Dwayne: investigation on CoreML backend and TFLite support for these would be welcome
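The "lego blocks" idea mentioned above, where composite ops decompose into primitives, can be illustrated with sumPooling built from averagePooling. This is a 1-D pure-Python sketch with illustrative names, not WebNN API code:

```python
def average_pool_1d(x, window):
    """1-D average pooling, stride 1, no padding (a stand-in for a
    WebNN averagePool primitive in this sketch)."""
    return [sum(x[i:i + window]) / window
            for i in range(len(x) - window + 1)]

def sum_pool_1d(x, window):
    """sumPooling expressed as a composition: averagePool followed by a
    scalar multiply by the window size, illustrating how a composite op
    can decompose onto existing primitives."""
    return [v * window for v in average_pool_1d(x, window)]

x = [1, 2, 3, 4]
print(sum_pool_1d(x, window=2))  # [3.0, 5.0, 7.0]
```

A backend that lacks a native sum-pool can lower it this way, which is the kind of decomposition path the core op set investigation is mapping out.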
Support reverse operator
anssik: issue #773
<gb> Issue 773 Support `reverse` operator (by huningxin) [feature request] [operator specific]
anssik: Ningxin proposed a reverse op that reverses the order of the input tensor along specified axes
… improves performance of PyTorch models
… framework support: PT, TF, ONNX (via reverse slicing with step -1)
… native APIs: DML (similarly to ONNX), CoreML, TFLite
… also in primitive opsets StableHLO, TOSA, PT Prims
Ningxin: we have Chromium CL to prototype
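The reverse semantics proposed above, including the ONNX emulation via slicing with step -1, can be illustrated in pure Python on nested lists (names are illustrative, not spec text):

```python
def reverse(tensor, axes, axis=0):
    """Reverse a nested-list tensor along the given set of axes,
    mirroring the proposed WebNN `reverse` op; the [::-1] slice is the
    same step -1 trick used to emulate it in ONNX."""
    if not isinstance(tensor, list):
        return tensor  # scalar: nothing to reverse
    rows = [reverse(t, axes, axis + 1) for t in tensor]
    return rows[::-1] if axis in axes else rows

# Reverse a 2x3 tensor along axis 1 only.
x = [[1, 2, 3],
     [4, 5, 6]]
print(reverse(x, axes={1}))  # [[3, 2, 1], [6, 5, 4]]
```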
Support strides option for slice operator
anssik: issue #772
<gb> Issue 772 Support strides option for `slice` operator (by huningxin) [feature request] [operator specific]
anssik: Ningxin reports the strides option for the slice operator is widely supported, but WebNN's flavour only supports a stride of 1
… real-world models with stride > 1 cause WebNN to fall back to another EP, impacting performance
anssik: can we link to some sample models in this issue?
Ningxin: the model is a transformer-based model with some customization, possibly not shareable yet
Dwayne: it is an audio model, I can share that it is for speech recognition usage, consider it a Whisper variant of sorts
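The requested generalization, slice with a stride greater than 1, can be sketched in one dimension (function and parameter names are illustrative, not the WebNN signature):

```python
def strided_slice_1d(data, start, size, stride=1):
    """1-D slice with a stride, sketching the generalization requested
    for WebNN's `slice`. With stride=1 this matches the current WebNN
    behavior; stride>1 picks every stride-th element from `start`,
    yielding at most `size` elements."""
    out = []
    i = start
    while len(out) < size and i < len(data):
        out.append(data[i])
        i += stride
    return out

x = [0, 1, 2, 3, 4, 5, 6, 7]
print(strided_slice_1d(x, start=1, size=3, stride=2))  # [1, 3, 5]
print(strided_slice_1d(x, start=1, size=3, stride=1))  # [1, 2, 3]
```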
Support block-wise quantization
anssik: issue #779
<gb> Issue 779 Support block-wise quantization (by huningxin) [operator specific]
anssik: request to support block-wise quantization
… allows input tensors to be divided into smaller, independently quantized blocks, used by SLMs
… benefits include faster optimization and high precision quantization
… DML and CoreML support, it seems no TFLite/LiteRT?
… Dwayne suggests a decomp path is viable?
… no API signature changes, only changes to the algorithm
Ningxin: we have a prototype in Chromium for DML backend and we successfully used that to enable Phi3-mini with this capability
… this prototype is successful
Joshua: would be great if you could share in the issue performance improvements "X times faster"
Ningxin: for TF we need a composition, can file an issue for TFLite
… I can follow up on that
… for CoreML, Austin can comment
Austin: more constraints than DML, block-wise quant is theoretically supportable
Ningxin: Phi3-mini uses block-wise quantization and will hit the CoreML implementation
<dwayner> The memory savings were huge (I forget the exact numbers before/after, but IIRC 20GB's before o_o).
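The per-block scheme discussed above, each block quantized with its own scale, can be sketched in pure Python. The symmetric int8 scheme and the block size below are illustrative assumptions; as noted in the minutes, the actual proposal changes only the dequantization algorithm, not API signatures:

```python
def blockwise_quantize(values, block_size):
    """Quantize a 1-D list block by block, each block with its own scale
    (symmetric int8 scheme chosen for illustration only)."""
    scales, quantized = [], []
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        scale = max(abs(v) for v in block) / 127 or 1.0
        scales.append(scale)
        quantized.extend(round(v / scale) for v in block)
    return quantized, scales

def blockwise_dequantize(quantized, scales, block_size):
    """Reconstruct values using the per-block scales."""
    return [q * scales[i // block_size] for i, q in enumerate(quantized)]

vals = [0.1, -0.5, 100.0, 2.0]  # outlier confined to the second block
q, s = blockwise_quantize(vals, block_size=2)
restored = blockwise_dequantize(q, s, block_size=2)
```

Because the outlier only inflates the scale of its own block, the small values in the first block keep their precision, which is the higher-precision benefit mentioned above.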
<jsbell> If any TPAC attendees can answer the question in webmachinelearning/
<gb> Issue 470 Simplify `matmul` op (by huningxin) [operator specific]