W3C

– DRAFT –
WebML WG Teleconference – 2 November 2023

02 November 2023

Attendees

Present
Anssi_Kostiainen, Chai_Chaoweeraprasit, Dwayne_Robinson, Etienne_Noel, Joshua_Bell, Joshua_Lochner, Ningxin_Hu, Rafael_Cintron, Vivek_Sekhar, Zoltan_Kis
Regrets
Dominique_Hazael-Massieux
Chair
Anssi
Scribe
Anssi, anssik

Meeting minutes

Repository: webmachinelearning/webnn

<Joshua_Lochner> and Joshua Bell :)

WebNN v2: Review transformer ops spec contributions (continued)

anssik: issue #375

<gb> Issue 375 Support for transformers (by dontcallmedom) [v2] [operation set]

anssik: continuing from our previous call, recapping with some resources first

Guidelines for adding new operations

anssik: we have identified use cases and sample models:

Text-to-image: stable-diffusion-v1-5

Image segmentation: segment-anything

Speech-to-text: whisper-tiny

Text-to-text generation (encoder-decoder): t5-small

Text-to-text generation (encoder-decoder): m2m100_418M

Text-generation (decoder-only): Llama-2-7b

anssik: we have done an op decomposition assessment:

Transformer Models Analysis

anssik: Chai, you indicated you've been working on a big PR for the new op spec definitions
… anything you'd like to bring to the group's attention in this meeting, or questions to ask from the group?
… I'm wondering whether it'd help if the big PR is split into smaller PRs so those pieces could be reviewed in parallel while new ops are still being defined? Or are there dependencies that make this not practical?

Chai: this PR is a huge change, but mostly additions, no breaking changes yet, still working on it, past the midpoint
… some comments: the proposal makes sense, it is a natural extension of what we have; there was a suggestion to combine a few ops that do nothing to the data itself, but only update its shape (sketched below)
… e.g. squeeze, unsqueeze, flatten2d
… the other part is normalization, a good example of the design principles we have
… the normalization ops are similar to each other
… my preference is to keep things in one PR, to keep things consistent
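For illustration, a minimal sketch (TypeScript; plain helper functions, no WebNN API assumed) of why these ops are candidates for combination: each leaves the data untouched and only computes a new shape, which a single reshape call can then apply:

    // Each helper derives the output shape for a reshape; the tensor data
    // itself is never touched.
    function squeezedShape(shape: number[], axes: number[]): number[] {
      // Drop the listed axes, but only where the dimension is 1.
      return shape.filter((d, i) => !(axes.includes(i) && d === 1));
    }

    function unsqueezedShape(shape: number[], axes: number[]): number[] {
      const out = [...shape];
      // Insert a size-1 dimension at each requested output axis, ascending.
      for (const a of [...axes].sort((x, y) => x - y)) out.splice(a, 0, 1);
      return out;
    }

A backend could then implement all such ops through the existing reshape op.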

anssik: the PR does not need to be perfect, we'll review it in the group and perfect it together

Ningxin_Hu: an update about the Transformer Models Analysis
… int8 quantized models for whisper-tiny have been added, with encoder and decoder columns for int8
… feedback from the community was that the quantized model is used in production
… four new ops are used in the quantized model: ConvInteger, MatMulInteger, DequantizeLinear, DynamicQuantizeLinear
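For background, a hedged sketch (TypeScript, WebNN typings assumed ambient) of how one of these, DequantizeLinear, could decompose onto element-wise ops following its ONNX definition y = (x - zeroPoint) * scale; the cast op is an assumption here (it is among the proposed transformer ops):

    // DequantizeLinear as sub + mul, per the ONNX semantics
    // y = (x - zeroPoint) * scale. cast() is assumed to be available.
    function dequantizeLinear(builder: MLGraphBuilder, x: MLOperand,
                              scale: MLOperand, zeroPoint: MLOperand): MLOperand {
      const xf = builder.cast(x, 'float32');          // int8 -> float32
      const zf = builder.cast(zeroPoint, 'float32');
      return builder.mul(builder.sub(xf, zf), scale);
    }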

Joshua_Lochner: context for 8-bit weights, people wanted to use smaller weights with Transformers.js

<Joshua_Lochner> https://huggingface.co/spaces/Xenova/whisper-web

<Joshua_Lochner> https://i.imgur.com/oPCEtXs.png

Joshua_Lochner: I experimented with this; text-to-speech does not perform well, but the other way around, speech-to-text, performs well
… as an example, whisper-tiny is 40 MB when 8-bit quantized
… very little quality degradation, able to run on mobile and low-resource computers, surprising results

<Joshua_Lochner> huggingface/distil-whisper

Joshua_Lochner: another thing, distilled versions of whisper-medium and whisper-large are to be released in an hour
… the paper is also ready

Chai: noting the transformer ops will not include quantized ops yet in the big PR
… regarding the size of the spec, it is pretty big; thinking of factoring out some sections to a more suitable medium
… e.g. the implementation sections are about parameter validation, some parts covered by WPT tests
… I feel that as the spec grows, it will be more important to keep only the relevant parts of the ops in the document, and offload some of this validation to WPT in a more complete way

Chai: implementation sections could be offloaded, e.g. param validation

Joshua_Lochner: these new models use the same architecture as the existing ones; for the distilled versions being released, internal decoder layers are removed, no new ops added
… any op in the whisper-tiny encoder or decoder will be the same for the 8-bit quantized versions of the models

Enhancements

0D scalars

anssik: #390

<gb> Issue 390 0D scalars (by fdwr)

anssik: zero-dimension scalars, we discussed this earlier this year and I'd like to drive this to resolution
… Dwayne identified a need for zero-dimension scalars while prototyping additional models we're now discussing
… there's data that suggests this enhancement should be baked into the API spec, consider:
… - every ML library represents 0D scalars via the shape, e.g. NumPy, TF, PyTorch, ONNX, XNNPACK, and the SafeTensors file format
… - the proposed spec enhancement is very small, deletion of one line in https://www.w3.org/TR/webnn/#api-mloperand-create check dimensions steps:
… "2. If dimensions.length is 0, return false."
… recently two related changes were suggested for consideration by Ningxin if we add this 0D scalar support:
… - make MLOperandDescriptor.dimensions a required field to allow distinguishing a scalar from a 1D tensor (illustrated below)
… - drop the scalar-specific MLGraphBuilder.constant(value, dataType) variant, discussed in issue #475

<gb> Issue 475 Remove `builder.constant(value, dataType)` variant (by huningxin)

anssik: it looks like the first (make MLOperandDescriptor.dimensions a required field) should be rolled into the PR that adds support for 0D scalars (deletes the one line from check dimensions steps), correct?
… issue #475 to be addressed in a separate PR?

Dwayne: I sent a PR for this, not an enhancement per se, but the wording needs to be fixed

PR #476

<gb> Pull Request 476 Fix dimensions for 0D scalars (by fdwr)
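For illustration, a minimal sketch (TypeScript, WebNN typings assumed ambient) of the distinction at issue once the one-line deletion lands: an empty dimensions array denotes a 0D scalar, while [1] is a 1D tensor holding one element, and making the field required removes the ambiguity when it is omitted:

    // Hypothetical usage after the change:
    const scalar = builder.constant({ dataType: 'float32', dimensions: [] },
                                    new Float32Array([3.14]));  // 0D scalar
    const oneD = builder.constant({ dataType: 'float32', dimensions: [1] },
                                  new Float32Array([3.14]));    // 1D tensor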

Softmax axis absent

anssik: issue #466

<gb> Issue 466 Softmax axis absent (by fdwr)

anssik: issue reported by Wanming (thanks!) in ONNX Runtime WebNN EP review
… this is another great example of how our diverse impl experience informs the spec development, including feedback from web engine and framework implementations
… Dwayne summarized the issue well, TF/PT/ONNX all take an axis parameter, but WebNN's softmax does not
… platform support is good:
… - Apple Metal Performance Shaders softMax has an axis
… - Apple Model Intermediate Language (MIL) activation.softmax supports an axis
… - DirectML's DML_ACTIVATION_SOFTMAX1_OPERATOR_DESC supports an arbitrary axis list and dimensions
… to be investigated is XNNPACK, currently limited to 2D input
… Dwayne proposed two solutions: 1) update XNNPACK to accept an axis, or 2) use the existing XNNPACK operator plus a reshape or transpose (sketched below)
… Dwayne also proposes the required IDL changes in the issue
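For illustration, a hedged sketch (TypeScript, WebNN typings assumed ambient) of the second option's shape mechanics: move the requested axis to the last position, flatten to 2D for the existing softmax, then restore the original layout; the input shape is passed in explicitly as an assumption:

    function softmaxAlongAxis(builder: MLGraphBuilder, x: MLOperand,
                              shape: number[], axis: number): MLOperand {
      const rank = shape.length;
      // Permutation that moves `axis` to the last position.
      const perm = [...Array(rank).keys()].filter(i => i !== axis).concat([axis]);
      const moved = builder.transpose(x, { permutation: perm });
      // Flatten to 2D: [product of the other dims, size of `axis`].
      const n = shape[axis];
      const rows = shape.reduce((p, d, i) => (i === axis ? p : p * d), 1);
      const soft = builder.softmax(builder.reshape(moved, [rows, n]));
      // Restore the transposed shape, then invert the permutation.
      const back = builder.reshape(soft, perm.map(i => shape[i]));
      const inv: number[] = [];
      perm.forEach((p, i) => { inv[p] = i; });
      return builder.transpose(back, { permutation: inv });
    }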

Chai: proposing we tackle these issues from impl and prototypes

Ningxin_Hu: question to Chrome folks, can we send a CL to prototype this and inform the spec effort?

Joshua_Bell: fine with that

<Ningxin_Hu> sounds great, thanks!

anssik: good to move ahead with a CL to help inform the spec design

split() into sizes not widely supported

anssik: issue #392

<gb> Issue 392 `split` into sizes are not widely supported (by huningxin)

anssik: this is an issue raised by Jiawei in implementation review (thanks!)
… WebNN split supports two variants:
… - "number of split" (when splits argument is an unsigned long)
… - "split into sizes" (when splits argument is a sequence<unsigned long>)
… however, some backends such as XNNPACK don't support "split into sizes"
… possible solutions proposed by Ningxin:
… - decompose split into multiple slices (per informative emulation path in the spec, see https://www.w3.org/TR/webnn/#api-mlgraphbuilder-split)
… - throw an error and leave it to the framework to handle
… Dwayne notes many frameworks support variable-length splits: TensorFlow, PyTorch, ONNX
… and provides the following options expanding upon Ningxin's initial proposal:
… 1. Keep dedicated split
… 2. Backend decomposes variable lengths into slice calls
… 3. Front end decomposes variable lengths into slice calls (see the sketch after this list)
… 4. Throw error if backend doesn't support them
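For illustration, a hedged sketch (TypeScript, WebNN typings assumed ambient) of option 3, following the informative emulation path in the spec: each requested size becomes one slice call along the split axis; the input shape is passed in explicitly as an assumption:

    function splitIntoSizes(builder: MLGraphBuilder, x: MLOperand,
                            shape: number[], sizes: number[], axis: number): MLOperand[] {
      const outputs: MLOperand[] = [];
      let start = 0;
      for (const size of sizes) {
        // Slice the full extent of every dimension except the split axis.
        const starts = shape.map((_, i) => (i === axis ? start : 0));
        const sliceSizes = shape.map((d, i) => (i === axis ? size : d));
        outputs.push(builder.slice(x, starts, sliceSizes));
        start += size;
      }
      return outputs;
    }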

Dwayne: Does XNNPACK support variable-size windows for concat, which is split's symmetric pair operator?

<Ningxin_Hu> https://bugs.chromium.org/p/chromium/issues/detail?id=1492036

Ningxin_Hu: we also opened another Chromium issue to benchmark split vs slice for the DirectML backend, this will inform the spec design decision
… decomposing split may have an overhead, we want to understand that better and get data from the benchmark

Clarify the restriction for minValue and maxValue of MLClampOptions

anssik: issue #396

<gb> Issue 396 Clarify the restriction for `minValue` and `maxValue` of `MLClampOptions` (by huningxin)

anssik: Current behavior: "WebNN clamp limits the input tensor element-wise within a range specified by the minimum and maximum values. The minimum and maximum values could be specified by MLClampOptions"
… Google Security team's Alex asked "can min & max be == floats and is that ok or not?"
… The current Chromium implementation requires "min <= max"
… TF.js implements this restriction too
… however XNNPACK is stricter than that and requires "min < max"
… so the WebNN spec needs to clarify this restriction; Dwayne provided data on how other implementations handle this: TF, PyTorch, ONNX, C++, DirectML
… and the result is that all frameworks accept inverted ranges
… proposed solutions:
… 1) accept inverted ranges (the impl adjusts min/max values before passing them to the backend; see the sketch below)
… 2) reject min > max, letting the caller adjust min/max values before calling
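For illustration, a hedged sketch (TypeScript, WebNN typings assumed ambient) of the first option's semantics, defining clamp as min(max(x, minValue), maxValue) per the spec's informative emulation path; with an inverted range this yields maxValue everywhere, matching e.g. NumPy's clip. The 0D scalar constants assume the enhancement discussed earlier:

    function clampEmulated(builder: MLGraphBuilder, x: MLOperand,
                           minValue: number, maxValue: number): MLOperand {
      const desc = { dataType: 'float32', dimensions: [] };  // 0D scalars
      const lo = builder.constant(desc, new Float32Array([minValue]));
      const hi = builder.constant(desc, new Float32Array([maxValue]));
      // min(max(x, lo), hi): for lo > hi the result is hi everywhere.
      return builder.min(builder.max(x, lo), hi);
    }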

Dwayne: this issue implies that if a backend does not support something, we should limit WebNN similarly
… there are ways WebNN can have one policy and backends can still implement it

Ningxin_Hu: frameworks behave differently, does that mean we should avoid different behavior?

Dwayne: first I want to check if this issue is about empty range or inverted range?

Ningxin_Hu: inverted range as in the examples

Dwayne: for consistency, WebNN backend should clamp the min and max values
… running WebNN on different backends would otherwise give inconsistent results

jsbell: we need to assume web developers won't test on every platform, so subtle differences will be considered bugs
… platform differences such as whether a codec is supported, where one can query for support, offer some level of feature detection; this we can expect developers to handle, but not subtle differences
… WebNN is a bit different, developers are going through a framework, but even there framework authors need to know about these subtle differences
… except in some cases where a high-level op may be supported or not, then feature detection may be OK
… we need to provide a lot of documentation to framework authors, and we need to make sure the behaviour is consistent

RafaelCintron: I agree with jsbell, WebGPU is in a similar position, a high-level graphics API with multiple backends
… if backends behave differently, they discuss it in the group and usually majority wins, the minority backend gets polyfilled
… emulation may regress performance, so features are detectable


RafaelCintron: feature detection is the last resort in WebGPU API

Dwayne: XNNPACK we can polyfill; the ranges question needs a survey of more backends

anssik: is this blocking the implementation?

Ningxin_Hu: the current implementation behaviour needs to be fixed in either XNNPACK or the Chromium validation logic

Minutes manually created (not a transcript), formatted by scribe.perl version 221 (Fri Jul 21 14:01:30 2023 UTC).

Diagnostics

Maybe present: anssik, Chai, Dwayne, jsbell, RafaelCintron

All speakers: anssik, Chai, Dwayne, Joshua_Bell, Joshua_Lochner, jsbell, Ningxin_Hu, RafaelCintron

Active on IRC: anssik, Chai, Joshua_Lochner, jsbell, Ningxin_Hu, RafaelCintron