Meeting minutes
Repository: webmachinelearning/webnn
WebNN Operator Update Wave 3
anssik: at TPAC 2024 the group resolved to update spec with Wave 3 operators
… the details are in Dwayne's excellent presentation:
WebNN Operator Update Wave 3 (slides)
TPAC 2024 resolution and discussion
anssik: implementation status tracker for these Wave 3 ops has been updated:
Implementation Status incl. Wave 3 ops
anssik: per implementation status all Wave 3 ops are implemented across >=2 backends and used by at least one framework
… a few observations:
… - gatherElements and scatterElements use emulation paths on LiteRT backend (denoted with "Emulated with")
… - notEqual op was not implemented
… overall Wave 3 ops map nicely to backends, anyone want to share implementation experience across WebNN backends and frameworks?
Dwayne: everything is implemented except notEqual
… WebNN EP in ORT has complete coverage except notEqual
ningxin: ORT Web reverse is still work in progress; it was introduced later, not in the initial proposal, as a requirement for a customer model that we felt was important to incorporate
… development of that op has some delay, but we see no issues implementing it in ORT; it can be implemented with ONNX Slice with a negative step
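As a rough sketch of that decomposition idea, reversing along an axis can be expressed as a negative-step slice. Below is a hedged Python analogue (plain nested lists standing in for tensors; the function name is illustrative, not ORT or WebNN API):

```python
# Hypothetical sketch: emulating a reverse op along one axis with a
# negative-step slice, analogous to ONNX Slice with steps=[-1].

def reverse(tensor, axis):
    """Reverse a nested-list tensor along the given axis."""
    if axis == 0:
        return tensor[::-1]  # step -1 reverses this dimension
    # recurse into sub-tensors until the target axis is reached
    return [reverse(sub, axis - 1) for sub in tensor]
```

For example, `reverse([[1, 2], [3, 4]], 1)` flips each row, which is the same result ONNX Slice would produce with a negative step on axis 1.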
<ningxin> webmachinelearning/
<gb> Issue 767 Request the decomposition for gatherElements, scatterElements and scatterND (by fujunwei) [operator specific]
ningxin: for the Chromium TFLite implementation, scatterElements and gatherElements have no corresponding ops, so we use emulation paths
… specifically for constant indices variants we have emulation paths
… this is partial support, we need some further work to have full support for these two emulations
… an issue opened against TFLite to query interest to implement these
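To illustrate what such an emulation has to compute, here is a hedged Python sketch of 2-D gatherElements semantics (for axis 0, `output[i][j] = data[indices[i][j]][j]`); the function name and 2-D restriction are illustrative assumptions, not the Chromium code:

```python
# Hypothetical sketch of 2-D gatherElements semantics.
# axis=0: pick the row per-element; axis=1: pick the column per-element.

def gather_elements_2d(data, indices, axis=0):
    rows, cols = len(indices), len(indices[0])
    if axis == 0:
        return [[data[indices[i][j]][j] for j in range(cols)]
                for i in range(rows)]
    return [[data[i][indices[i][j]] for j in range(cols)]
            for i in range(rows)]
```

A backend without a native op can decompose this into index arithmetic plus a flat gather, which is roughly what the constant-indices emulation paths mentioned above do.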
Austin: on CoreML, most of the Wave 3 ops are implemented
… some are entirely decompositions, e.g. LSTM is all decomp
… remaining Wave 3 ops are quantization ops, de/quantizeLinear and reverse
… some constraints in dequantizeLinear, the scale needs to be positive
… for the most part, in WebNN now, any input is valid as long as it is of correct shape and data type
… we don't inspect content of the input
… if garbage is passed, we validate; if negative values are passed in the scales tensor we need to decide what to do
… all these tensors are required to be constant by the backend
… I can file an issue about this
MLTensor specification updates
anssik: The MLTensor design in the explainer form is now being converted to specification prose, thanks Austin!
anssik: PR #787 landed that specified key methods related to MLTensor, while "hand-waving over much of the juicy details" (quoting Austin) :-)
<gb> MERGED Pull Request 787 Specify MLTensor (by a-sully)
anssik: PR #786 (linked in the agenda) was superseded by PR #795, which removed both the compute() method and the outdated MLContext requirements in a single PR
<gb> CLOSED Pull Request 786 Remove descriptions of outdated MLContext requirements (by a-sully)
<gb> MERGED Pull Request 795 Remove the MLContext.compute() method (by a-sully)
Austin: juicy details are about more detailed discussion on what timelines are, the PR defines behaviour, we don't specify what the timeline is
… there are some constraints on how operations can run when multiple graphs are at play
… planning how to best specify these, we could pull in some logic from IndexedDB spec
… methods should be good to go
Timelines
anssik: issue #529 is where the "juicy details" of MLContext's timeline are discussed
<gb> Issue 529 Specify WebNN timelines (by a-sully) [webgpu interop]
anssik: Josh and Zoltan noted the spec needs tightening in how it explains cross-thread, cross-process work
… HTML spec provides abstractions for:
in parallel ("on a background thread")
task source ("separate logically-different types of tasks")
task queue ("coalesce task sources within a given event loop")
anssik: WebGPU spec defines timelines that "clearly define both the order of operations, and which state is available to which operations":
Content timeline ("Associated with the execution of the Web script")
Device timeline ("Associated with the GPU device operations")
Queue timeline ("Associated with the execution of operations")
… these WebGPU timelines can be used (only?) for GPU device type?
… possible work items from this issue:
… - decide and define what new timelines are required
… - define behavior of the new timelines on implementations that involve multiple devices and timelines?
… - define interaction of the new timelines with WebGPU's timelines?
Austin: no need to define timelines in terms of WebGPU for this spec, we can probably go with "content" and "context" timelines for WebNN purposes
… big question in my mind, how handwavy we should/can be, we have these concepts such as task queues and task sources
… we want to interleave our timeline with WebGPU, some of these primitives are associated with web things e.g. execution contexts
… open question: can we use existing primitives, or be ultra-hand-wavy when we talk to WebGPU?
Rafael: I'm on board with Austin, no need to specify as in WebGPU, but we need a "device timeline" that runs separately from the CPU; that's true for 2D Canvas and WebGL/WebGPU too, e.g. draw is async
MikeW: the content timeline is on the web side, the queue is the timeline of the actual chip; in the WebNN case that'd be CPU, GPU, or NPU, separate execution paths
Austin: WebNN would not need these three distinctions because it is not tied to GPU so closely
MikeW: starting with two timelines for WebNN would be reasonable, in some cases we could see that not specifying this could introduce data races
<ningxin> microsoft/
<gb> Pull Request 67 Support iobinding for Whisper Base demo on WebNN GPU (by miaobin)
ningxin: want to share a use case for MLTensor in WebNN Whisper demo
… MLTensor was used to optimize Whisper demo with good speedup on GPU, up to 50% inference perf improvement
Non-fatal errors
anssi: issue #778
<gb> Issue 778 Proposal: Report non-fatal errors from the WebNN timeline (by a-sully) [feature request]
anssi: proposal to report non-fatal errors from the WebNN timeline
… includes problem statement, current state, observations, proposal with tentative IDL provided, thanks Austin!
… thumbs up from Ningxin and Reilly
… good insights from Bryan and Rafael on what does WebGPU do to resources involved in non-fatal errors
… also a suggestion from Rafael to introduce an "errorScope" a la WebGPU or error queue with errors referring to labeled objects
Austin: thank you for the feedback Rafael and Ningxin!
… given how WebGPU handles errors, I'm wondering whether, instead of cascading errors on tensors as proposed, creating a promise that resolves when a graph is destroyed gets us far enough?
… so that we would not need to do cascading errors on MLTensor?
… are we fine starting with this similar approach?
Rafael: a simple solution would be a promise on the graph, and if invalidated the graph could not be used for anything?
… and we could do that for various backends?
Austin: potential for false positives there, if you fail to dispatch usually that's the right thing to do, consistency for developers
Rafael: that should be fine for now, what advice to give when that happens? Try another model?
Austin: no good solution, in such a case the graph is invalid
… unfortunately can just tell developers "try something else"
Rafael: which platforms see this?
Austin: TFLite and CoreML, TFLite errors are implementation gaps
… if you don't clamp indices it is a runtime exception
… CoreML has other instances, e.g. negative scales throw a runtime exception
… trying to allocate some buffers while dispatching, we lose the context
… should lose the graph instead
Rafael: do we have bugs filed against CoreML?
Austin: we can file more of them
ningxin: question to Austin, regarding non-fatal errors for values in tensors, can we introduce clamping for indices?
Austin: I agree there are some cases in CoreML and TFLite backend where we don't clamp indices and that causes runtime error, implementation issue that'll be fixed
… a class of errors in the "State of the world" section, class 2 errors, you're trying to allocate resources and that fails
… or you compile the model and clear site data that may blow away the representation of the model stored on disk
… both those errors are implementation issues
ningxin: thanks, that makes sense
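One possible clamping approach from the exchange above, sketched as a hypothetical Python helper (not the actual backend fix): clamp each index into the valid range before gathering, so out-of-range indices degrade gracefully instead of raising a runtime error.

```python
# Hypothetical sketch: clamp out-of-range gather indices into [0, size-1]
# instead of letting the backend raise a runtime exception.

def clamp_indices(indices, size):
    return [min(max(i, 0), size - 1) for i in indices]

def safe_gather(data, indices):
    return [data[i] for i in clamp_indices(indices, len(data))]
```

Here `safe_gather([10, 20, 30], [-1, 5])` returns the first and last elements rather than failing.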
Austin: next steps are to respond to Rafael's latest comment and update the proposal to note we start by adding a promise to the graph and rejecting the promise if the MLGraph becomes invalidated
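A toy asyncio model of that idea, where the graph carries a promise-like future that rejects when the graph is invalidated. All names here are hypothetical sketches of the behaviour under discussion, not proposed WebNN IDL:

```python
import asyncio

class MockGraph:
    """Toy stand-in for an MLGraph carrying an 'invalidated' promise."""
    def __init__(self):
        self.valid = True
        # future that rejects (gets an exception) on invalidation
        self.invalidated = asyncio.get_running_loop().create_future()

    def dispatch(self):
        if not self.valid:
            raise RuntimeError("graph is invalid")
        return "ok"

    def invalidate(self, reason):
        if self.valid:
            self.valid = False
            self.invalidated.set_exception(RuntimeError(reason))

async def demo():
    graph = MockGraph()
    assert graph.dispatch() == "ok"
    graph.invalidate("backend lost the compiled model")
    try:
        await graph.invalidated  # rejects with the invalidation reason
    except RuntimeError as e:
        return str(e)
```

Running `asyncio.run(demo())` surfaces the invalidation reason to the caller without needing per-tensor error cascading.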
WebML Community Group new deliverables kick off
Repository: webmachinelearning/charter
anssik: The WebML Community Group charter review ended with explicit support and no objections. Minor editorial changes were suggested, all addressed.
anssik: Summary of changes:
… - Refresh Goals
… - Add Task-specific APIs (Translator and Language Detector APIs, Writing Assistance APIs) and Prompt API to Deliverables
… - Note WebNN has graduated to the WG
… - Stylistic tweaks
… since new Task-specific APIs and Prompt API are the most significant changes, I've asked Etienne Noel from Google to present the latest on these new APIs and to give us a refresher on the topic
… and hear additional updates on new ideas, the launch status of specific APIs currently in Origin Trial, and progress made since last update at TPAC.
… and have some time for questions at the end
https://
explainers-by-googlers/
Slideset: built-in-apis-slides
[slide 1]
[slide 2]
[slide 3]
Etienne: built-in APIs complementary to WebNN
[slide 4]
[slide 5]
Etienne: strong interest from developers
[slide 6]
Etienne: some developers want high-level APIs for ease of use
… sharing models across origins is a hard problem
[slide 7]
Etienne: Prompt API is complicated, hard to standardize
… OT only for extensions for Prompt API
Etienne: why not Prompt API? Standardizing it is hard
… task-specific APIs easier to standardize
… for some use cases we cannot have a specific API
… want to provide cohesive unified APIs
… open questions: how to support all the devices on the web?
… in the future this will change but now limited
… proposed solution, can be hybrid, client-side or cloud-based
… privacy and fingerprinting, mitigations using permission prompts so that users stay in control
… interoperability, quality, i18n considerations
… next steps
… want to discuss these proposals with group participants and other browser vendors to shape this area
christianliebel: language and translation use different models?
Etienne: correct, on Android using the same models you'd use when offline, will consider LLM later
christianliebel: Writer and Rewriter?
Etienne: using Gemini Nano for those at this time
Joshua_Lochner: wanted to ask about using custom or more up-to-date models, e.g. the latest LLaMA is released and gets versioned in the API while the previous one becomes outdated; users want to use the latest, some users don't want to upgrade, or want to fine-tune their own
… any thought how on how to version the models?
Etienne: that is definitely something we should discuss in this Community Group and see what it takes to support that in the model ecosystem for browsers
Joshua_Lochner: adapters, LoRAs, use cases for those would be amazing
… I'm working on adding support for these features too
explainers-by-googlers/
zkis: you need to adapt to device capabilities in model selection; how could you control which model to use in the Chrome version you use? there could be multiple versions, do you want to open up model management?
… have discussed in the CG, so wonder if that should be work for the CG?
Etienne: not opposed to the idea, should look at the use cases
Etienne: need to think privacy and fingerprinting impact
zkis: how to adapt to local devices
RafaelCintron: presentation says, "Prompt API is good, hard to standardize", do you want to have it for browsers?
Etienne: we want to standardize the other task-specific APIs first
… if strong need for Prompt API, we can adjust priorities
RafaelCintron: one thing we see with Prompt API is wild variations in what various implementations produce
Etienne: structured output for Prompt API is being experimented with as a solution to that problem
<etienne> Summarizer API: OT M131 to M136
<etienne> https://
<etienne> Rewriter API and Writer API: Hoping for OT soon.
<etienne> Translate API: OT M131 to M136
<etienne> https://
<etienne> Language Detector: OT M130 to M135
<etienne> https://
<etienne> Prompt API: Origin Trial for Extensions
<etienne> https://
Device selection abstractions
anssik: PR #784 explainer
<gb> Issue 784 not found
anssik: would like to check the group's readiness to converge on the proposal to allow progress toward a prototype early next year
… explainer documents 3 considered alternatives:
anssik: I heard the group prefers a minimal solution, does not want to over engineer
anssik: it looks like Option 1 is "the most MVP":
"Keep the current MLDeviceType as a context option, but improve the device type names and specify an algorithm for mapping these names to various real adaptors (with their given characteristics). However, this would be more limited than being able to specify device specific limits to context creation."
anssik: in addition, for all paths in the spec where there's an error returned or thrown in response to MLDeviceType, we'd change to a fallback path instead.
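A hedged Python sketch of what such a mapping-with-fallback algorithm might look like; the device names and fallback order are illustrative assumptions, not text from the explainer or spec:

```python
# Hypothetical sketch of Option 1: map a requested device-type name to an
# available adapter, falling back along a preference order instead of
# throwing when the requested device is unavailable.

FALLBACK_ORDER = {
    "npu": ["npu", "gpu", "cpu"],
    "gpu": ["gpu", "cpu"],
    "cpu": ["cpu"],
}

def select_device(requested, available):
    for candidate in FALLBACK_ORDER.get(requested, ["cpu"]):
        if candidate in available:
            return candidate
    return "cpu"  # final fallback; never error on an unavailable device
```

With this shape, requesting "npu" on a machine that only exposes GPU and CPU adapters silently falls back to "gpu" rather than failing context creation.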
Repository: webmachinelearning/webnn
feedback welcome via PR #784 for the device selection explainer
<gb> Pull Request 784 Add device selection explainer (WiP) (by zolkis)