W3C

– DRAFT –
WebML WG Teleconference – 7 March 2024

07 March 2024

Attendees

Present
Anssi_Kostiainen, Bryan_Bernhart, Deepti_Gandluri, Dwayne_Robinson, Etienne_Noel, Geoff_Gustafson, Ilya_Rezvov, Joshua_Bell, Joshua_Lochner, Michael_McCool, Rachel_Yager, Rafael_Cintron, Sudeep_Divakaran, Zoltan_Kis
Regrets
Ningxin_Hu
Chair
Anssi
Scribe
Anssi, anssik

Meeting minutes

Repository: webmachinelearning/webnn

anssik: Welcome to a new participant, Ilya Rezvov from Google

Ilya: I work at Google, mostly on Wasm; recently I've been looking at ML-related efforts, fp16 in Wasm specifically, and I want to extend my interests to other areas of ML on the web

Call for Consensus: WebNN API CR Snapshot

anssik: On 28 Feb 2024 we issued a Call for Consensus (CfC) to publish the Web Neural Network API as a new Candidate Recommendation Snapshot (CRS).

-> CfC to publish WebNN API Candidate Recommendation Snapshot - review by 7 Mar 2024

https://lists.w3.org/Archives/Public/public-webmachinelearning-wg/2024Feb/0006.html
anssik: the WG has received explicit support for this CfC and no concerns have been raised
… our CR readiness tracker #240 shows all green except for the WIP TAG delta review
… the TAG delta review in flight may raise some questions at transition time, so we should be prepared for that
… that said, I hope we can address this in flight by noting that this publication in fact explicitly addresses earlier TAG review feedback by removing support for synchronous execution
… I will handle CRS transition logistics with Dom and will ask the WG for further information as needed
… we can still merge the currently open PRs before branching for this release; it is important that after each commit the spec remains in a cohesive state
… transition request processing is expected to take a week, so the earliest publication date would be in the week of 18-22 March 2024

<gb> Issue 240 Candidate Recommendation readiness tracker (by anssiko) [process]

Hybrid AI exploration

anssik: As you recall, we have a sister WebML Community Group responsible for incubating new ideas for future work; the CG works closely with this WG and shares many participants
… the WebML CG has received a new proposal called "Hybrid AI exploration":

webmachinelearning/proposals#5

<gb> Issue 5 Hybrid AI Exploration (by grgustaf)

anssik: so I wanted to invite WebML CG participants Michael, Geoff, Sudeep to present this proposal to solicit input from this WG and inform the direction this exploration should take
… the timebox is ~20 minutes including Q&A
… any concrete proposals from this exploration that may impact Web APIs are expected to be incubated in an applicable Community Group first

Slideset: https://lists.w3.org/Archives/Public/www-archive/2024Mar/att-0000/WebML.Discussion.-.Hybrid.AI.for.the.Web.-.Slides.pdf

[Slide 1]

Michael: Proposal is titled "Hybrid AI for the Web", probably a bit mistitled, we're looking at general model management too
… started work to improve the fit of WebNN on the client and want to make sure we look at the right problems; we're not proposing solutions at this stage

[Slide 2]

Michael: first going through the general status as we understand it, specific issues, goals and requirements we see, prioritization, closing with questions for this group

[Slide 3]

Michael: looked at WebNN use cases, client AI execution clearly in focus
… we found some problems, e.g. language translation requires large models with long download times, and we need to figure out client capabilities
… startup time may be significant
… if we have two different web sites using a model, we need to download the model twice
… clients vary in capabilities and over time; clients grow rapidly in performance and we want to avoid a least-common-denominator approach

[Slide 4]

Michael: Specific issues, three broader categories:
… 1) Model Management
… - Large models cannot be reused across origins
… - Model storage and management opaque to the user
… - Cache eviction may not match user preferences
… 2) Elasticity through Hybrid AI
… - Distributing work between client and server
… - Difficult to predict performance on a client
… - Sharing detailed client capabilities a privacy risk
… (noting possible overlap with PWA caching mechanisms)
… 3) User Experience
… - Privacy behaviour unclear, may not match user preferences
… - Managing latency of model downloads

[Slide 5]

Michael: Goals and Requirements
… Maximize ease of use for the end user
… - minimize load times and meet latency targets
… Portability and elasticity
… - minimize costs, support varying client capabilities, adapt based on resource availability
… Data privacy
… - Personal and business data, support user choice and control
… last but not least, developer ease of use and consistency

[Slide 6]

Michael: Questions for Discussion
… How to:
… handle model download latency and storage?
… match model requirements to client capabilities?
… choose among model fidelity levels?
… support progressive transmission of models?
… partition single models, support separate models, both?
… Questions to the group:
… - what should be the priorities?
… - Specific use cases for Hybrid AI?

[Slide 7]

Michael: Proposed Next Steps
… 1) Make sure we solve the right problem
… We welcome your feedback via the GH issue submitted to the proposals repo:

webmachinelearning/proposals#5

<gb> Issue 5 Hybrid AI Exploration (by grgustaf)

Michael: 2) Build a prototype implementation
… e.g. using the Model Loader API from the CG as a basis; we have some ideas to test

Model Loader API

Michael: 3) Bring results back to the group to discuss further

RafaelCintron: is there a solution in mind?

Michael: a caching strategy for the computational graph, and negotiating model requirements and client capabilities, are a few ideas

anssik: we have discussed Storage APIs for caching large models in this group earlier

anssik: I recall Joshua Lochner has prototyped solutions to cross-site sharing of models with a browser extension

Joshua_Lochner: from the Transformers.js perspective we're fortunate that the browser caching API is pretty performant; we can cache a 1.5B-parameter model, refresh the page, and it loads from the cache
… however, the issue emerges when I go to another web site on a different origin; my extension idea works, but it requires the user to download a random extension, extra effort for the user, and it's not a standards-based feature
… for the size issue, I'm focused on smaller models that can perform in a Wasm environment, soon WebGPU
… storage and issues related to exceptionally big models have not been the main focus; 50M to 250M parameters has been the sweet spot for Transformers.js
… due to the cache issues
… the main API is the Web Cache API; models are loaded as HTTP request-response pairs from the Hugging Face Hub
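As a rough illustrative sketch of the pattern Joshua describes (not code from the meeting; the cache name and model URL below are hypothetical placeholders), loading a model file through the Cache API might look like:

  // Load a model file, serving it from the Cache API when available.
  // "transformers-cache" and the URL are illustrative placeholders only.
  async function loadModelFile(url) {
    const cache = await caches.open("transformers-cache");
    let response = await cache.match(url);
    if (!response) {
      response = await fetch(url);
      // Store a copy so later page loads on this origin skip the download.
      await cache.put(url, response.clone());
    }
    return response.arrayBuffer();
  }

  const weights = await loadModelFile("https://example.org/models/model.onnx");

Because the cache is partitioned by origin, a different site still re-downloads the same model, which is the cross-origin sharing gap noted above.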

Michael: are you caching serializations of the models, ONNX files?
… any ideas re adapters?

Joshua_Lochner: caching serializations, single .onnx files; in the future the graph and the weights will be in separate files
… for adapters, I haven't thought about it yet
… ONNX RT is rather limited in this sense; if we want to use an adapter we need to export the whole model
… MMS text-to-speech model was one example where an adapter is at the end
… what we have been able to do is split up the models, which only works if the backend is identical, e.g. for a text-gen model, chop off the head
… that's one way to share weights, not exactly the way you're proposing

Michael: the topology is the smaller chunk; we're concerned with weight caching at this stage
… progressive transmission raises questions about the API if it's read-only; the best option now is to create zero-weight nodes and add the weights later
… to progressively enhance the model we need to build it from scratch

Michael: the prototype will likely be similar to the Hugging Face solution; for good UX this may need some standardization work in the future

Michael: what use cases can make use of different levels of fidelity?
… big-little mapping to server-client; any specific use cases for Hybrid AI?

Joshua_Lochner: I guess some form of a personalization model that learns things over time, continuous training, a model where you can update the weights over time, private personalization preference learning model
… e.g. you're on Twitter and want to block certain things
… you're probably referring to much larger things, multiple LoRAs on top of Llama

Michael: fine-tuning on the client, splitting out the LoRAs so we can select them; a lot of these optimizations are relevant mainly to big models, smaller models are fast to download as is

Joshua_Lochner: another use case, some form of underlying embeddings adapting, speech-to-text, text-to-speech, base model stays the same

webmachinelearning/proposals#5

<gb> Issue 5 Hybrid AI Exploration (by grgustaf)

Open issues and PRs

anssik: as usual, let's discuss open issues and review PRs based on your feedback

All open issues

All pull requests

Triage guidance

Core operator set

<zkis> webmachinelearning/webnn#573

<gb> Issue 573 Core operator set (by philloooo) [question] [opset]

anssik: issue #573

<gb> Issue 573 Core operator set (by philloooo) [question] [opset]

jsbell: we're working on an additional contribution
… in principle, if there are emerging standards from the ecosystem, we should look at them as well to inform our work, and describe them separately

Consider using label to allow better error handling for async errors

anssik: issue #585

<gb> Issue 585 Consider using `label` to allow better error handling for async errors. (by philloooo) [feature request]

jsbell: there's a difference between sync and async errors; in the initial implementation a lot of validation happens synchronously in the renderer process because XNNPACK is also running in that process
… sync errors can be moved earlier; async errors are hard for developers and also frameworks to handle
… thinking about how to report those errors with promise rejection: how do we know which node is responsible for the error?
… the proposed solution follows WebGPU's practice: define an MLObjectBase with a label field for MLOperand to extend from

WebGPUObjectBase.label

jsbell: when an async error is raised, the developer then has useful information about the reason
… it gets more interesting if decomposition or fusion is done
… Zoltan proposed we could auto-gen these labels
… we're interested in any feedback
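To make the shape of the proposal concrete, here is a hypothetical sketch of developer-assigned labels, modeled on WebGPU's label attribute; the exact API is still under discussion in issue #585 and none of this is in the spec yet:

  // Hypothetical: assumes MLOperand gains a writable "label" attribute,
  // and that "context" is an MLContext from navigator.ml.createContext().
  const builder = new MLGraphBuilder(context);
  const a = builder.input("a", {dataType: "float32", dimensions: [2, 2]});
  const b = builder.input("b", {dataType: "float32", dimensions: [2, 2]});
  const c = builder.gemm(a, b);
  c.label = "gemm_1"; // surfaced in later async error messages

  try {
    const graph = await builder.build({c});
  } catch (e) {
    // The rejection could then name "gemm_1", identifying the failing node.
    console.error(e.message);
  }

Auto-generated labels, as Zoltan proposed, could fill this in when the developer does not set one.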

zkis: I just agree with the latest comment from Josh

RafaelCintron: wanted to say I'm in favour of labels and sync
… I was also a proponent of .label in WebGPU

anssik: jsbell, are Google folks interested in implementing this?

jsbell: yes
… an open question: if a developer calls build() and, while the async build is happening, code modifies the label of an operand, does the build step snapshot all the labels?

Dwayne: for debugging this is very helpful
… what is the format of labels? it would be helpful to raise the errors sooner rather than later; I wonder if you can know all the backend capabilities early and do pre-validation; overall I like the idea

zkis: we could keep generated labels separate from user-provided labels
… to be discussed in the issue

Rename inputSize variables as inputRank in algorithms

anssik: issue #588

<gb> Issue 588 Rename inputSize variables as inputRank in algorithms (by inexorabletash) [conventions]

jsbell: this is very simple, comments welcome

Consider alternate styling/linking for method argument definitions

anssik: issue #574

<gb> Issue 574 Consider alternate styling/linking for method argument definitions (by inexorabletash) [question] [editorial] [conventions]

anssik: question regarding styling method arguments with three alternatives:
… Alternative 1: Make args into definitions
… Alternative 2: Style as definition list
… Alternative 3: Auto-generated table

Minutes manually created (not a transcript), formatted by scribe.perl version 221 (Fri Jul 21 14:01:30 2023 UTC).

Maybe present: anssik, Dwayne, Ilya, jsbell, Michael, RafaelCintron, zkis

All speakers: anssik, Dwayne, Ilya, Joshua_Lochner, jsbell, Michael, RafaelCintron, zkis

Active on IRC: anssik, Joshua_Lochner, RafaelCintron, zkis