W3C

– DRAFT –
WebML WG Teleconference – 13 February 2025

13 February 2025

Attendees

Present
Anssi_Kostiainen, Christian_Liebel, Dwayne_Robinson, Joshua_Bell, Joshua_Lochner, Markus, Markus_Handell, Michael_McCool, Mike_Wyrzykowski, Ningxin_Hu, Tarek_Ziade, Thomas_Steiner, Tomislav_Jovanovic, Zoltan_Kis
Regrets
-
Chair
Anssi
Scribe
Anssi, anssik

Meeting minutes

Repository: webmachinelearning/webnn

<Joshua_Lochner> If we have ~5 mins near the end of the call, I'd love to share some work I have done with benchmarking in Transformers.js! (largest-scale web benchmarking effort?)

Announcements

<Joshua_Lochner> Yes this benchmarking suite is made in collaboration with a bunch of different teams :)

WebML Working Group Charter Advisory Committee review started

anssik: on 5 Feb 2025 W3C Advisory Committee representatives received a proposal to review the draft WebML WG charter we prepared together

Announcement

anssik: W3C invites comments on the proposed charter from the AC through 2025-03-06
… relaying Dom's message, please make sure your Advisory Committee representative remembers to vote
… please ping Dom or me if you need help identifying your AC rep

WebML Community Group meetings to kick off

anssik: based on the scheduling poll results, we'll kick off the bi-weekly Community Group meetings on 25/26 Feb 2025
… on Tue 4-5 pm PST / Wed 00-01 am UTC / Wed 8-9 am CST / Wed 9-10 am JST
… continuing on a twice a month cadence, 11/12 Mar, 1/2 Apr, ... adjusting for holidays and breaks as appropriate
… I acknowledge this time is not great for EU-based participants
… so I'm establishing a work mode where we recap the most recent CG discussion in these WG meetings
… I've asked Etienne Noel to take on the role of scribe for the CG meetings; he will produce a summary to keep the entire group abreast of discussions and will help facilitate the CG meetings
… I will work with Etienne on the CG agendas with input from the group

<jsbell> Etienne can't make *this* meeting today, apologies. But yes, he's shared an early draft of the agenda with me, looks good!

anssik: as the group's composition evolves, we may adjust the meeting time
… I'm excited to kick off these new incubations and I already see broader community engagement for these proposals
… any questions?

ML in Web Extensions and Transformers.js Benchmarking

anssik: I'm pleased to welcome Tarek Ziade and Tomislav Jovanovic from Mozilla to talk about Firefox AI Runtime
… a new experimental API in Firefox Nightly for running offline ML tasks in web extensions

Firefox AI Runtime

<Tarek> my FOSDEM slides https://docs.google.com/presentation/d/1M38WbRtb9dHfKFPlg7K0MTJ2mOYEorX0uWTKDFjLrqE/edit#slide=id.g82761e80df_0_1948

anssik: the thrust of this discussion is to understand the emerging usage of ML in web extensions
… given its different security model compared to the open web, this API can provide features that would not be feasible as-is on the open web today, for privacy or security reasons
… I believe we can learn from this early experiment, and with further refinement some of these ideas could find their way to the open web and into the API we're defining here
… one such feature being experimented with in the web extensions context is model sharing across origins, which would also greatly benefit WebNN
… now I'll let Tarek and Tomislav share their story followed by a brief Q&A, timebox 10-15 mins

Tarek: we worked with Tomislav on this experiment
… thanks Anssi for a great summary
… I shared a link to FOSDEM presentation on the topic
… I will summarize the key takeaways from the FOSDEM presentation
… our project was to provide an inference API for offline usages surfaced to web extension developers
… to allow developers to experiment with the capability and understand what such an API could be used for
… we added the first inference-based feature to Firefox in 2019: translation using Bergamot and Marian NMT via Wasm, an RNN model
… a dedicated inference process runs the Wasm inference task every time the user asks for a web page to be translated
… a specific piece of this is that for some kernel operations that are slow in Wasm, we use Wasm built-ins implemented directly in Firefox using gemmology, which is faster than relying on the Wasm SIMD implementation
… beyond translation, we provide more features running in the browser, e.g. image-to-text and other Transformers.js tasks
… we picked Transformers.js because it is easy to use, close to Python Transformers, all the tasks are provided, and integration and experimentation are simple
… Joshua has done fantastic work on model coverage
… Transformers.js task-based API is easy to use
… Transformers.js in Firefox 133+
… we implemented a custom model cache that stores models in IndexedDB in a cross-origin manner
… we run our own inference process, so it is easier to secure
… we have an allow/deny filter to check that a model matches the allowlist
… currently it allows Mozilla's and Joshua's curated models
… we have a pdf.js feature to get alt text for an image, using an image-to-text model that is downloaded with pdf.js
… the web extensions AI API is behind a trial namespace, browser.trial.ml; the API can change or disappear at any time, to highlight that this is highly experimental
… documentation is available on our website that demonstrates how to wrap the calls
… the source tree has an example of a web extension that implements getting the alt text of an image on right click
… the code is pretty simple, two phases:
… create an engine and then run the inference
… a bunch of events can be triggered
… models are cached in IndexedDB
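
For illustration, a minimal sketch of those two phases; the option names (taskName, args), the onProgress event, and the "trialML" permission are assumptions based on the Firefox Nightly documentation and may change:

    // background.js of a web extension requesting the "trialML" permission (assumed name)
    // Optional: observe download/engine progress events
    browser.trial.ml.onProgress.addListener((progress) => console.log(progress));

    // Phase 1: create an engine for a task; the model is downloaded and cached in IndexedDB
    await browser.trial.ml.createEngine({ taskName: "image-to-text" });

    // Phase 2: run the inference
    const result = await browser.trial.ml.runEngine({
      args: ["https://example.com/photo.jpg"],
    });
    console.log(result);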

RafaelCintron: do you plan on making this API an official web standard, or will it be relegated to extensions?

Tarek: we don't yet know if this API would be useful; we want to use this experiment to first understand how people will use it and to solicit feedback from folks; if useful, we can propose it to browsers

<tomayac1> Shameless plug: https://blog.tomayac.com/2025/02/07/playing-with-ai-inference-in-firefox-web-extensions/

RafaelCintron: I'm asking because the API takes raw models from the internet; this group in the past experimented with the Model Loader API that takes a model as input, but did not progress with it due to the lack of an interoperable model format

Tarek: we abstract this away, we use a task-based abstraction that is model agnostic

RafaelCintron: which operators are available and what they do is another consideration

Tarek: there's a way to use a generic task, e.g. image-to-text, or if preferred, to specify the model by its id; what is unspecified is the schema for how to interface with models

ningxin: you can run a model on CPU using Wasm, or on GPU using WebGPU; how does the runtime decide which device the inference runs on?

Tarek: right now the default is to run everything on CPU
… GPU had certain limitations with quantized models, e.g. Firefox does not support fp16 on WebGPU, so int8 is used as the quantization level
… we don't restrict developers from setting the option; if the extension developer wants to use a specific GPU model, they can pass a device option, and a dtype to define the quantization
… as long as such a model is available on the hub
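
A sketch of the device and dtype options Tarek mentions, with the caveat that the exact option names and values are assumptions and the model id is only an example:

    // Explicitly request GPU execution and a quantization level instead of the CPU default
    await browser.trial.ml.createEngine({
      taskName: "image-to-text",
      modelId: "mozilla/distilvit",  // example model id
      device: "gpu",                 // assumed value; default is CPU (Wasm)
      dtype: "int8",                 // assumed quantization identifier
    });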

anssik: is there a fallback?

Tarek: you get an error if you explicitly asked for a specific model

<Joshua_Lochner> We're working on an "auto" device and dtype setting btw :)

<Joshua_Lochner> huggingface/transformers.js-benchmarking

Joshua_Lochner: I've been working on Transformers.js benchmarking
… a benchmarking suite for running these models across browsers, a collaboration with a team at Google, the ONNX Runtime team, and Tarek
… ideally the largest benchmarking suite for ONNX models; I've collected every single Transformers.js-compatible ONNX model on the HF Hub
… around 5100 models in total
… in the benchmarking suite, we define utility functions to stream models
… we identify the ops used and the number of downloads per month
… you can see how many ops are used, e.g. the Kokoro model uses the most ops currently
… it has some issues in Firefox and on Android; we hope this helps browser developers understand what models are popular
… you can run tasks using the benchmarking tool and select the device used: Wasm, WebGPU, with WebNN being added

anssik: thanks Joshua for the world reveal!

<Tarek> really nice work!

<DwayneR> Do you have an info column for which models are pure ONNX domain? I saw some that use unofficial contrib ops like SkipLayerNormalization, MatMulNBits.

Joshua_Lochner: not in the CSV file, but in packages/core/src/operators.js in the repo we have the list defined

<DwayneR> Thanks - answered.

<ningxin> The transformers.js ops would be super useful, thanks Joshua for sharing!

Tarek: great work, thanks Joshua!
… for some models, ONNX Runtime is optimizing the graph on the fly and uses other ops than those in the graph
… do you read the graph statically, or can you identify graph ops that are executed after the optimization step?

Joshua_Lochner: currently static, reading the header of the ONNX file until it reaches the boundary
… not doing optimization off the bat; many of the optimized Llama models do optimization ahead of time
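
A rough sketch of the static approach described, counting the operator types that appear in a graph, assuming the ONNX GraphProto has already been parsed into a plain object (the suite's actual streaming and parsing code lives in the repository linked above):

    // `graph` is assumed to be a parsed GraphProto-like object: { node: [{ opType, domain }, ...] }
    function countOps(graph) {
      const counts = new Map();
      for (const node of graph.node) {
        const key = node.domain ? `${node.domain}:${node.opType}` : node.opType;
        counts.set(key, (counts.get(key) ?? 0) + 1);
      }
      return counts;  // e.g. Map { "MatMul" => 96, "com.microsoft:SkipLayerNormalization" => 24, ... }
    }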

anssik: I'll invite you to a future meeting to give an update on this benchmarking topic

Device selection

Remove MLDeviceType

anssik: PR #809

<gb> Pull Request 809 Remove MLDeviceType (by zolkis) [device selection]

Device Selection Explainer

anssik: I asked the group to conduct a final review of the spec PR to remove MLDeviceType as outlined in the explainer
… thank you Zoltan for the PR and reviewers for your feedback and suggestions!
… summary of changes:
… - remove MLDeviceType and related prose
… - update "create a context" algorithm, removing any MLDeviceType references
… - be explicit MLContextOptions is a hint
… PR #809 closes issues #302 #350 #749

<gb> Issue 749 MLContextOptions.deviceType seems unnecessary outside of conformance testing (by mwyrzykowski) [device selection]

<gb> Issue 350 Need to understand how WebNN supports implementation that involves multiple devices and timelines (by wchao1115) [question] [webgpu interop]

<gb> CLOSED Issue 302 API simplification: context types, context options, createContext() (by zolkis) [device selection]
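
For reference, a sketch of context creation after this change: only the remaining MLContextOptions hint (powerPreference in the current draft) is passed, and the implementation picks the actual device(s):

    // No deviceType: the power-preference option is just a hint, not a device request.
    const context = await navigator.ml.createContext({ powerPreference: "high-performance" });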

anssik: the query mechanism design was spun off into a separate issue, and will be discussed next
… this split allows us to address the device selection problem space in a piecemeal fashion
… the PR has been reviewed and approved, we are ready to merge it, thanks all!
… any questions or comments?

Zoltan: this is the first step; the next step is to work on Apple's proposal for opLimits, and in between came the new use case for a query mechanism
… it is good to merge this PR now and work on the query mechanism and opLimits on top of it

anssik: any concerns in merging the PR?

Mike_W: OK to merge

anssik: we can proceed to merge PR #809

<gb> Pull Request 809 Remove MLDeviceType (by zolkis) [device selection]

Query mechanism

anssik: issue #815

<gb> Issue 815 Post-context creation query mechanism for supported devices (by anssiko) [device selection]

anssik: I spun off this issue from PR #809
… the premise of this issue is to enable querying the context for what kind of devices the context actually supports rather than prescribing a preferred device type
… I put forward a simple API proposal as a discussion starter
… first, I think it is important to understand the use cases for such a feature
… then, second, we can compress the possible solution space by what can be implemented in an interoperable and future-proof manner
… third, the more specific the information we expose through the API, the bigger the fingerprint, and the greater the conceptual weight affecting developer ergonomics
… it is also good to remember that sometimes the best solution is no new API surface at all, i.e. let the implementation decide
… to balance that extreme, I also hear we cannot expect an implementation to handle everything and some sort of query mechanism would be warranted
… now, it is up to discussion what is the right level of abstraction
… I'd ask us to think about use cases first, implementability second, and the API shape third
… the initial use cases:
… - allow the web app to select a model optimized for a device type ahead of compile time
… - allow the web app to fall back to WebGPU shaders if "high performance" WebNN is not available
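
To make the discussion concrete, one hypothetical shape such a query could take; every name below (supportedDevices, loadModel, the file names) is invented purely for illustration, and the actual proposal is in issue #815:

    const context = await navigator.ml.createContext({ powerPreference: "high-performance" });
    // Hypothetical post-creation query: choose which model variant to download.
    if (context.supportedDevices?.includes("npu") || context.supportedDevices?.includes("gpu")) {
      await loadModel("model-fp16.onnx");   // placeholder helper and asset names
    } else {
      await loadModel("model-fp32.onnx");   // CPU-only fallback
    }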

Markus: one use case: imagine real-time processing in an app, e.g. background blur, and we have an advanced model that runs with acceptable performance on Apple Silicon or Intel, but there's something in the model that inhibits execution on the NPU and we fall back to CPU
… experimenting locally with WebNN, there's a 10x difference between CPU and GPU execution in some test cases; in such cases we'd fall back to a lower-resolution model or use Wasm
… it is also important for us that the query mechanism is not slow, as it will have an impact on UX

anssik: should the query mechanism be available before the model is downloaded or compiled?

Markus: before downloading the model it would be good to know if e.g. only CPU is supported

anssik: I see this as a three-stage process:
… - 1) hints provided at context-creation time (I want a "high-performance" context)
… - 2) device availability after context creation ("does the context have a GPU?")
… - 3) device capabilities after compile ("can you actually run this model?")

Zoltan: there's also 0) pre-context creation query

RafaelCintron: question to Markus, you said you experimented with WebNN with HW acceleration and fallback to CPU had a performance impact

Markus: there was some bug with the Core ML backend that caused a fallback to CPU

anssik: sounds like an implementation bug

RafaelCintron: if WebNN is available, it should always do as good a job as WebGPU

<DwayneR> Model data type (like float16) can make a big perf difference, and so knowing before model download is useful.

<DwayneR> float32: runs on both GPU and CPU, but not on many NPUs.

<DwayneR> float16: works on NPU. On GPU it offers a ~2x boost over float32, but it is much slower on CPU because CPUs lack dedicated float16 hardware.

<DwayneR> int8: works well on CPU and NPU, but GPUs suffer.

RafaelCintron: we should continue to improve the WebNN implementation to ensure that is the case

Zoltan: so the use case is to limit fallbacks?

Markus: yes, failing with an error would be better

Mike: I want to say that from Apple's perspective there is support for a query mechanism to inform which models to download, but we think the names "gpu" and "npu" don't add helpful information
… a query mechanism should be based on capabilities and not device names

Zoltan: it is possible to specify a generic mechanism

anssik: it feels like a future-proof solution might be to provide a high-level abstraction that stands the test of time
… can we come up with an expressive enough set of high-level future-proof hints that allow meaningful mapping to concrete devices?
… instead of hard-coding the hints-to-device mapping, I'd envision an implementation would create a "symbolic link" from (a set of) these hints to concrete device(s) that would be maintained, and the mapping could evolve over time
… hints such as "high-performance" and imaginary "full-precision" together might map to a GPU
… in a future version of the same implementation, or on another system, the same hints could map to a different device
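
A conceptual illustration of that "symbolic link" idea; the hint names and the mapping itself are hypothetical and would be an implementation detail, free to evolve across releases and systems:

    // Implementation-internal mapping from hint combinations to devices, not web-exposed.
    const hintsToDevices = new Map([
      ["high-performance,full-precision", ["gpu"]],
      ["high-performance", ["npu", "gpu"]],
      ["low-power", ["npu", "cpu"]],
    ]);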

Minutes manually created (not a transcript), formatted by scribe.perl version 242 (Fri Dec 20 18:32:17 2024 UTC).

Diagnostics

Maybe present: anssik, Mike, Mike_W, ningxin, RafaelCintron, Tarek, Zoltan

All speakers: anssik, Joshua_Lochner, Markus, Mike, Mike_W, ningxin, RafaelCintron, Tarek, Zoltan

Active on IRC: anssik, DwayneR, handellm, Joshua_Lochner, jsbell, Mike_W, ningxin, RafaelCintron, Tarek, tomayac1, zkis